Digital Geography

12. December 2016

FME and Talend: ETL tools for your spatial data

Spatial data has some formats most of you know (shapefile, geoJSON, TiFF,…). But there is always a wide variety not only in available and used file formats but also in structure of data and formats of the data itself (numbers as text, different separators, etc.). To manage this variety and support the work with a standardized data set most companies use so-called ETL tools to Extract, Transform and Load data. In this article I would like to present you two of them: FME from Safe Software and Talend with Spatial Extension.

What’s it about

As I already used ETL tools in large-scale IT environments at banks I was interested in “What can I do with them in terms of spatial data?”. As I searched the web I found those two candidates: FME and the open source tool talend. Both are somehow different but I wanted to solve two problems:
  1. safe an excel file in three different spatial formats (shapefile, KML and geoJSON)
  2. load data from a google spreadsheet, transform it into a geoJSON and put it onto a server using ftp
So check my little examples.

FME

FME is a product from Safe Software and is somehow the flagship when it comes to ETL processes, data manipulation and process monitoring. The installation is a no-brainer but this does definitely not come for free. An indicator for the price is set at the cloud provider. Despite this price: If you need software like FME for your studies, just drop them a line as they have an academic grant program. For short-term tests they offer a 30 day trial. The software itself can be used in three-way: Desktop version, Server version or Cloud version. You’ll get support for over 350 file formats. The first task was done quite easy as you simply load the excel with a file reader process and attach three different file writers to it.

example view of a a process after run of FME

As you simply need to define the lat/lon columns everything else is done by the program. Take care of string formatting a bit and decide which columns you really need and off you go. One read process, three writers. clean and simple.

The help section of this software is really great and you will find plenty of video tutorials for this baby made by FME.

For the second task I needed some more magic from the so-called Transformer Gallery. There all the fancy processes and tasks are stored like string manipulation, spatial queries and even an IP geocoder or the What3WordsEncoder. After choosing the Google Sheet reader I needed to replace some strings and format my lat/lon columns and defined the lat/lon columns for the Geometry. To work on with it I needed to store it on my hdd. I loaded it back again from this file and saved it again as geoJSON. That was the easy task. The hard one was to get it on the server so a potential webmap application could work with the data. Therefore I needed to use the System Caller Transformer. The normal FTPCaller was not working with our sftp digital-geography server (…). I ended up with calling WinSCP on my Windows PC with some custom commands. Furthermore I needed to start this last process with a so-called creator. I was not able to find out to start a follow-up process if the “storing process” was successful or not. In the end my project looks quite clean:

all processes in one project

The whole thing took me approx. 6 seconds to run. If you want to run tasks periodically you should take a look at FME Server

I like the clean interface and the supporting docs quite a lot and the Data Inspector will support your daily work a lot.

Data Inspector in FME

Off course I was not able to explore the whole power of FME but I think it is a powerful tool. I will definitely play around with it for a while.

Talend Spatial

Talend is a java based platform independent data quality and integration tool. It is somehow “open source”. I was using talend’s Data Integration Tool to work on my problems. As this is a Java application make sure you have a proper JDK installation. Then there is no installation of talend at all as it runs in this Java sandbox… The core of talend is not designed to work with spatial data. But there is this nice github page for a spatial extension. Together with the forum it is a good starting point.

The startup of the program takes a while and the interface is quite slow on my laptop (Core i7, 8GB, SSD) and the suggested download of additional external libraries failed often. I skipped this as talend tells from a look at you processes if a library is missing and asks you to install it prior running your project. The installation of the spatial components is done in 5 minutes:

Once I got set up everything I found out that I need to load the xls input file for my conversion three times…

save xls to shapefile, kml and eojson in Talend

As it is done it FME I also need to define each column in terms of the format but to define the geometry I needed to embed this process s2DPointReplacer. Not very intuitive compared to FME but maybe it’s the burden of the multi-purpose tool.

Talend Spatial project for ETL processes

The other process of using a google spreadsheet for upload to a server as geoJSON was easier compared to FME in terms of running a FTP task but I was too lazy to figure out how to connect with GDRIVE, get the file and so on. I simply made a filefetch of the published google spreadsheet, stored it locally, made the decimal separator replacement and the geometry extraction. The whole process was finished after 3-4 seconds. So it was a bit faster than FME ;-).

I like the multipurpose approach of the tool but the program was a bit buggy (changing language to only English made it crash) and a bit slow from the interface. Nevertheless it is a good alternative if you need to take care of multiple also non spatial ETL tasks.

For script automation: There is a great article in the talend help center.

Other Software

Of course there are alternatives: There is also a list/comparison/test on stackexchange, a list of open source ETL tools. And of course the model builder in ArcGIS and QGIS are quite good alternatives to some external programs.
  • Thorsten Reitz

    In case you want to add some tools to the “Other Software” section, there is also hale studio open source. It’s designed specifically for work with complex data structures and fully supports standards like GML, CityGML and INSPIRE, but also databases and services.