Digital Geography

18 May 2015

How to organize geodata storage?

Over the years I’ve learned one thing: the only way to run geodata projects efficiently is to keep a clean data structure.
In this post I’ll show you some tools and share my thoughts on organizing (geo)data for more efficient project workflows. Afterwards I would be very interested in your solutions and approaches to smart data projects.



1. Think about data types, providers and/or topics: local geodata storage

At the beginning of a small project I usually make one decision: the main criterion for storing my project files. Depending on the main project goals, I either segment the data by data type (e.g. raster datasets, vector datasets, numeric datasets), store it separated by data provider (e.g. customer datasets, governmental datasets), or split the datasets into topic segments (e.g. background map, roads, rivers, POI datasets).
The result is a folder structure on my local workstation. Often, especially when a project grows, this technique runs the risk of messing up the given data structure.
Bigger projects should therefore be organized at the database level (e.g. PostGIS) to get a good overview, data-filtering possibilities and great indexing features that speed up data analyses.
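If you stick with the plain folder approach, it can help to script the skeleton so every project starts out identical. Here is a minimal Python sketch; the topic and stage names are just examples, not a fixed convention:

from pathlib import Path

# Illustrative project skeleton, organized by topic.
# Adjust the topic and stage names to your own project.
project = Path("my_geoproject")
topics = ["background_map", "roads", "rivers", "poi"]
stages = ["raw", "intermediate", "final"]

for topic in topics:
    for stage in stages:
        (project / topic / stage).mkdir(parents=True, exist_ok=True)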

2. Geodata databases

If you work on a professional level, sooner or later there is no getting around geodatabase solutions for storing your project geodata. If you’re totally new to this term, I suggest reading this short article from my dear colleague Riccardo: “Geodatabases: a little insight”. If you’re interested in setting up such a geodatabase, please try this tutorial, also from Riccardo: “OpenSource QGIS + PostGIS installation: ‘the Windows way’”.


PostGIS is “the” open source geodatabase extension for PostgreSQL


In general, geodatabases allow you to store geodata together with its geometry information in a database format, connect the data directly to your desktop GIS software and process the geodata directly on the database server. The last point is a very powerful one: with SQL queries, for example, you can select exactly the data you need directly from the source and save it as intermediate datasets. Of course you can also perform really smart requests and get final results straight out of the database.
From my point of view, working at the geodatabase level has many advantages. The biggest disadvantages are the steep learning curve for writing database queries and the up-front database configuration.
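To give you an idea of what “processing on the server” looks like, here is a minimal Python sketch that runs a PostGIS query through psycopg2. The connection details, the roads table and the geom column are made up for illustration:

import psycopg2

# Minimal sketch: select all roads within 1000 m of a point,
# letting the database server do the spatial work.
# Connection details, table and column names are illustrative.
conn = psycopg2.connect(dbname="gis", user="gisuser",
                        password="secret", host="localhost")

sql = """
    SELECT name, ST_AsGeoJSON(geom)
    FROM roads
    WHERE ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(13.40, 52.52), 4326)::geography,
        1000
    );
"""

with conn, conn.cursor() as cur:
    cur.execute(sql)
    for name, geojson in cur.fetchall():
        print(name, geojson)

conn.close()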

Here you go:
PostgreSQL with PostGIS
ESRI GIS Tools for Hadoop
more

3. Geodata clouds

After local storage and dedicated databases, cloud storage is the third way of storing geodata.

Online services like GeoCloud2 or QGIS Cloud allow you to upload spatial data, manage it, edit it online and combine it with fancy web techniques: JavaScript magic, the speed boost of a CDN (content delivery network) and so on.

Maybe this is a good way to collaborate with people who haven’t got any GIS skills but are familiar with manipulating data through web interfaces.
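Once a layer is published in such a service, it can usually be pulled straight into scripts or web pages over HTTP. Here is a small Python sketch of that idea; the URL is a placeholder, not a real endpoint:

import requests

# Hypothetical example: fetch a layer that a geodata cloud has
# published as GeoJSON. The URL below is a placeholder.
url = "https://example-geocloud.org/api/v1/layers/rivers.geojson"

response = requests.get(url, timeout=30)
response.raise_for_status()

collection = response.json()
print(f"Fetched {len(collection['features'])} features")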

Here you go:
QGIS Cloud
MapCentia GeoCloud2

4. So what?

  • What do you think about these three possibilities of storing geodata?
  • What’s your “best case” solution?
  • Which tools do you use for organizing geodata in projects?
  • How do you deal with participants without GIS-skills?

I’m very interested in your comments!

  • CzendaZdenda

I like databases. In the geo world I use this approach especially when I need to analyze/process data over a long period of time. For me it seems faster to get, for example, 100 rows from a database (with a well-designed db and query) than to load 100 files into memory.

  • Martin Sudmanns

Good summary of the current possibilities. However, the decision of how to store the data depends on many more factors than the size of the project or how “professionally” you are working (whatever that means). In my opinion the decision comes down to the nature of the data and the project.

A DBMS has advantages if the data is very large or needs to be joined very often. Other advantages are concurrent access by multiple users, finer granularity of access permissions, methods for preventing data loss, better handling of (automatic) inserts and/or updates of new data, and better performance when processing the data with the functions provided by the database.
Handling files seems more user-friendly than a database management system (DBMS), or maybe users are just more used to it, I don’t know. But, at least regarding vector data, if you have already set up a DBMS server, why not use it in smaller projects too? You can always easily export the data into a file-based format if you need to.
The largest disadvantage of a DBMS, in my opinion, is the lack of common standards. There is an SQL standard, but each DBMS comes with its own dialect. The effect is that it is not possible to share or transfer data between different DBMSs directly without additional software or temporary storage, e.g. in a file. Another disadvantage is the poor support for raster data, but that is probably just a matter of time, since there is a lot of development going on in this field.
Both approaches need the same discipline in data handling. You can mess up file-based storage, but you can also mess up a database.

You should also consider that the lifetime of the data often does not end with the project. In scientific projects you are often required to keep the data for a certain time, which means that long-term storage might also be a factor. Both systems have advantages and disadvantages here.

I did not consider the geodata cloud, since I do not really have experience with it, and in my opinion it is not an alternative for storing data and using it in your project. It seems to be an option if you want to share the results of your project. There are also many more things to consider regarding the security and safety of your data when you store it in an external system.

  • Dana Diotte

I’ve been using GeoGig as a version-control/storage solution for small projects. I also push to a remote repository on my personal server for backup. I like GeoGig as a version control system; however, it is limited in the data types it can store, and I mostly store small datasets. I haven’t tried it with large datasets yet. It also works with PostGIS.

    • Very interesting. Would you like to write a little GeoGig description for digital-geography.com?

  • mhoegh

Hi, I’m the creator of MapCentia GC2. One of the main reasons for creating GC2 was to flatten the learning curve of, among other things, PostGIS, so that “ordinary” GIS users can utilise the technology. GC2 is used today for both large and small projects. One example is a municipality which creates a PostgreSQL schema for each small project, in which all related data is stored.

Most organizations that use GC2 have their own local installation of the system.

  • Pitasouvlaki

These are good points, but things get somewhat complicated when dealing with the raster datasets of projects. A Landsat-8 OLI image might be used in combination with UAV imagery, cadastre maps and land cover maps for a given project, and other projects might use one or several of the same products. In addition, some products have processing levels that are important to track and search. So in my experience the effective storage and indexing of rasters is an issue.

Do you have any approaches for dealing with the storage of big raster datasets?

      • Pitasouvlaki

We are currently upgrading our SDI and spatial database. We work with ArcSDE and have found the mosaic dataset to generally work well for our needs. However, we found indexing and cataloguing to be the larger issue, especially when the nature of the projects is diverse. So the answer is that I am still experimenting to settle on a best-fit approach. It would be nice to see some robust open-source solutions on this matter.

        • Great. Yes, this would be my next question. Are there any known robust open source solutions for this approach?

  • Matt Wilkie

    A very important topic to discuss and think about.

For my part, the storage place, as in files or geodatabases, is in large part inconsequential or academic (not really, but to underscore the main point…). Our office has been using geospatial data for two decades. In that time we’ve tried a number of different organizational schemes, ranging from well thought out, meticulously planned and modelled to shoot-from-the-hip, just-get-the-job-of-the-moment-done. All of them — all — have been wrong. Wrong in the sense that each creates new friction points just as it resolves others. It’s a wicked problem.

    In the end the most important principles we’ve arrived at are:

– There is no permanent or long-term solution. Plan for change.

    – Keep distinct containers for DEVELOPMENT, SHARED and PRODUCTION (perhaps your containers have different names such as PERSONAL, WORKGROUP, ORGANIZATION, EXTERNAL).

    Don’t use dev data in a shared map, and certainly not in a production map. When you are tempted, take the extra time to promote it. You’ll be sorry when you don’t. (Yes I said when, not if.)

– Container type — shapefiles, CSV, file-gdb, geo-gdb, … — varies on a per-project basis and changes with time as software evolves. Don’t let format be the driver (unnecessarily). Ditto for software and platform.

– Metadata. Just do it. Even if all you can manage is a quickly scrawled, cryptic Readme.txt of input, output, purpose. What is the data about? Who was it generated for? What is it built on (sources)? What processing was applied? …to the extent of your time and ability. Just. Put. Something. Down!

– PDF. It is very true that PDF is where data goes to die, but at the same time it’s an important record of that moment in time. Always create a PDF of the final result (especially for maps). It will save thousands of reconstruction hours when projects break because of shifting infrastructure and software changes.

    – Give every deliverable a number, “MapID 2015.001.56a”, “DataID Foobaz v22 r5”, something, and make sure it ships with the product. Why? See next point.

– There is no such thing as a one-off map/data/product. We have a tendency to think “this request is so small, so easy, 15 minutes tops” and neglect to save a record as per the above. Resist.

    Somebody is going to come back in a week/month/year/decade and request a copy or ask for changes. Good thing you have a MapID and a PDF so you can whip it up in 5 minutes this time instead of doing it all over again, eh?

– Pick a system, any system, and stick with it for as long as feasible, all other things considered. Your staff and colleagues will appreciate knowing that for the years 2001-2006 there is always a readme.txt under “X:\ProjectABC\docs”, while from 2007-2014 it’s in “\\server\Maps\ProjectXYZ.docx” …

    If you go system-less, or change fitfully every few months searching for a better screwdriver (that doesn’t and can’t exist), that prior work will be ignored and the intellectual capital investment squandered. (Yes, even by you.)

So, yes, a very important topic to discuss and think about, and we might as well get comfortable with doing so, because we’re going to be doing it again next cycle. There is no solution, only responses that soothe the itch for a time. 😉