# Importing a large dataset with osm2pgsql?

My question is similar to “slow import via osm2pgsql to postgresql database” and “Optimizing osm2pgsql imports for OSM data”, but as we currently have serious problems importing a large set of OSM data, I am opening a new one.

What is the best way to import a large dataset (an OSM export of Europe) into a Postgres DB?

Our computer has 32 GB of RAM… so it could use all of that.

We tried a couple of parameter sets, but had no success… on the last try we used:

```shell
osm2pgsql -c -S /usr/share/osm2pgsql/default.style --slim -d osm-europe -U postgres -C 25000 europe-latest.osm.pbf
```

But we ran out of memory, even though our server has 32 GB of RAM available.

```
pending_ways failed: out of memory for query result (7)
Error occurred, cleaning up
```

How do we improve our import command?

Even if it takes longer… we need to have the data imported into our Postgres DB.

Would you recommend using an EC2 instance for the task, or should our setup work with different parameters?

Your computer should be fine for importing Europe.

Given your dataset size and computer, I'd recommend something like this:

I'm assuming that you have an 8-thread CPU; if not, adjust --number-processes.

You don't need 25 GB of RAM for cache with just Europe.

For Europe, flat nodes should be smaller and faster than in-DB storage of node positions.
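Putting that advice together, the command might look something like this (a sketch, not a drop-in recipe: the flat-nodes path, cache size and thread count are assumptions to adjust for your machine):

```shell
# Sketch: Europe import on a 32 GB, 8-thread machine (adjust paths/names).
# --flat-nodes keeps node positions in a file instead of the database,
# so a cache of ~2.5 GB is plenty for a Europe extract.
osm2pgsql --create --slim \
  --flat-nodes /var/lib/osm/flat-nodes.bin \
  --cache 2500 \
  --number-processes 8 \
  -S /usr/share/osm2pgsql/default.style \
  -d osm-europe -U postgres \
  europe-latest.osm.pbf
```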

If there are still problems, check that you have a version of osm2pgsql using the 64-bit ID space and if so, check your PostgreSQL settings. You could be filling up your disk. Try tuning your settings in postgresql.conf.

I imported a planet file on a 24 GB machine (Ubuntu Trusty) with the following:

```shell
bzcat planet-latest.osm.bz2 | osm2pgsql --verbose -U YourUser --flat-nodes flat-nodes --keep-coastlines --cache 24000 --hstore --hstore-add-index --tablespace-index pg_default --exclude-invalid-polygon --number-processes 6 --unlogged --cache-strategy dense --extra-attributes --slim -H localhost -d planetosm --style… /my.style planet-latest.osm.bz2
```

It took approximately 5 days, with the last half spent on the database side rather than the actual import.

I tuned Postgres with the following for the import:

```
autovacuum = off               # default: autovacuum = on
checkpoint_segments = 60       # default: 3 (in logfile segments, min 1, 16MB each)
maintenance_work_mem = 256MB   # default: 16MB (min 1MB)
work_mem = 256MB               # default: 1MB (min 64kB)
```

## OpenStreetMap issue

So, I downloaded the OSM Africa file from Geofabrik and used the Load OSM Data tool in the OpenStreetMap toolbox in ArcGIS. It has been 2 weeks now and the process is still running. Has anybody else had this issue, and is there a faster workaround?

It has been 2 weeks now, and the process is still running.

Loading the OSM Africa file into a PostGIS database takes less than 1 hour on my laptop.
I think the best way is to use PostGIS and keep only the features of interest by (1) modifying the osm2pgsql template, and/or (2) building a custom SQL query to get filtered shapefiles from the database.
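For option (2), once the data is in PostGIS, a filtered shapefile can be produced with ogr2ogr. This is a sketch: the database name and credentials are assumptions, and the table and columns are the osm2pgsql defaults.

```shell
# Sketch: export only road features from the osm2pgsql default schema
# (hypothetical database "osm_africa") into a shapefile.
ogr2ogr -f "ESRI Shapefile" roads.shp \
  PG:"dbname=osm_africa user=postgres" \
  -sql "SELECT osm_id, name, highway, way FROM planet_osm_line WHERE highway IS NOT NULL"
```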

Unfortunately there is no way for me to get PostGIS. Our sysadmin requirements are such that every piece of software usually goes through a 3-4 month approval process at a bare minimum. Could you possibly try it in Arc as well and see how much of a difference it makes? IDK why, but my computer is only running at about 5% processing.

Hey, Mapbox developer here! We work with both the full planet file and tons of small extracts. With the exception of small, city-sized extracts, I'm going to have to echo what everyone else has said here: you are going to need to import the data into Postgres/PostGIS. ArcGIS/QGIS can't handle loading the geometry and then creating indexes in memory, and will silently freeze up.

Just loaded Africa to give it a spin and it took around 1.2 hours on a clean box to fully load the base dataset and then start appending change files.

You mentioned below that you have a pretty restricted environment in terms of software acquisition. Could you by chance launch an AWS EC2 instance or a Digital Ocean droplet?

## geoparsepy: geoparsing with a local OpenStreetMap database

geoparsepy is a Python geoparsing library that will extract and disambiguate locations from text. It uses a local OpenStreetMap database which allows very high and unlimited geoparsing throughput, unlike approaches that use a third-party geocoding service (e.g. Google Geocoding API).

Geoparsing is based on named-entity matching against OpenStreetMap (OSM) locations. All locations with names that match tokens will be selected from a target text sentence. This results in a set of OSM locations, all with a common name or name variant, for each token in the text. Geoparsing includes the following features:

• token expansion using location name variants (i.e. OSM multi-lingual names, short names and acronyms)
• token expansion using location type variants (e.g. street, st.)
• token filtering of single-token location names against WordNet (non-nouns), language-specific stoplists and people's first names (nltk.corpus.names.words()) to reduce false-positive matches
• prefix checking when matching, in case a first name prefixes a location token(s), to avoid matching people's full names as locations (e.g. Victoria Derbyshire != Derbyshire)

Location disambiguation is the process of choosing which of a set of possible OSM locations, all with the same name, is the best match. Location disambiguation is based on an evidential approach, with evidential features detailed below in order of importance:

• token subsumption, rejecting smaller phrases over larger ones (e.g. 'New York' will prefer [New York, USA] to [York, UK])
• nearby parent region, preferring locations with a parent region also appearing within a semantic distance (e.g. 'New York in USA' will prefer [New York, USA] to [New York, BO, Sierra Leone])
• nearby locations, preferring locations with closeby or overlapping locations within a semantic distance (e.g. 'London St and Commercial Road' will select from road name choices with the same name based on spatial proximity)
• nearby geotag, preferring locations that are closeby or overlapping a geotag
• general before specific, rejecting locations with a higher admin level (or no admin level at all) compared to locations with a lower admin level (e.g. 'New York' will prefer [New York, USA] to [New York, BO, Sierra Leone])

Currently the following languages are supported:

• English, French, German, Italian, Portuguese, Russian, Ukrainian
• All other languages will work but there will be no language specific token expansion available

geoparsepy works with Python 3.7 and has been tested on Windows 10 and Ubuntu 18.04 LTS.

This geoparsing algorithm has a large memory footprint (e.g. 12 GB of RAM for global cities), with RAM size proportional to the number of cached locations, to maximize matching speed. It can be naively parallelized, with multiple geoparse processes loaded with different sets of locations and the geoparse results aggregated in a final process where location disambiguation is applied. This approach has been validated across an Apache Storm cluster.

The software is copyright 2020 University of Southampton, UK. It was created over a multi-year period under EU FP7 projects TRIDEC (258723), REVEAL (610928), InnovateUK project LPLP (104875) and ESRC project FloraGuard (ES/R003254/1). This software can only be used for research, education or evaluation purposes. A free commercial license is available on request to @soton.ac.uk. The University of Southampton is open to discussions regarding collaboration in future research projects relating to this work.

Feature suggestions and/or bug reports can be sent to @soton.ac.uk. We do not however offer any software support beyond the examples and API documentation already provided.

Middleton, S.E., Middleton, L., Modafferi, S. Real-time Crisis Mapping of Natural Disasters using Social Media, IEEE Intelligent Systems, vol. 29, no. 2, pp. 9-17, Mar.-Apr. 2014

Middleton, S.E., Kordopatis-Zilos, G., Papadopoulos, S., Kompatsiaris, Y. Location Extraction from Social Media: Geoparsing, Location Disambiguation, and Geotagging, ACM Transactions on Information Systems (TOIS) 36, 4, Article 40 (June 2018), 27 pages. DOI: https://doi.org/10.1145/3202662. Presented at SIGIR 2019

A benchmark geoparse dataset is also available for free from the University of Southampton on request via email to @soton.ac.uk.

geoparsepy documentation resources

geoparsepy example code on github

Python libs needed (earlier versions may be suitable but are untested)

Python libs: psycopg2 >= 2.8, nltk >= 3.4, numpy >= 1.18, shapely >= 1.6, setuptools >= 46, soton-corenlppy>=1.0

Database: PostgreSQL >= 11.3, PostGIS >= 2.5

For Linux deployments, the following is needed:

```shell
python3 -m pip install geoparsepy
```

Databases needed for geoparsing

Download the pre-processed UTF-8 encoded SQL table dumps from an OSM image dated Dec 2019. The SQL dump is a 1.2 GB tar/zip file created using pg_dump and compressed with the 7-Zip tool.

Connect to PostgreSQL and create the database with the required PostGIS and hstore extensions
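From the command line, that step might look like this (the database name here is an assumption; use whatever name the SQL dump expects):

```shell
# Sketch: create the target database and enable the extensions
# geoparsepy needs before restoring the SQL table dumps.
createdb -U postgres openstreetmap
psql -U postgres -d openstreetmap -c "CREATE EXTENSION postgis;"
psql -U postgres -d openstreetmap -c "CREATE EXTENSION hstore;"
```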

## Debian Science Geography packages

AVCE00 is a C library and a group of tools to make (binary) Arcinfo vector coverages appear as E00. They make it possible to read and write binary coverages as if they were E00 files.

Drawmap reads data in the Digital Elevation Model (DEM), Digital Line Graph (DLG), and Geographic Names Information System (GNIS) formats. Can also work with SDTS, NAD-83, WGS-84, GTOPO30 data.

Using the data in these files, drawmap can produce various kinds of customized maps, including shaded relief maps (with or without roads, streams, place names, and so on) and topographic maps (again, with or without additional features).

Outputs sun raster format, portable gray map, or pov format files.

E00compr is an ANSI C library that reads and writes E00 files containing compressed Arcinfo data. Both the "PARTIAL" and "FULL" compression levels are supported. E00 files are the vector import/export format for Arcinfo. They are plain ASCII text in a format intended for data exchange. ESRI considers the format proprietary, so this package may not be able to read every E00 file, since ESRI may change the format.

This package is useful for importing E00 files into the GRASS geographic information system.

It contains the command-line program e00conv, which takes an E00 file as input (compressed or not) and copies it to a new file with the requested compression level (NONE, PARTIAL or FULL). The library is not included at this stage of development.

The map data is fetched from a server on the net, and the client will display recent satellite images and map data.

GDAL (the Geospatial Data Abstraction Library) is a translator library for raw geospatial data formats. It presents a single abstract data model to the calling application for all supported formats. The OGR library (included in the GDAL tree) provides similar capabilities for simple-features vector data.

GDAL supports more than 40 popular data formats, including the commonly used ones (GeoTIFF, JPEG, PNG and others) as well as those used in GIS and remote-sensing software packages (ERDAS Imagine, ESRI Arc/Info, ENVI, PCI Geomatics). It also supports several remote-sensing and scientific data-distribution formats such as HDF, EOS FAST, NOAA L1B, NetCDF and FITS.

The OGR library supports popular vector data formats such as ESRI Shapefile, TIGER data, S57, MapInfo File, DGN, GML and others.

This package contains the utility programs based on the GDAL/OGR library, namely gdal_translate, gdalinfo, gdaladdo, gdalwarp, ogr2ogr, ogrinfo and ogrtindex.

GeoIP is a C library that makes it possible to find the country of origin of an IP address or hostname. It uses a file-based database.

This database simply contains IP blocks as keys and countries as values, and it should be more complete and accurate than using reverse DNS lookups.

This package provides the command-line utilities for resolving IP numbers using the GeoIP library.

This task sets up a system to serve as a GIS (geographic information system) workstation for processing geographical information and producing maps.

GMT (Generic Mapping Tools) is a collection of UNIX tools that allow users to manipulate (x,y) and (x,y,z) datasets (including filtering, trend fitting, gridding, projecting, etc.) and produce Encapsulated PostScript (EPS) illustrations ranging from simple x-y plots, through contour maps, to artificially illuminated surfaces and 3-D perspective views in black and white, grey tones, hachure patterns or 24-bit colour.

GMT supports many common map projections plus linear, log and power scaling, and comes with support for data such as coastlines, rivers and political boundaries.

Gosmore is an openstreetmap.org viewer and route finder that supports speech synthesis and fetching the current position from gpsd.

This package requires additional data files that can be downloaded freely from openstreetmap.org.

GPSBabel converts waypoints, tracks and routes from one format to another, whether that format is a common mapping format such as DeLorme or Streets and Trips, or a serial upload or download protocol for a GPS device such as those made by Garmin and Magellan.

GPSBabel supports dozens of data formats and is useful for tasks such as geocaching, mapping, or converting from one GPS device to another. Among the more interesting formats it supports are several GPS devices via a serial link, various PDA-based mapping programs, and various geocaching data formats.

## How’s the PhD going?

All of these are perfectly valid responses to the above question. They are more polite than “none of your business”, and all contain a grain of truth.

However, in the interest of the few (zero) who asked, I thought I might as well share some details here. Mainly motivated by the fact that in a couple of months I've managed to get two articles published. Which is kinda a big deal.

So, what’s been published then?

First, there is “Micro-tasking as a method for human assessment and quality control in a geospatial data import”, published in “Cartography and Geographic Information Science” (CaGIS). This article is based on Anne Sofie's Master's thesis, but has been substantially reworked in order to pass as a scientific article. The premise is quite simple: how can micro-tasking be used to aid in an import of geospatial data to e.g. OpenStreetMap? Or, as the abstract puts it:

Crowd-sourced geospatial data can often be enriched by importing open governmental datasets as long as they are up-to date and of good quality. Unfortunately, merging datasets is not straight forward. In the context of geospatial data, spatial overlaps pose a particular problem, as existing data may be overwritten when a naïve, automated import strategy is employed. For example: OpenStreetMap has imported over 100 open geospatial datasets, but the requirement for human assessment makes this a time-consuming process which requires experienced volunteers or training. In this paper, we propose a hybrid import workflow that combines algorithmic filtering with human assessment using the micro-tasking method. This enables human assessment without the need for complex tools or prior experience. Using an online experiment, we investigated how import speed and accuracy is affected by volunteer experience and partitioning of the micro-task. We conclude that micro-tasking is a viable method for massive quality assessment that does not require volunteers to have prior experience working with geospatial data.

What did I learn from this? Well, statistics is hard. And complicated. And KEEP ALL YOUR DATA! And the review process is designed to drain the life out of you.

The second article was published a couple of days ago in the “Journal of Big Data”. It's titled “Efficient storage of heterogeneous geospatial data in spatial databases”, and here I am the sole author. The premise? Is NoSQL just a god-damn fad for lazy developers with a fear of database schemas? The conclusion? Pretty much. And PostGIS is cool. Or, in more scholarly terms:

The no-schema approach of NoSQL document stores is a tempting solution for importing heterogeneous geospatial data to a spatial database. However, this approach means sacrificing the benefits of RDBMSes, such as existing integrations and the ACID principle. Previous comparisons of the document-store and table-based layout for storing geospatial data favour the document-store approach but do not consider importing data that can be segmented into homogenous datasets. In this paper we propose “The Heterogeneous Open Geodata Storage (HOGS)” system. HOGS is a command line utility that automates the process of importing geospatial data to a PostgreSQL/PostGIS database. It is developed in order to compare the performance of a traditional storage layout adhering to the ACID principle, and a NoSQL-inspired document store. A collection of eight open geospatial datasets comprising 15 million features was imported and queried in order to compare the differences between the two storage layouts. The results from a quantitative experiment are presented and show that large amounts of open geospatial data can be stored using traditional RDBMSes using a table-based layout without any performance penalties.

• I managed to create a stupid acronym: HOGS
• The manuscript was first left in a drawer for five months, before the editor decided it wasn’t fit for the journal

The next journal provided such great reviews as

If you are importing a relatively static dataset such as the toppological dataset of Norway does it really matter if the import takes 1 hr 19 mins vrs 3 hours? It is very likely that this import will only be performed once every couple of months minimum. A DB admin is likely to set this running at night time and return in the morning to an imported dataset.

and

You are submitting your manuscript as “research article” to a journal that only cares about original research and not technical articles or database articles. For this reason, you need to justify the research behind it. The current state of the paper looks like a technical report. Again, an interesting and well-written one, but not a research article.

And the last reviewer (#2, why is it always reviewer #2?), who did not like the fact that I argued with him instead of doing what he said, and whose last comment was that I should add a section: “structure of the paper”. Well, I like the fact that some quality control is applied, but this borders on the ridiculous.

Well, so there you have it: three articles down (this was the first), at least one to go.

## Some thoughts about localization of Openstreetmap based maps

Following this tweet about a request for localized maps on osm.org, I would like to share some thoughts on this topic.

OpenStreetMap is very useful in general, but, for me, is almost unusable in countries with non-Latin alphabets. The best option I've seen so far is this German-named map https://t.co/a6mQGWfY1P. It solves 95% of the problem, but does anyone know an English-named equivalent?

— Laurence Tratt (@laurencetratt) August 31, 2018

My first versions of the localization code used in the German style date back to 2012. Back then I had the exact same problem as Laurence when using OSM-based maps in regions of the world where Latin script is not the norm, and thus I started developing the localization code for the German style.

Fortunately I was able to improve this code in December 2015 as part of a research project during my day job.

I also gave some talks about it in 2016 at FOSSGIS and FOSS4G conferences.
Recordings and slides of these talks are available at the l10n wiki.

Map localization seems to be mostly unprecedented in traditional GIS applications, as before OpenStreetMap there was no such thing as a global dataset of geographical data.

Contrary to my initial expectation, doing localization „good enough“ is not an easy task, and I learned a lot about writing systems that, in fact, I never even wanted to know.

What I intend to share here is basically the dos and don’ts of map localization.

Currently my code is implemented mostly as PostgreSQL stored procedures, which was a good idea back in 2012, when rendering almost always involved PostgreSQL/PostGIS at some stage anyway. This will likely change in a vector-tile-only toolchain in the future. To take this into account, in the meantime I also have a proof-of-concept implementation written in Python.

So what is the current state of affairs?

Basically there are two functions which will output either a localized street name or place name, using an associative array of tags and a geometry object as input. In the output, street names are separated by „-“ while place names are usually two-line strings. Additionally, street names are abbreviated whenever possible (if I know how to do this in a particular language). Feel free to send patches if your language does not contain abbreviations yet!

Initially I used to put the localized name in parentheses, but this is not a very good idea for various reasons. First of all, which one would be the correct name to put in parentheses? And even more important, what would one do in the case of scripts like Arabic or Hebrew? So I finally got rid of the parentheses altogether.

What else does the code do, how does it do it, and what's the rationale behind it?

There are various regions of the world with more than one official language. In those regions the generic name tag will usually contain both names, which only makes sense if this tag alone is rendered, as osm-carto does.

So what to do in those cases?

Well, if the desired target-language name is part of the generic name tag, just use that one and avoid duplicates at any cost! As an example, let's take Bolzano/Bozen in the autonomous province of South Tyrol. The official languages there are Italian and German, thus the generic name tag will be „Bolzano – Bozen“. Doing some search magic in various name tags, we will end up using „Bolzano Bozen“ in the German localization and the unaltered „Bolzano – Bozen“ in the English localization, because there is no name:en tag.

But what to do if the name contains non-Latin scripts?

The main rationale behind my whole code is that the mapper is always right and that automatic transcription should only be used as a last resort.

This said, please do not tag transcriptions as localized names in any case, because they will be redundant at best and plain wrong at worst. This is a job that computers should be able to do better. Also, never map automated transcriptions.

Transcriptions might be mapped in cases where they are printed on an official place-name sign. In this case, please use the appropriate tag, like name:ja_rm or name:ko-Latn, and not something like name:en or name:de.

(Image ©Heinrich Damm Wikimedia Commons CC BY-SA 3.0)

Correct tagging (IMO) should be:
name=ถนนเยาวราช
name:th=ถนนเยาวราช
name:th-Latn=thanon yaoverat
name:en=CHINA TOWN

So, a few final words about transcription and the code currently in use. Please keep in mind that transcription is always done as a last resort, only in cases where there are no suitable name tags on the object.

Some readers may already know the difference between transcription and transliteration; nevertheless, some may not, so I will explain it. While transliteration is fully reversible, transcription might not always be. So in the case of rendered maps, transcription is likely what we want, because we do not care about a reversible algorithm here.

First I started with a rather naive approach: I just used the Any-Latin transliteration code from libicu. Unfortunately this was not a very good idea in a couple of cases, thus I went for a somewhat more sophisticated approach.

So here is how the current code performs transcription:

1. Call a function to get the country where the object is located
(this function is actually based on a database table from Nominatim)
2. If the country in question has a country-specific transcription algorithm, use that; otherwise use libicu.

Currently, in Japan kakasi is used instead of libicu in order to avoid Chinese transcriptions, and in Thailand some Python code is used, because libicu uses a rarely used ISO standard transliteration instead of the more common Royal Thai General System of Transcription (RTGS).
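The libicu fallback can be tried directly on the command line with ICU's uconv tool, which exposes the same Any-Latin transliterator (assuming the ICU command-line utilities are installed):

```shell
# Sketch: transliterate a non-Latin name to Latin script with ICU.
# For Thai input this yields the ISO-style transliteration rather than
# RTGS, which is why the country-specific code paths exist.
echo "Αθήνα" | uconv -f utf-8 -t utf-8 -x "Any-Latin"
```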

There are still a couple of other issues. The most notable one is likely that transcription of Arabic is far from perfect, as vowels are usually not part of names in this case. Furthermore, transcription based on pronunciation is difficult, as Arabic script is used for very different languages.

So where to go from here?

Having localized rendering on osm.org for every requested language is unrealistic using current technology, as any additional language will double the effort of map rendering. Moreover, my current code might even produce some strange results when non-Latin output languages are selected.

This said, it would be very easy to set up a tile server with localized rendering in any target language using Latin script. For this purpose you would not even need to use the German Mapnik style, as I also maintain a localized version of the vanilla OSM Carto style.

Actually I have a Tileserver running this code with English localization at my workplace.

So, for a map with English localization, http://www.openstreetmap.us/ or
http://www.openstreetmap.co.uk would be the right place to host such a map.

So why not implement this on osm.org? I suppose that this should be done as part of the transition to vector tiles, whenever that happens. As the back-end technology of the vector-tile server is not yet known, I cannot tell how suitable my code would be for this case. It might need to be rewritten in C++ for this purpose. As I already wrote, I have a proof-of-concept implementation written in Python which can be used to localize osm/pbf files.

## Map rendering on EC2

Over the last two years I’ve been running the OpenCycleMap tileserver on Amazon’s EC2 service. Plenty of other people do the same, and I get asked about it a lot when I’m doing consulting for other companies. I thought it would be good to take some time to say a bit about my experiences, and maybe this will be useful to you at some point.

EC2 is great if you have a need for lots and lots of computing power, and your need for using CPUs fluctuates. At its best, you have a task that needs hundreds of CPUs, but only for a few hours. So you can spin up as many instances as you like, do your task, and switch them back off again. Map rendering, and here I’m talking about mapnik/mod_tile rendering of OpenStreetMap data, initially seems to hit that use-case – generating map tiles involves lots of processing of the map data, and then you have your finished map images which are trivial to serve.

But that’s not really the case, it turns out. After you’ve finished experimenting with small areas and start moving to a global map, you find that disk IO is by far the most important thing. There are two stages to the data processing: import and rendering. During import you take a 10 GB OpenStreetMap planet file and feed it into PostGIS with osm2pgsql. You want to use osm2pgsql --slim (to allow diff updates), but that involves huge amounts of writing to and reading from disk for the intermediate tables. It can take literally weeks to import. When you’re rendering, renderd lifts the data from the database, renders it, writes the tiles back to disk, and then mod_tile reads the disk store to send the tiles to the client. All in all, lots of disk activity. And hugely more if you mention contours or hillshading.
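The diff updates mentioned here are the payoff of --slim: after the initial import, change files can be applied incrementally instead of re-importing the planet. A sketch, where the change-file and database names are assumptions:

```shell
# Sketch: apply a downloaded OSM change file to a database that was
# imported with --slim (append mode reuses the intermediate tables).
osm2pgsql --append --slim -d planetosm changes.osc.gz
```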

Which wouldn’t be too bad, except the disks on EC2 suck. It’s not a criticism, since it’s an Elastic Compute Cloud, not an Elastic Awesome-Disks Cloud. It’s a system designed for doing calculations, not handling reading and writing huge datasets to and from disk. So their virtual disks are much slower than you would like or expect from the rest of the specs. On the opencyclemap “large” EC2 instance, roughly one core is being used for processing, and the rest is all blocked on IO. Although it’s marked as having “high” IO performance on their instance types page, I’d suggest for “moderate” and “high” you should read “dreadful” and “merely poor” respectively.

Amazon’s S3 is their storage component of their Web Services suite. So instead of thrashing the disks on EC2, how about storing tiles on S3? It’s possible, but the main drawback is that it makes it much, much harder to generate tiles on-the-fly. If you point your web app at an S3 bucket there’s no way that I know of to pass 404s onto an EC2 instance to fulfil. If you’re happy with added latency, then you could still run a server that queries S3 before deciding to render, and copy the output to S3, but I can’t imagine that being faster than using EC2’s local storage. You can certainly use S3 to store limited tilesets, such as limited geographical areas or a limited number of zooms. But pre-generating a full planet’s worth of z18 tiles would take up terabytes of space, and only a vanishingly small number of tiles would ever be served.

Finally, there is the cost of running a tileserver. Although Amazon are quite cheap if you want a hundred servers for a few hours, the costs start mounting if you have only one server running 24 hours a day, which is what you need from a tileserver or any other kind of webserver. $0.34 per hour seems reasonable until you price the first four weeks of uptime, where all kinds of non-cloud providers come into play, simply charging monthly rent on a server instead. Factoring in bandwidth costs for a moderately well-used tileserver can make it mightily expensive. Any extras can be added too: EBS if you want your database to survive the instance being pulled, or S3 storage.

EC2 is, more or less, exactly not what you want from a tileserver: expensive to run, slow disks. So why is it popular? First off, buzzwords: cloud, scalable and so on. If you aren’t careful you can easily empty the piggybank on running a handful of tileservers long before you’re running enough of them to do proper demand-based scaling from hour to hour during the day. If you’re trying to “enterprise” your system you’ll worry about failovers long before you need such elastic scaling, and you need your failovers and load balancers running 24×7 too. Second is capacity planning: if you want to do no planning whatsoever, then EC2 is great! But it’s much cheaper to rent a few servers for the first couple of months, and add more to your pool when (if?) your tileserver gets popular. But there is a third reason that is quite cool, for people like Development Seed’s TileMill: you can give your tileserver image to someone else extremely easily, and it’s their credit card that gets billed, and they can turn on and off as many servers as they like without hassling you.

I’ve been setting up a new tileserver for OpenCycleMap that’s not on EC2, and I’ll post here again later with details of how I got on. I’m also working on another couple of map styles – with terrain data, of course, and if you’re interested in hearing more then get in touch.

• I’d recommend EC2 if you want to pre-generate large numbers of tiles (say a continent down to z16), copy them somewhere and then switch off the renderer
• I’d consider EC2 for ultra-large setups where you are running 5 or more tileservers already, but only as additional-load machines
• I wouldn’t recommend EC2 if you want to run an on-the-fly tileserver. Which is what most people want to do.

Any thoughts? Running a tileserver on EC2 and disagree? Let me know below.

## The SOSI-format: The crazy, Norwegian, geospatial file format

Imagine trying to coordinate the exchange of geospatial data long before the birth of the Shapefile, before XML and JSON were thought of. Around the time when “microcomputers” were _really_ new and mainframes were the norm. Before I was born.

Despite this (rather sorry) state of affairs, you realize that the “growing use of digital methods in the production and use of geospatial data raises several coordination issues” [my translation]. In addition, “there is an expressed wish from both software companies and users of geospatial data that new developments do not lead to a chaos of digital information that cannot be used without in-depth knowledge and large investments in software” [my translation].

Pretty forward-thinking, if you ask me. Who was thinking about this in 1980? Turns out that two Norwegians, Stein W. Bie and Einar Stormark, did, by writing this report.

This report is fantastic. It’s the first hint of a format that Norwegians working with geospatial data (and few others) still have to relate to today. The format, known as the “SOSI-Format” (not to be confused with the SOSI Standard), is a plaintext format for representing points, lines, areas and a bunch of other shapes, in addition to attribute data.

My reaction when I first encountered this format some 8 years ago was “what the hell is this?”, and I started on a crusade to get rid of the format (“there surely are better formats”). But I was hit by a lot of resistance. Partly because I confused the format with the standard, partly because I was young and did not know my history, partly because the format is still in widespread use, and partly because the format is in many ways really cool!

So, I started reading up on the format a bit (and made a parser for it in JavaScript, sosi.js). One thing that struck me was that a lot of things I’ve seen popping up lately have been in the SOSI-format for ages. Shared borders (as in TopoJSON)? Check! Local origins (to save space)? Check! Complex geometries (like arcs)? Check!
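The “local origins” trick mentioned above can be illustrated with a short sketch: coordinates are stored as compact integer offsets from a declared origin and unit, and reconstructed on read. This is not actual SOSI parsing code; the origin and unit values below are made up for illustration, and a real file declares them in its header.

```python
# Illustrative sketch of coordinates stored as integer offsets from a
# local origin. Values are invented for the example, not from a real file.
origin = (6_500_000, 250_000)   # (northing, easting) of the local origin
unit = 0.01                     # metres per stored integer step

def to_real(stored_n, stored_e):
    """Convert stored integer offsets to real-world coordinates."""
    return (origin[0] + stored_n * unit, origin[1] + stored_e * unit)

to_real(123456, 654321)  # roughly (6501234.56, 256543.21)
```

The win is that the file only carries small integers instead of full-precision coordinates, which mattered a great deal on 1980s storage media.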

But, what is it like? It’s a file written in what’s referred to as “dot-notation” (take a look at this file and you’ll understand why). The format was inspired by the British/Canadian format FILEMATCH and a French database system called SIGMI (anyone?).
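To give a feel for the dot-notation, here is a minimal sketch of how such a file might be tokenised: the number of leading dots gives the nesting level of each element. The sample content and the parser are illustrative only, not a complete or validated implementation of the format.

```python
# Minimal sketch of reading SOSI-style "dot-notation": leading dots give
# the nesting level. Sample content is illustrative, not a valid .SOS file.
SAMPLE = """\
.HODE
..TEGNSETT UTF-8
.PUNKT 1:
..OBJTYPE Skiltpunkt
..NØ
6540000 290000
.SLUTT"""

def parse_sosi(text):
    """Build a flat list of (level, name, value) tuples from dot-notation."""
    elements = []
    for line in text.splitlines():
        if line.startswith("."):
            stripped = line.lstrip(".")
            level = len(line) - len(stripped)
            name, _, value = stripped.partition(" ")
            elements.append((level, name, value))
        elif elements:
            # Bare lines (e.g. coordinate pairs) belong to the previous element
            level, name, value = elements[-1]
            elements[-1] = (level, name, (value + " " + line).strip())
    return elements

print(parse_sosi(SAMPLE))
```

A real reader would of course build a tree from the levels and interpret the element names; the point here is just how little machinery the plaintext notation needs.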

The format is, as stated, text (i.e. ASCII) based, the reason being that this ensured data could be stored and transferred on a wide range of media. At the time of writing the report, there existed FORTRAN implementations (for both Nord-10/S and UNIVAC 1100) for reading and writing. Nowadays, there exist several closed-source readers and writers for the format (implemented by several Norwegian GIS vendors), in addition to several open-source readers.

The format is slated for replacement by some GML variation, but we are still waiting. There is also GDAL/OGR support for the format, courtesy of Statens Kartverk. However, this requires a lot of hoop-jumping and make-magic on Linux. In addition, the current implementation does not work with UTF-8, which is a major drawback as most .SOS files these days are in fact UTF-8.

So, there we are. The official Norwegian format for exchange of geographic information in 2018 is a nearly 40 year old plain text format. And the crazy thing is that this Norwegian oddity is something other countries are envious of, as we actually have a common, open (!) standard for this purpose, not some de-facto, reverse-engineered binary format.

And indeed, why should the age of a format be an issue, as long as it works?

## Map rendering on EC2

Over the last two years I’ve been running the OpenCycleMap tileserver on Amazon’s EC2 service. Plenty of other people do the same, and I get asked about it a lot when I’m doing consulting for other companies. I thought it would be good to take some time to say a bit about my experiences, and maybe this will be useful to you at some point.

EC2 is great if you have a need for lots and lots of computing power, and your need for using CPUs fluctuates. At its best, you have a task that needs hundreds of CPUs, but only for a few hours. So you can spin up as many instances as you like, do your task, and switch them back off again. Map rendering, and here I’m talking about mapnik/mod_tile rendering of OpenStreetMap data, initially seems to hit that use-case – generating map tiles involves lots of processing of the map data, and then you have your finished map images which are trivial to serve.

But that’s not really the case, it turns out. After you’ve finished experimenting with small areas and start moving to a global map, you find that disk IO is by far the most important thing. There are two stages to the data processing – import and rendering. During import you take a 10 GB OpenStreetMap planet file and feed it into PostGIS with osm2pgsql. You want to use osm2pgsql --slim (to allow diff updates), but that involves huge amounts of writing and reading from disk for the intermediate tables. It can take literally weeks to import. When you’re rendering, renderd lifts the data from the database, renders it, writing the tiles back to disk, and then mod_tile reads the disk store to send the tiles to the client. All in all, lots of disk activity. And hugely more if you mention contours or hillshading.

Which wouldn’t be too bad, except the disks on EC2 suck. It’s not a criticism, since it’s an Elastic Compute Cloud, not an Elastic Awesome-Disks Cloud. It’s a system designed for doing calculations, not handling reading and writing huge datasets to and from disk. So their virtual disks are much slower than you would like or expect from the rest of the specs. On the opencyclemap “large” EC2 instance, roughly one core is being used for processing, and the rest is all blocked on IO. Although it’s marked as having “high” IO performance on their instance types page, I’d suggest for “moderate” and “high” you should read “dreadful” and “merely poor” respectively.

Amazon’s S3 is the storage component of their Web Services suite. So instead of thrashing the disks on EC2, how about storing tiles on S3? It’s possible, but the main drawback is that it makes it much, much harder to generate tiles on-the-fly. If you point your web app at an S3 bucket there’s no way that I know of to pass 404s onto an EC2 instance to fulfil. If you’re happy with added latency, then you could still run a server that queries S3 before deciding to render, and copy the output to S3, but I can’t imagine that being faster than using EC2’s local storage. You can certainly use S3 to store limited tilesets, such as limited geographical areas or a limited number of zooms. But pre-generating a full planet’s worth of z18 tiles would take up terabytes of space, and only a vanishingly small number of tiles would ever be served.

Finally, there is the cost of running a tileserver. Although Amazon are quite cheap if you want a hundred servers for a few hours, the costs start mounting if you have only one server running 24 hours a day – which is what you need from a tileserver or any other kind of webserver. $0.34 per hour seems reasonable until you price out the first four weeks of uptime, at which point all kinds of non-cloud providers come into play, simply charging monthly rent on a server instead. Factoring in bandwidth costs for a moderately well-used tileserver can make it mightily expensive. Any extras can be added too – EBS if you want your database to survive the instance being pulled, or S3 storage.

EC2 is, more or less, exactly not what you want from a tileserver. Expensive to run, slow disks. So why is it popular? First off is buzzwords – cloud, scalable and so on. If you aren’t careful you can easily empty the piggybank running a handful of tileservers long before you’re running enough to do proper demand-based scaling that changes from hour to hour during the day. If you’re trying to “enterprise” your system you’ll worry about failovers long before you need such elastic scaling, and you need your failovers and load balancers running 24×7 too. Second is capacity planning – if you want to do no planning whatsoever, then EC2 is great! But it’s much cheaper to rent a few servers for the first couple of months, and add more to your pool when (if?) your tileserver gets popular. There is a third reason that is quite cool, though – for people like Development Seed’s TileMill, you can give your tileserver image to someone else extremely easily, and it’s their credit card that gets billed, and they can turn on and off as many servers as they like without hassling you.

I’ve been setting up a new tileserver for OpenCycleMap that’s not on EC2, and I’ll post here again later with details of how I got on. I’m also working on another couple of map styles – with terrain data, of course, and if you’re interested in hearing more then get in touch.

• I’d recommend EC2 if you want to pre-generate large numbers of tiles (say a continent down to z16), copy them somewhere and then switch off the renderer
• I’d consider EC2 for ultra-large setups where you are running 5 or more tileservers already, but only as additional-load machines
• I wouldn’t recommend EC2 if you want to run an on-the-fly tileserver. Which is what most people want to do.

Any thoughts? Running a tileserver on EC2 and disagree? Let me know below.

## The Open Geospatial Data Ecosystem

This summer my first peer-reviewed article, “The Open Geospatial Data Ecosystem”, was published in “Kart og plan”. Unfortunately, the journal is not that digital, and they decided to withhold the issue from the web for a year, “in order to protect the printed version”. What?!

However, I was provided a link to a pdf of my article, and told I could distribute it. I interpret this as an approval of me publishing the article on my blog, so that is exactly what I’ll do.