OSM Name Vandalism Corpus Released

May 24th, 2021

Facebook chooses OpenStreetMap for its global basemap data thanks to OSM’s global coverage and ease of use. OSM’s community mapping approach makes it a natural fit for our needs in Asia, Africa, Latin America, and other geographies outside the range of traditional commercially collected data, but it also introduces a small risk that vandalism might appear in map labels on our pages. To make OSM work for our users, we’ve developed models and processes to detect name vandalism in OSM edits. Today we’re releasing an OSM vandalism corpus for the name attribute of OSM objects to support future research and work on detecting and neutralizing vandalism in open mapping data projects.


For this release, software engineer Yinxiao Li grouped a selection of real OpenStreetMap edits into two categories:

📦 Download the OSM Name Vandalism Corpus: vandalism_corpus_2021-05.csv (70MB). This file is composed of 100% OSM data, released under the terms of the Open Database License.
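For readers who want a first look at the download, here is a minimal sketch of how the file might be previewed without loading all 70MB into memory. It assumes the CSV is UTF-8 encoded, comma-delimited, and has a header row; none of these is confirmed in this post, and the column layout is deliberately not assumed. The helper name `preview_csv` is hypothetical.

```python
import csv
from itertools import islice


def preview_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) from a CSV file.

    Assumptions (not confirmed by the release notes): the file is
    UTF-8 encoded, comma-delimited, and its first row is a header.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        # First row is treated as the header; an empty file yields [].
        header = next(reader, [])
        # islice stops after n_rows, so the rest of the file is never read.
        rows = list(islice(reader, n_rows))
    return header, rows
```

Calling `preview_csv("vandalism_corpus_2021-05.csv")` from the directory that holds the download returns the header and the first five rows, which should be enough to see which columns the corpus provides before writing any real analysis.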

Existing OSM analysis has been slowed by the lack of data like this. Wikipedia has long enjoyed academic interest, and public vandalism datasets based on the project have been available since 2010. OpenStreetMap researchers have not had the benefit of similar data. We hope that this release of categorized OSM data will be useful for future OSM data investigations.