I have a list of strings (noun phrases) and I want to filter out all valid geographical locations from it. Most of these unwanted location names are country, city, or state names. What would be a way to do this? Is there an open-source lookup table that contains all countries, states, and cities of the world?

Example desired output: TREC4: false, Vienna: true, Ministry: false, IBM: false, Montreal: true, Singapore: true

Unlike in this post: "Verify user input location string is a valid geographic location?", I have a large number of such strings (~0.7 million), so the Google geolocation API is probably not an option for me.

Soumyajit
  • How about en-ner-location.bin from http://opennlp.sourceforge.net/models-1.5/ or something like http://stackoverflow.com/questions/18371092/stanford-named-entity-recognizer-ner-functionality-with-nltk – alvas Jan 08 '16 at 19:34
  • I used NLTK's NER. The Stanford NER tagger looks good; I will give it a try. – Soumyajit Jan 09 '16 at 16:30

2 Answers

You can use the GeoPlanet data from Yahoo, or the GeoNames data from geonames.org. Here is a link to the GeoPlanet TSV file, which contains 5 million geographical places of the world: https://developer.yahoo.com/geo/geoplanet/data/

Moreover, the GeoPlanet data gives you the type (city, country, suburb, etc.) of each geographical place, along with a unique ID: https://developer.yahoo.com/geo/geoplanet/guide/concepts.html

You can do a lowercased, sanitized match (e.g. with special characters and other anomalies removed) of your needle string against the names in this data. If you do not want full file scans, it helps to first load the data into a fast lookup store such as MongoDB or Redis.
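
The matching step described above can be sketched roughly as follows. This is only a minimal illustration, not a tested pipeline: the `normalize`, `build_gazetteer`, and `is_location` helpers are hypothetical names, and `SAMPLE_PLACES` is a tiny hand-written stand-in for the names you would actually load from the GeoPlanet or GeoNames file. An in-memory Python `set` is assumed here in place of MongoDB or Redis; for ~0.7 million lookups against a few million names it may well be enough.

```python
import re
import unicodedata


def normalize(name):
    """Lowercase, strip accents, and collapse non-alphanumeric runs to one space."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def build_gazetteer(place_names):
    """Build a set of normalized place names for constant-time membership tests."""
    normalized = (normalize(name) for name in place_names)
    return {name for name in normalized if name}


def is_location(phrase, gazetteer):
    """Return True if the normalized phrase exactly matches a known place name."""
    return normalize(phrase) in gazetteer


# Assumption: a toy sample standing in for the real gazetteer file.
SAMPLE_PLACES = ["Vienna", "Montréal", "Singapore", "France"]
gazetteer = build_gazetteer(SAMPLE_PLACES)

phrases = ["TREC4", "Vienna", "Ministry", "IBM", "Montreal", "Singapore"]
results = {phrase: is_location(phrase, gazetteer) for phrase in phrases}
```

Exact matching like this will also flag ambiguous words that happen to be place names, so you may want to restrict the gazetteer by place type (country, state, city) before building the set.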

DhruvPathak
  • It looks like Yahoo has stopped offering the dataset for download. They are providing an API instead :\ .... Anyway, I'm looking into it. – Soumyajit Jan 09 '16 at 16:28
  • The database might be available for download from other sources. You can try GeoNames or OpenStreetMap data as well. – DhruvPathak Jan 09 '16 at 16:51

I can suggest the following three options:

a) Using the Alchemy API: http://www.alchemyapi.com/ If you try their demo, places like France and Honolulu are given the entity type Country or City.

b) Using TAGME: http://tagme.di.unipi.it/ TAGME links every entity in a given text to the corresponding Wikipedia page. Crawl that page, check its infobox, and filter on what you find there.

c) Using Wikipedia Miner: I was unable to find relevant links for this, but it works like TAGME.

I suggest trying all three and taking a majority vote for each instance.
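
A majority vote over the three taggers could look roughly like the sketch below. It does not call any of the services: `hypothetical_votes` is made-up data standing in for whatever yes/no location decision you derive from each tool's output, and `majority_vote` is an illustrative helper name.

```python
def majority_vote(votes):
    """Return True when strictly more than half of the votes are True."""
    votes = list(votes)
    return sum(votes) > len(votes) / 2


# Assumption: one boolean per tagger (Alchemy API, TAGME, Wikipedia Miner),
# filled in by hand here purely for illustration.
hypothetical_votes = {
    "France": [True, True, True],
    "Honolulu": [True, False, True],
    "IBM": [False, True, False],
}

decisions = {
    phrase: majority_vote(votes) for phrase, votes in hypothetical_votes.items()
}
```

With three voters there are no ties; if one service fails for a phrase and only two votes remain, this sketch treats a 1-1 split as "not a location".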

Ayushi Dalmia