
I am working on a dataset of user comments that mention locations. I am using nltk (StanfordNERTagger) and spaCy to extract the locations. The problem is that they appear in many different forms, e.g. nyc vs. New York City, ga vs. Georgia. Additionally, I want to retrieve the state for a given city. Is there a library or other way to normalize these in Python? Ideally a library that works like this:

g = geo_classify('New York City')
g.cities() => ['New York City']
g.states() => ['New York']
g.countries() => ['United States']

I tried Geograpy3, but it missed some cities and all abbreviations, and it didn't give me the state for a given city. Any suggestions?

Big_Mac
  • I can't give you a specific tool, but I think the problem you describe is called "entity linking", the task of linking named entities to some real world knowledge base. spaCy has [the functionality](https://spacy.io/usage/linguistic-features#entity-linking), but they don't seem to provide a pretrained model. – fsimonjetz Jun 29 '21 at 14:57
  • Thanks @fsimonjetz, I'll look into that. Settled on using Google's Geocoding API to use the locations I pulled out as a search. Not ideal and costs a few bucks but works fine for my case atm. – Big_Mac Jul 01 '21 at 15:28

1 Answer


@Big_Mac - I am one of the committers of Geograpy. Thank you for trying Geograpy3. You might want to use the locator interface and the newest release of geograpy3, which is due soon. There are now a CityLookup, a RegionLookup and a CountryLookup, which take the different labels recorded in Wikidata into account.

Here is a "preview" of what to expect. Internally the following SQL database query will be used:

NY example

query

select * from cityLookup where label='New York City'

result

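| label | level | locationKind | wikidataid | name | geoNameId | regionId | countryId | pop | lat | lon | partOfRegionId | gndId | regionName | regionIso | regionPop | regionLat | regionLon | countryName | countryIso | CountryLat | CountryLon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| New York City | 5 | City | Q60 | New York City | 5128581 | Q1384 | Q30 | 8398748 | 41 | -74 | Q1384 | 4042011-5 | New York | US-NY | 19795791 | 43 | -75 | United States of America | US | 40 | -99 |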
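
To make the query concrete, here is a minimal sketch that runs it against an in-memory SQLite table. This is not geograpy3's actual schema or API: the table is a hypothetical, trimmed-down stand-in for cityLookup, the `NYC` alias row is an assumption added only to illustrate label-based normalization, and `lookup_city` is a helper name made up for this example.

```python
import sqlite3

# Hypothetical, trimmed-down stand-in for the cityLookup table:
# only a few of the columns from the result above are kept.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    """CREATE TABLE cityLookup (
           label TEXT, wikidataid TEXT, name TEXT,
           regionName TEXT, regionIso TEXT,
           countryName TEXT, countryIso TEXT)"""
)
rows = [
    # Values copied from the result row above.
    ("New York City", "Q60", "New York City", "New York", "US-NY",
     "United States of America", "US"),
    # Assumed alias row, present only to illustrate the idea of several
    # labels pointing at the same Wikidata item.
    ("NYC", "Q60", "New York City", "New York", "US-NY",
     "United States of America", "US"),
]
conn.executemany("INSERT INTO cityLookup VALUES (?,?,?,?,?,?,?)", rows)


def lookup_city(label):
    """Return the first cityLookup row for the label as a dict, or None."""
    cur = conn.execute("SELECT * FROM cityLookup WHERE label = ?", (label,))
    row = cur.fetchone()
    return dict(row) if row is not None else None


for text in ("New York City", "NYC", "nowhere"):
    hit = lookup_city(text)
    if hit is None:
        print(f"{text}: no match")
    else:
        print(f"{text}: {hit['name']}, {hit['regionName']}, {hit['countryName']}")
```

Because every label row carries the region and country columns, a single lookup by label yields the normalized city name, the state and the country in one step.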

Here is some example code I am currently using to test the current state of geograpy3:

'''
Created on 2021-08-11

@author: wf
'''
from geograpy.locator import LocationContext
from OSMPythonTools.nominatim import Nominatim
import os
import logging

class LocationLookup:
    '''
    lookup locations
    '''
    preDefinedLocations={
        "Not Known": None,
        "Online": None,
    }
    other={
        "Washington, DC, USA": "Q61",
        "Bangalore": "Q1355",
        "Bangalore, India": "Q1355",
        "Xi'an": "Q5826",
        "Xi'an, China": "Q5826",
        "Virtual Event USA": "Q30",
        "Virtual USA": "Q30",
        "London United Kingdom": "Q84",
        "Brno":"Q14960",
        "Cancun":"Q8969",
        "St. Petersburg": "Q656",
        "Gothenburg Sweden": "Q25287",
        "Los Angeles California": "Q65",
        "Zurich, Switzerland": "Q72",
        "Barcelona Spain": "Q1492",
        "Vienna Austria": "Q1741",
        "Seoul Republic of Korea": "Q8684",
        "Seattle WA USA": "Q5083",
        "Singapore Singapore":"Q334",
        "Tokyo Japan": "Q1490",
        "Vancouver BC Canada": "Q24639",
        "Vancouver British Columbia Canada": "Q24639",
        "Amsterdam Netherlands":"Q727",
        "Paris France": "Q90",
        "Nagoya": "Q11751",
        "Marrakech":"Q101625",
        "Austin Texas":"Q16559",
        "Chicago IL USA":"Q1297",
        "Bangkok Thailand":"Q1861",
        "Firenze, Italy":"Q2044",
        "Florence Italy":"Q2044",
        "Timisoara":"Q83404",
        "Langkawi":"Q273303",
        "Beijing China":"Q956",
        "Berlin Germany": "Q64",
        "Prague Czech Republic":"Q1085",
        "Portland Oregon USA":"Q6106",
        "Portland OR USA":"Q6106",
        "Pittsburgh PA USA":"Q1342",
        "Новосибирск":"Q883",
        "Los Angeles CA USA":"Q65",
        "Kyoto Japan": "Q34600"
    }

    def __init__(self):
        '''
        Constructor
        '''
        self.locationContext=LocationContext.fromCache()
        cacheRootDir=LocationContext.getDefaultConfig().cacheRootDir
        cacheDir=f"{cacheRootDir}/.nominatim"
        if not os.path.exists(cacheDir):
            os.makedirs(cacheDir)
            
        self.nominatim = Nominatim(cacheDir=cacheDir)
        logging.getLogger('OSMPythonTools').setLevel(logging.ERROR)
        
        
    def getCityByWikiDataId(self,wikidataID:str):
        '''
        get the city for the given wikidataID
        '''
        citiesGen=self.locationContext.cityManager.getLocationsByWikidataId(wikidataID)
        if citiesGen is not None:
            cities=list(citiesGen)
            if len(cities)>0:
                return cities[0]
        return None
        
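    def lookupNominatim(self,locationText:str):
        '''
        lookup the given locationText via Nominatim and map the first hit
        to a city using its wikidata extratag
        '''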
        location=None
        nresult=self.nominatim.query(locationText,params={"extratags":"1"})
        nlod=nresult._json
        if len(nlod)>0:
            nrecord=nlod[0]
            if "extratags" in nrecord:
                extratags=nrecord["extratags"]
                if "wikidata" in extratags:
                    wikidataID=extratags["wikidata"]
                    location=self.getCityByWikiDataId(wikidataID)
        return location
        
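    def lookup(self,locationText:str):
        '''
        lookup the given locationText with both geograpy and Nominatim;
        return None if the two results disagree on the wikidata id
        '''
        lg=self.lookupGeograpy(locationText)
        ln=self.lookupNominatim(locationText)
        if ln is not None and lg is not None and ln.wikidataid!=lg.wikidataid:
            print(f"❌{locationText}→{lg}!={ln}")
            return None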
        return lg
        
    def lookupGeograpy(self,locationText:str):
        '''
        lookup the given location by the given locationText
        '''
        if locationText in LocationLookup.preDefinedLocations:
            locationId=LocationLookup.preDefinedLocations[locationText]
            if locationId is None:
                return None
            else:
                location=self.getCityByWikiDataId(locationId)
                return location
        locations=self.locationContext.locateLocation(locationText)
        if len(locations)>0:
            return locations[0]
        else:
            return None
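
The cross-check in `lookup` above can be tried in isolation without installing geograpy3 or OSMPythonTools. The following is a minimal sketch under stated assumptions: `Place`, `cross_checked_lookup` and the two stub resolvers are hypothetical names made up for this example, and the id `Q-OTHER` is deliberately fake so the two sources disagree.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Place:
    """Hypothetical minimal location record; only the Wikidata id matters here."""
    name: str
    wikidataid: str


Resolver = Callable[[str], Optional[Place]]


def cross_checked_lookup(text: str, primary: Resolver,
                         secondary: Resolver) -> Optional[Place]:
    """Return the primary result unless the secondary resolver contradicts it."""
    first = primary(text)
    second = secondary(text)
    if (first is not None and second is not None
            and first.wikidataid != second.wikidataid):
        return None  # the two sources disagree, so trust neither
    return first


def stub_geograpy(text: str) -> Optional[Place]:
    """Stub standing in for lookupGeograpy (ids taken from the table above)."""
    table = {
        "Paris France": Place("Paris", "Q90"),
        "Portland": Place("Portland", "Q6106"),
    }
    return table.get(text)


def stub_nominatim(text: str) -> Optional[Place]:
    """Stub standing in for lookupNominatim; 'Q-OTHER' is a made-up id."""
    table = {
        "Paris France": Place("Paris", "Q90"),
        "Portland": Place("Portland", "Q-OTHER"),
    }
    return table.get(text)


for text in ("Paris France", "Portland", "Atlantis"):
    print(text, "->", cross_checked_lookup(text, stub_geograpy, stub_nominatim))
```

As in the original method, a hit from the second source alone is not returned; the second source only acts as a veto on the first.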
Wolfgang Fahl