I tried a simple demo to check whether geograpy can do what I'm looking for: finding the country name and ISO code in denormalized addresses (which is basically what geograpy was written for!).
The problem is that, in my test, geograpy finds several countries for each address, in most cases including the right one, but I can't find any parameter for deciding which country is the most "correct".
The fake addresses I used, which reflect the kind of real data that might be analyzed, are:

  • John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
  • John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
  • John Doe 30 Huntington Terrace Newark, New York 07112 USA
  • John Doe 22 Huntington Terrace Newark, New York 07112 US
  • Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
  • Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy

This is the simple code written:

import geograpy

# Sample addresses with the country written in different ways
ind = [
    "John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti",
    "John Doe 160 Huntington Terrace Newark, New York 07112 United States of America",
    "John Doe 30 Huntington Terrace Newark, New York 07112 USA",
    "John Doe 22 Huntington Terrace Newark, New York 07112 US",
    "Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia",
    "Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy",
]

locator = geograpy.locator.Locator()
for address in ind:
    # Extract the places mentioned in the address text
    places = geograpy.get_place_context(text=address)
    print(address)
    for country in places.countries:
        # Resolve each country name found to its ISO code
        iso = locator.getCountry(name=country).iso
        print(f"Country:{country}, IsoCode:{iso}")
    print()

and this is the output:

John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US

John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
Country:United States, IsoCode:US
Country:United Kingdom, IsoCode:GB
Country:Netherlands, IsoCode:NL
Country:Jamaica, IsoCode:JM
Country:Argentina, IsoCode:AR

John Doe 30 Huntington Terrace Newark, New York 07112 USA
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US

John Doe 22 Huntington Terrace Newark, New York 07112 US
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US

Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US

Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
Country:Italy, IsoCode:IT
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US

First of all, the biggest problem is that for the Italian address ending in "Italia" (the fifth in the list) it does not find the right country (Italy) at all, and I don't know where the three countries it does find come from.
The second is that in most cases it finds wrong countries in addition to the right one, and I have no indicator such as a confidence percentage or a distance that would tell me whether a located country is acceptable as an answer and, with multiple results, which one is the "best".

I apologize in advance: I haven't had time to study geograpy3 in depth and I don't know if this is a stupid question, but I haven't found anything about confidence/probability/distance in the documentation.

blkid

1 Answer

I am answering as a committer of geograpy3.

It looks like you are using the legacy interface from geograpy version 1 for your first step and only then using the locator. For your use case the improved locator interface is probably a better fit. It can use extra information such as population or GDP per capita to find the "most likely" country when disambiguating.

The Stati Uniti/United States and Italia/Italy issue is a language problem; see the long-standing open issue https://github.com/ushahidi/geograpy/issues/23 of geograpy version 1. As of today there seems to be no corresponding issue in geograpy3 yet, so feel free to file one if you need this improvement.
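Until non-English labels are handled by the library, one possible workaround is to normalize the country label yourself before any lookup. The following is only a minimal sketch under stated assumptions: `COUNTRY_ALIASES` and `trailing_country_iso` are hypothetical names, the alias table is hand-maintained and covers only the labels from the question, and none of this is part of geograpy3.

```python
import re

# Hypothetical, hand-maintained alias table (NOT part of geograpy3).
# It only covers the country labels used in the question.
COUNTRY_ALIASES = {
    "stati uniti": "US",
    "united states of america": "US",
    "united states": "US",
    "usa": "US",
    "us": "US",
    "italia": "IT",
    "italy": "IT",
}


def trailing_country_iso(address):
    """Return the ISO code if the address ends with a known alias, else None."""
    # Drop trailing whitespace and punctuation, then compare case-insensitively.
    text = re.sub(r"[\s,.]+$", "", address).lower()
    # Try longer aliases first so "united states of america" wins over "us".
    for alias in sorted(COUNTRY_ALIASES, key=len, reverse=True):
        # The alias must be a whole trailing word, so "Columbus" does not match "us".
        if re.search(r"(?:^|\W)" + re.escape(alias) + r"$", text):
            return COUNTRY_ALIASES[alias]
    return None
```

This only helps when the country is the last element of the address, as in the examples above. It gives a yes/no match rather than a confidence score.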

I added your example to test_locator.py in the geograpy3 project to show the difference in the concepts:

def testStackOverflow64379688(self):
        '''
        compare old and new geograpy interface
        '''
        examples=['John Doe 160 Huntington Terrace Newark, New York 07112 United States of America',
                  'John Doe 30 Huntington Terrace Newark, New York 07112 USA',
                  'John Doe 22 Huntington Terrace Newark, New York 07112 US',
                  'Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia',
                  'Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy',
                  'Newark','Rome']
        for example in examples:
            city=geograpy.locateCity(example,debug=False)
            print(city)

Result:

None
None
None
None
None
Newark (US-NJ(New Jersey) - US(United States))
Rome (IT-62(Latium) - IT(Italy))
Wolfgang Fahl
  • Thanks for the reply, but I have some doubts: the new geograpy interface you pointed me to cannot find the country in any of my example addresses. As your test shows, the result for all my addresses is "None", whereas I would expect USA for the first 3 and ITA for the last 2. Is there something I'm not understanding? – blkid Oct 22 '20 at 11:11
  • Indeed, your expectations and what the interface does are different. Your expectation needs a different interface: you'd like to go from a sentence whose words may or may not contain geographic information straight to the correct city. IMHO a multi-step process is needed: first find the parts of the text that correspond to cities, then do the extraction. In your case the combination fails because of either step 1 or step 2. As you can see, step 2 on its own works. If step 1 finds Roma, step 2 won't work, since it needs the English name Rome. – Wolfgang Fahl Oct 22 '20 at 14:52
  • In other cases step 1 fails because the context prevents it. If you step through the code with a debugger or switch debug mode on, you'll see what is happening. You may also contact me personally and I might show you the details in an online session if you are interested. – Wolfgang Fahl Oct 22 '20 at 14:55
  • Thanks again for your answer. I had already expected to need several refinement steps, and for splitting the address I found [libpostal](https://github.com/openvenues/libpostal): _The goal is to understand location strings in every language, everywhere_. So I planned to use it to extract the country and then use another library to normalize it and obtain the ISO code. But geograpy3 has the language problem, so I would need a translation step, and I don't think that is easy to do offline on denormalized country names. I'll check internally what we want to achieve! – blkid Oct 23 '20 at 15:11
  • http://wiki.bitplan.com/index.php/Geograpy#Examples has more details on where the data used in geograpy comes from. If you add more country labels in the wikidata queries you'll get more results. You might like https://pypi.org/project/pylodstorage/ which simplifies the conversion from an rdf result to json or a relational database like sqlite. – Wolfgang Fahl Oct 23 '20 at 15:17
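
The two-step idea from the comments (first isolate the part of the text that names a place, then look that part up) could be sketched roughly as follows. This is an illustration under assumptions, not how geograpy3 works internally: `candidate_segments` and `first_located` are hypothetical helper names, and the lookup is passed in as a plain callable so the sketch runs without any database.

```python
import re


def candidate_segments(address):
    """Step 1: yield candidate substrings, comma-separated parts first,
    then single alphabetic words, each group from the end of the address."""
    seen = set()
    parts = [p.strip() for p in address.split(",") if p.strip()]
    # [^\W\d_]+ matches runs of letters only, so "07112" and "256" are skipped.
    words = re.findall(r"[^\W\d_]+", address)
    for segment in list(reversed(parts)) + list(reversed(words)):
        if segment not in seen:
            seen.add(segment)
            yield segment


def first_located(address, locate):
    """Step 2: return (segment, result) for the first segment that the
    `locate` callable resolves, or None if no segment resolves."""
    for segment in candidate_segments(address):
        result = locate(segment)
        if result is not None:
            return segment, result
    return None


# Stand-in lookup for demonstration only; it mimics a locator that
# knows the English names "Newark" and "Rome" and nothing else.
stub_lookup = {"Newark": "Newark (US)", "Rome": "Rome (IT)"}.get

print(first_located("John Doe 22 Huntington Terrace Newark, New York 07112 US", stub_lookup))
print(first_located("Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy", stub_lookup))
```

With the stand-in lookup, the first address resolves through the segment "Newark", while the Italian address yields `None` because only "Roma" appears in the text, which is the language problem discussed above. In principle `geograpy.locateCity` from the answer could be passed as `locate`, but that combination is untested here, and the first segment it resolves is not necessarily the right one.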