I tried a simple demo to check whether geograpy can do what I'm looking for: finding the country name and ISO code in denormalized addresses (which is basically what geograpy is written for!).
The problem is that, in my test, geograpy finds several countries for each address, including the right one in most cases, but I can't find any parameter that tells me which country is the most "correct".
The list of fake addresses I used, which reflects the kind of real data I would need to analyze, is:
- John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
- John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
- John Doe 30 Huntington Terrace Newark, New York 07112 USA
- John Doe 22 Huntington Terrace Newark, New York 07112 US
- Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
- Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
This is the simple code I wrote:
import geograpy

ind = ["John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti",
       "John Doe 160 Huntington Terrace Newark, New York 07112 United States of America",
       "John Doe 30 Huntington Terrace Newark, New York 07112 USA",
       "John Doe 22 Huntington Terrace Newark, New York 07112 US",
       "Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia",
       "Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy"]
locator = geograpy.locator.Locator()
for address in ind:
    places = geograpy.get_place_context(text=address)
    print(address)
    #print(places)
    for country in places.countries:
        print("Country:"+country+", IsoCode:"+locator.getCountry(name=country).iso)
    print()
And this is the output:
John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
Country:United States, IsoCode:US
Country:United Kingdom, IsoCode:GB
Country:Netherlands, IsoCode:NL
Country:Jamaica, IsoCode:JM
Country:Argentina, IsoCode:AR
John Doe 30 Huntington Terrace Newark, New York 07112 USA
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
John Doe 22 Huntington Terrace Newark, New York 07112 US
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
Country:Italy, IsoCode:IT
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US
First of all, the biggest problem is that for the Italian address ending in "Italia" (index 4 in the list) it doesn't find the right country (Italia/Italy) at all, and I don't know where the three countries it does find come from.
The second is that in most cases it finds wrong countries in addition to the right one, and I don't have any indicator such as a confidence percentage, a distance, or anything else that would help me understand whether a located country is an acceptable answer and, when there are multiple results, which one is the "best".
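To make the request more concrete, here is a rough sketch of the kind of ranking I have in mind. It does not use geograpy at all: the ALIASES table and the rank_candidates function are hypothetical names I made up for illustration, and the scoring rule (1.0 if a known country name closes the address, 0.5 if it appears elsewhere, 0.0 otherwise) is only an assumption about what a reasonable score could look like.

```python
import re

# Hypothetical alias table, NOT part of geograpy: illustrative data only.
ALIASES = {
    "US": ["united states of america", "united states", "stati uniti", "usa", "us"],
    "IT": ["italy", "italia"],
    "GB": ["united kingdom", "uk"],
    "JM": ["jamaica"],
}


def rank_candidates(address, iso_codes, aliases=ALIASES):
    """Return (iso, score) pairs sorted best first.

    Score is 1.0 when a known name of the country ends the address,
    0.5 when it appears elsewhere as whole words, 0.0 otherwise.
    """
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", address.lower())
    text = " ".join(text.split())
    scored = []
    for iso in iso_codes:
        score = 0.0
        for name in aliases.get(iso, []):
            if text == name or text.endswith(" " + name):
                score = max(score, 1.0)
            elif re.search(r"\b" + re.escape(name) + r"\b", text):
                score = max(score, 0.5)
        scored.append((iso, score))
    # sorted() is stable, so ties keep the order in which candidates arrived.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    address = "John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti"
    # ISO codes as returned for this address in the output above.
    print(rank_candidates(address, ["GB", "JM", "US"]))
```

With a score like this, the candidates returned for the first address would put US first, while for the "Italia" address every candidate would score 0.0, which would at least tell me that none of them is acceptable.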
I apologize in advance: I haven't had time to study geograpy3 in depth and I don't know if this is a stupid question, but I haven't found anything about confidence/probability/distance in the documentation.