PROBLEM: I want to extract country information from a user description. So far, I'm giving a try with the geograpy package. I like the behavior when the input is not very clear for example in Evesham or Rochdale, however, the package interprets some strings like Zaragoza, Spain
as two mentions while the user is clearing saying that its location is in Spain. Still, I don't know why amsterdam is not giving as output Holland... How can I improve the outputs? Am I missing anything important? Is there a better package to achieve this?
DATA: My data example is:
user_location
2 Socialist Republic of Alachua
3 Hérault, France
4 Gwalior, India
5 Zaragoza,España
7 amsterdam
8 Evesham
9 Rochdale
I want to get something like this:
user_location country
2 Socialist Republic of Alachua ['USSR', 'United States']
3 Hérault, France ['France']
4 Gwalior, India ['India']
5 Zaragoza,España ['Spain']
7 amsterdam ['Holland']
8 Evesham ['United Kingdom']
9 Rochdale ['United Kingdom', 'United States']
REPREX:
import pandas as pd
import geograpy3
df = pd.DataFrame.from_dict({'user_location': {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}})
df['country'] = df['user_location'].apply(lambda x: geograpy.get_place_context(text=x).countries if pd.notnull(x) else x)
print(df)
#> user_location country
#> 2 Socialist Republic of Alachua [USSR, Union of Soviet Socialist Republics, Al...
#> 3 Hérault, France [France, Hérault]
#> 4 Gwalior, India [British Indian Ocean Territory, Gwalior, India]
#> 5 Zaragoza,España [Zaragoza, España, Spain, El Salvador]
#> 7 amsterdam []
#> 8 Evesham [Evesham, United Kingdom]
#> 9 Rochdale [Rochdale, United Kingdom, United States]
Created on 2020-06-02 by the reprexpy package