
I want to get the latitude of ~100k entries in a pandas DataFrame. Since I can query geopy only about once per second, I want to make sure I do not query duplicates (most entries should be duplicates, since there are not that many cities).

import time
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="xxx")
df['lat'] = 0
for x in range(1, len(df)):
    # compare row x against every earlier row y
    for y in range(1, x):
        if df['Location'][y] == df['Location'][x]:
            df.at[x, 'lat'] = df.at[y, 'lat']
        else:
            location = geolocator.geocode(df['Location'][x])
            time.sleep(1.2)
            df.at[x, 'lat'] = location.latitude

The idea is to check whether the location is already in the list, and to query geopy only if it is not. However, it is painfully slow and does not seem to do what I intended. Any help or tip is appreciated.

lczapski
hmmmbob

2 Answers


Imports

  • See the geopy documentation for how to instantiate the Nominatim geocoder
import pandas as pd
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here") # specify your application name

Generate some data with locations

d = ['New York, NY', 'Seattle, WA', 'Philadelphia, PA',
    'Richardson, TX', 'Plano, TX', 'Wylie, TX',
    'Waxahachie, TX', 'Washington, DC']
df = pd.DataFrame(d, columns=['Location'])

print(df)
           Location
0      New York, NY
1       Seattle, WA
2  Philadelphia, PA
3    Richardson, TX
4         Plano, TX
5         Wylie, TX
6    Waxahachie, TX
7    Washington, DC

Use a dict to geocode only the unique locations, per this SO post

locations = df['Location'].unique()

# Create dict of geocodings, querying each unique location only once
d = dict(zip(
    locations,
    pd.Series(locations)
    .apply(geolocator.geocode, timeout=10)  # allow each request up to 10 seconds
    .apply(lambda x: (x.latitude, x.longitude))  # get tuple of latitude and longitude
))

# Map dict to `Location` column
df['city_coord'] = df['Location'].map(d)

# Split single column of tuples into multiple (2) columns
df[['lat','lon']] = pd.DataFrame(df['city_coord'].tolist(), index=df.index)

print(df)
           Location                  city_coord        lat         lon
0      New York, NY   (40.7308619, -73.9871558)  40.730862  -73.987156
1       Seattle, WA  (47.6038321, -122.3300624)  47.603832 -122.330062
2  Philadelphia, PA   (39.9524152, -75.1635755)  39.952415  -75.163575
3    Richardson, TX   (32.9481789, -96.7297206)  32.948179  -96.729721
4         Plano, TX   (33.0136764, -96.6925096)  33.013676  -96.692510
5         Wylie, TX   (33.0151201, -96.5388789)  33.015120  -96.538879
6    Waxahachie, TX   (32.3865312, -96.8483311)  32.386531  -96.848331
7    Washington, DC   (38.8950092, -77.0365625)  38.895009  -77.036563
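
One caveat with the dict approach: `geolocator.geocode` returns `None` when it finds no match, so the lambda above would raise an `AttributeError` on an unrecognized location. Below is a minimal sketch of the same unique-values idea with that case handled. The names `geocode_unique`, `fake_geocode` and `FakeLocation` are hypothetical and introduced only for illustration; the stub stands in for the real geocoder so the example runs offline.

```python
import time
from collections import namedtuple

# Stand-in for geopy's Location object (assumption: only .latitude/.longitude are used)
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

def fake_geocode(query):
    """Hypothetical offline stub; pass geolocator.geocode instead in real use."""
    known = {"Plano, TX": FakeLocation(33.0136764, -96.6925096)}
    return known.get(query)  # None for unknown places, as geopy does

def geocode_unique(values, geocode_fn, delay=0.0):
    """Geocode each distinct value once; return {value: (lat, lon) or None}."""
    cache = {}
    for value in values:
        if value in cache:
            continue  # already looked up, so no new request
        location = geocode_fn(value)
        if location is not None:
            cache[value] = (location.latitude, location.longitude)
        else:
            cache[value] = None
        time.sleep(delay)  # pause only after an actual request
    return cache

coords = geocode_unique(["Plano, TX", "Nowhere, ZZ", "Plano, TX"], fake_geocode)
print(coords)
```

With the real geocoder, one would pass `geolocator.geocode` and a delay of at least one second, then use `df['Location'].map(coords)` as in the answer above.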
edesz
  • It times out on me, I presume from calling too quickly. I do not know how to build a time delay into this construction :( – hmmmbob Mar 21 '19 at 00:27
  • @hmmmbob use `timeout=10` in the `apply` for `geocode`. See example usage [here](https://stackoverflow.com/a/27914845/4057186). This sets the timeout on the geocode call to 10 seconds. – edesz Mar 21 '19 at 01:32
  • Unfortunately, I have yet to learn this apply structure, so I do not know where to place the timeout. In the link you provided (thank you for that), it is written as geocode(query, timeout=10), which unfortunately looks very different to me. – hmmmbob Mar 23 '19 at 01:35
  • @hmmmbob, please see the updated answer if this is what you're after. – edesz Mar 30 '19 at 19:07
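
For readers unsure where the keyword goes: `Series.apply` forwards extra keyword arguments to the function it calls, and `functools.partial` achieves the same thing by binding the argument beforehand. The sketch below uses a hypothetical stub, `fake_geocode`, in place of `geolocator.geocode` so that it runs without network access.

```python
from functools import partial

import pandas as pd

def fake_geocode(query, timeout=1):
    """Hypothetical stub that simply reports the timeout it was called with."""
    return (query, timeout)

places = pd.Series(["Plano, TX", "Wylie, TX"])

# Option 1: extra keyword arguments given to apply are passed on to the function
via_kwargs = places.apply(fake_geocode, timeout=10)

# Option 2: bind the keyword first, then apply the resulting one-argument function
via_partial = places.apply(partial(fake_geocode, timeout=10))

print(via_kwargs.tolist())
print(via_partial.tolist())
```

Both forms end up calling the function as `fake_geocode(query, timeout=10)` for each element, which matches the `geocode(query, timeout=10)` form from the linked post.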

Prepare the initial dataframe:

import pandas as pd

df = pd.DataFrame({
    'some_meta': [1, 2, 3, 4],
    'city': ['london', 'paris', 'London', 'moscow'],
})

df['city_lower'] = df['city'].str.lower()
df
Out[1]:
   some_meta    city city_lower
0          1  london     london
1          2   paris      paris
2          3  London     london
3          4  moscow     moscow

Create a new DataFrame with unique cities:

df_uniq_cities = df['city_lower'].drop_duplicates().to_frame()
df_uniq_cities
Out[2]:
  city_lower
0     london
1      paris
3     moscow
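
Lowercasing is one way to merge different spellings of the same city before deduplicating. A slightly broader sketch follows, assuming stray whitespace is also a source of near-duplicates in the data; the sample values are made up for illustration.

```python
import pandas as pd

cities = pd.Series(["london", " Paris", "London ", "moscow", "paris"])

# Normalize: trim surrounding whitespace, then lowercase
normalized = cities.str.strip().str.lower()

# Each remaining value corresponds to one geocoding request
unique_cities = normalized.drop_duplicates().tolist()
print(len(cities), "rows ->", len(unique_cities), "geocoding requests")
```

The fewer distinct values remain after normalization, the fewer rate-limited requests are needed.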

Run geopy's geocode on that new DataFrame:

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")

from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

df_uniq_cities['location'] = df_uniq_cities['city_lower'].apply(geocode)
# Or, instead, do this to get a nice progress bar:
# from tqdm import tqdm
# tqdm.pandas()
# df_uniq_cities['location'] = df_uniq_cities['city_lower'].progress_apply(geocode)

df_uniq_cities
Out[3]:
  city_lower                                           location
0     london  (London, Greater London, England, SW1A 2DU, UK...
1      paris  (Paris, Île-de-France, France métropolitaine, ...
3     moscow  (Москва, Центральный административный округ, М...

Merge the initial DataFrame with the new one:

df_final = pd.merge(df, df_uniq_cities, on='city_lower', how='left')
df_final['lat'] = df_final['location'].apply(lambda location: location.latitude if location is not None else None)
df_final['long'] = df_final['location'].apply(lambda location: location.longitude if location is not None else None)
df_final
Out[4]:
   some_meta    city city_lower                                           location        lat       long
0          1  london     london  (London, Greater London, England, SW1A 2DU, UK...  51.507322  -0.127647
1          2   paris      paris  (Paris, Île-de-France, France métropolitaine, ...  48.856610   2.351499
2          3  London     london  (London, Greater London, England, SW1A 2DU, UK...  51.507322  -0.127647
3          4  moscow     moscow  (Москва, Центральный административный округ, М...  55.750446  37.617494

The key to resolving your issue with timeouts is geopy's RateLimiter class. Check out the docs for more details: https://geopy.readthedocs.io/en/1.18.1/#usage-with-pandas
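
To make the idea concrete, here is a rough sketch of what a rate-limiting wrapper does. It is not geopy's implementation: `rate_limited` is a hypothetical helper written for illustration, and geopy's own `RateLimiter` is the better choice in practice.

```python
import time

def rate_limited(func, min_delay_seconds):
    """Wrap func so consecutive calls start at least min_delay_seconds apart."""
    last_call = [None]  # mutable cell holding the start time of the previous call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)  # hold back until the minimum delay has passed
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# str.upper stands in for a geocoding call so the example runs offline
slow_upper = rate_limited(str.upper, min_delay_seconds=0.2)

start = time.monotonic()
results = [slow_upper(city) for city in ["london", "paris", "moscow"]]
elapsed = time.monotonic() - start
print(results, "took at least", round(elapsed, 1), "seconds")
```

The first call goes through immediately; each later call waits until the minimum delay since the previous one has elapsed, which is the behaviour that keeps the requests within the service's rate limit.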

KostyaEsmukov