0

This question is not new and was discussed multiple times, but I am new to Python.

Geopy too slow - timeout all the time

Timeout error in Python geopy geocoder

I have a dataset of 11000 geolocations and would like to have their zipcodes.

My data looks like:

   Longitude   Latitude
0 -87.548627  41.728184
1 -87.737227  41.749111
2 -87.743974  41.924143
3 -87.659294  41.869314
4 -87.727808  41.877007

Using this question, I wrote a function, which works for the first 10-20 rows, but gives a timeout error.

# Create a function for zip codes extraction
def get_zipcode(df, geolocator, lat_field, lon_field):
   location = geolocator.reverse((df[lat_field], df[lon_field]))
   return location.raw['address']['postcode']

geolocator = geopy.Nominatim(user_agent = 'my-application')

# Test a sample with 20 rows
test = bus_stops_geo.head(20)

# Extract zip codes for the sample
zipcodes = test.apply(get_zipcode, axis = 1, geolocator = geolocator, 
                           lat_field = 'Latitude', lon_field = 'Longitude')

print(zipcodes)

0     60617
1     60652
2     60639
3     60607
4     60644
5     60659
6     60620
7     60626
8     60610
9     60660
10    60625
11    60645
12    60628
13    60620
14    60629
15    60628
16    60644
17    60638
18    60657
19    60631
dtype: object

I tried to change the timeout time, but failed so far.

My questions:

  • How to achieve this for 11000 rows?
  • How to rewrite this function and return not only zips, but initial long and lat too?
  • Any simple alternative solutions in programming languages like R or using proprietary software (paid options work for me)?

Tremendously appreciate any help!

Anakin Skywalker
  • 2,400
  • 5
  • 35
  • 63

2 Answers2

1

Usage of geopy with pandas is described in docs: https://geopy.readthedocs.io/en/1.22.0/#usage-with-pandas

Solution with geopy:

In [1]: import pandas as pd
   ...:
   ...: df = pd.DataFrame([
   ...:     [-87.548627, 41.728184],
   ...:     [-87.737227, 41.749111],
   ...:     [-87.743974, 41.924143],
   ...:     [-87.659294, 41.869314],
   ...:     [-87.727808, 41.877007],
   ...: ], columns=["Longitude", "Latitude"])

In [2]: from tqdm import tqdm
   ...: tqdm.pandas()
   ...:
   ...: from geopy.geocoders import Nominatim
   ...: geolocator = Nominatim(user_agent="specify_your_app_name_here")
   ...:
   ...: from geopy.extra.rate_limiter import RateLimiter
   ...: reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)

In [3]: df['Location'] = df.progress_apply(
   ...:     lambda row: reverse((row['Latitude'], row['Longitude'])),
   ...:     axis=1
   ...: )
100%|█████████████████████████████| 5/5 [00:06<00:00,  1.24s/it]

In [4]: def parse_zipcode(location):
   ...:     if location and location.raw.get('address') and location.raw['address'].get('postcode'):
   ...:         return location.raw['address']['postcode']
   ...:     else:
   ...:         return None
   ...: df['Zipcode'] = df['Location'].apply(parse_zipcode)

In [5]: df
Out[5]:
   Longitude   Latitude                                           Location Zipcode
0 -87.548627  41.728184  (Olive Harvey College South Chicago Learning C...   60617
1 -87.737227  41.749111  (7900, South Kilpatrick Avenue, Chicago, Cook ...   60652
2 -87.743974  41.924143  (4701, West Fullerton Avenue, Beat 2522, Belmo...   60639
3 -87.659294  41.869314  (1301-1307, West Taylor Street, Near West Side...   60607
4 -87.727808  41.877007  (4053, West Jackson Boulevard, West Garfield P...   60644

If paid options work for you, consider using any other geocoding service than the free Nominatim, such as MapQuest or PickPoint.

KostyaEsmukov
  • 848
  • 6
  • 11
  • Kostya, what is the limit of your solution? In terms of returned zip codes? Paid works if there is a good guideline on how to do it. Thanks! – Anakin Skywalker May 25 '20 at 20:15
  • 1
    Nominatim has a limit on rate (1 request per second), see https://operations.osmfoundation.org/policies/nominatim/ . MapQuest and PickPoint limit by number of queries, see https://developer.mapquest.com/plans and https://pickpoint.io/ – KostyaEsmukov May 26 '20 at 18:32
0

This is not a Python solution and a different approach with shapefiles, but for benefits of the SO community I will post an answer with R (a tip from @Jaap and this answer).

First, you need the US zipcodes shapefile (for US location).

# Install packages
library(sp)
library(rgdal)
library(tidyverse)

# Import zips shapefile and transform CRS 
zips <- readOGR("cb_2015_us_zcta510_500k.shp")
zips <- spTransform(zips, CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

# Read bus stops data
bus_stops <- read_csv("CTA_BusStops.csv")

# A tibble: 6 x 14
SYSTEMSTOP OBJECTID the_geom                            STREET     CROSS_ST          
DIR   POS   ROUTESSTPG OWLROUTES CITY   STATUS PUBLIC_NAM          POINT_X POINT_Y
   <dbl>    <dbl> <chr>                               <chr>      <chr>             
<chr> <chr> <chr>      <chr>     <chr>   <dbl> <chr>                 <dbl>   <dbl>
1      11953      193 POINT (-87.54862703700002 41.72818~ 92ND STRE~ BALTIMORE         
EB    NS    95         NA        CHICA~      1 92nd Street & Balt~   -87.5    41.7
2       2723      194 POINT (-87.737227163 41.7491110710~ 79TH STRE~ KILPATRICK (east~ 
EB    NS    79         NA        CHICA~      1 79th Street & Kilp~   -87.7    41.7
3       1307      195 POINT (-87.74397362600001 41.92414~ FULLERTON  KILPATRICK        
EB    NS    74         NA        CHICA~      1 Fullerton & Kilpat~   -87.7    41.9
4       6696      196 POINT (-87.65929365400001 41.86931~ TAYLOR     THROOP            
EB    NS    157        NA        CHICA~      1 Taylor & Throop       -87.7    41.9
5         22      197 POINT (-87.72780787099998 41.87700~ JACKSON    KARLOV            
EB    FS    126        NA        CHICA~      1 Jackson & Karlov      -87.7    41.9
6       4767      198 POINT (-87.71978000000001 41.97552~ FOSTER     MONTICELLO        
EB    NS    92         NA        CHICA~      1 Foster & Monticello   -87.7    42.0

# Rename two columns for long and lat
bus_stops <- rename(bus_stops, longitude = POINT_X, latitude = POINT_Y)

# Extract long and lat columns only
test_bus_stops <- bus_stops[ , c(13, 14)]

# Transform coordinates into a SpatialPointsDataFrame
bus_stops_spatial <- SpatialPointsDataFrame(coords = test_bus_stops, data = bus_stops, 
proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))
class(bus_stops_spatial) # [1] "SpatialPointsDataFrame"

# Subset only the zipcodes in which points are found
zips_subset <- zips[bus_stops_spatial, ]

# NOTE: the column in zips_subset containing zipcodes is ZCTA5CE10
# Use over() to overlay points in polygons and then add that to the original dataframe
# Create a separate column
bus_stops_zip <- over(bus_stops_spatial, zips_subset[, "ZCTA5CE10"]) 
bus_stops_zip 

I double-checked results long and lat via this link randomly. It matches.

# Merge bus stops with bus stop zips
bus_stops <- cbind(bus_stops, bus_stops_zip)

# Rename zip_code column
bus_stops <- rename(bus_stops, zip_code = ZCTA5CE10)
head(bus_stops)

# Display zip
head(bus_stops[, 13:15])
longitude latitude zip_code
1 -87.54863 41.72818    60617
2 -87.73723 41.74911    60652
3 -87.74397 41.92414    60639
4 -87.65929 41.86931    60607
5 -87.72781 41.87701    60624
6 -87.71978 41.97553    60625

# Extract as csv
write.csv(bus_stops,'bus_stops_zips.csv')
Anakin Skywalker
  • 2,400
  • 5
  • 35
  • 63