So this may seem like an odd question, but I have a pandas DataFrame with addresses in it, that I want to geocode so I can get the latitude and longitude.

I have code that works using .apply() thanks to this very helpful thread (new column with coordinates using geopy pandas), but my problem is that all of the open APIs have strict limits on how many requests they allow per second, and also per day.

I haven't been able to find any way to throttle my code to match the limits of the APIs. My DataFrame has 25K rows, but I've only been able to geocode successfully when I take a subset of up to 5 rows.

I don't have a lot of experience with python and pandas, but in SAS the DATA steps iterate one row at a time, so I could have a sleep command that would throttle the requests. What would be the best way to implement something like that with python/pandas?

EDIT: Based on the answers so far, I want to confirm that my code would change from:

df_small['city_coord'] = df_small['Address'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))

to:

df_small = df_clean[:5]
def f(x, delay=1):
    # run your code
    sleep(delay)
    return geolocator.geocode(x)

df_small['city_coord'] = df_small['Address'].apply(f).apply(lambda x: (x.latitude, x.longitude))

2 Answers

To iterate with a delay, you can use df.iterrows() and time.sleep():

from time import sleep

for index, row in df.iterrows():  # iterrows() yields (index, row) pairs
    # run your code on `row`
    sleep(1)  # how many seconds to wait

Or you can just put time.sleep() within the apply function itself (as @RafaelC suggests in the comments):

def f(x, delay=1):
    # run your code
    sleep(delay)

df.apply(f, axis=1)  # axis=1 calls f once per row
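
A fixed `sleep(delay)` adds the full delay on top of however long each request already took. Below is a minimal sketch of a wrapper that waits only for the remainder of the interval, using just the standard library. The names `throttle`, `min_interval`, and `fake_geocode` are illustrative and not part of pandas or geopy; `fake_geocode` stands in for a real geocoder so the example runs offline.

```python
import time
from functools import wraps


def throttle(min_interval, clock=time.monotonic, sleep=time.sleep):
    """Return a decorator that spaces calls at least min_interval seconds apart."""
    def decorator(func):
        last_call = None  # time of the previous call; None before the first one

        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal last_call
            if last_call is not None:
                # Wait only for whatever part of the interval is still left.
                remaining = min_interval - (clock() - last_call)
                if remaining > 0:
                    sleep(remaining)
            last_call = clock()
            return func(*args, **kwargs)

        return wrapper
    return decorator


# Hypothetical stand-in for geolocator.geocode, so the sketch runs offline.
@throttle(min_interval=0.05)
def fake_geocode(address):
    return (0.0, 0.0)


start = time.monotonic()
coords = [fake_geocode(a) for a in ["a", "b", "c"]]
elapsed = time.monotonic() - start
```

With the question's DataFrame this could be applied as `df['Address'].apply(throttle(1)(geolocator.geocode))`, assuming `geolocator` is the geopy geocoder from the question. Recent geopy versions also document a `RateLimiter` helper in `geopy.extra.rate_limiter` (with a `min_delay_seconds` argument) that may be worth checking before rolling your own.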
ASGM
  • Why not put the `sleep` inside the function that is passed to `apply`? – rafaelc Apr 09 '18 at 16:50
  • Could you take a look at my recent edit and let me know if I am on the right track? – Michael Melillo Apr 09 '18 at 17:30
  • @MichaelMelillo you're still sub-setting the data before running the code, while the code should presumably work on the full dataframe (also the indents are wrong, but I imagine that's just a posting error). – ASGM Apr 09 '18 at 22:56
  • Thanks. The subset is because there are actually 2 limitations, one is per second and another is per day. So this code solves one problem, but I will have to determine another solution to allow for the full data set to be completed. But using your solution, I was able to geocode a much larger population. Thank you for the answer. – Michael Melillo Apr 10 '18 at 02:55
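
For the per-day limit mentioned in the last comment, one possible approach is to geocode only the addresses that have no result yet, stop once the day's quota is used up, and rerun the next day. A rough sketch under those assumptions follows. `geocode_pending`, `daily_quota`, and the `results` dict are hypothetical names, and saving `results` between runs (for example to a file) is left out.

```python
import time


def geocode_pending(addresses, results, geocode, daily_quota, delay=1.0,
                    sleep=time.sleep):
    """Geocode addresses missing from results, making at most daily_quota calls."""
    calls = 0
    for address in addresses:
        if address in results:
            continue  # already geocoded on an earlier run
        if calls >= daily_quota:
            break  # quota used up; rerun later to continue
        if calls > 0:
            sleep(delay)  # per-second limit: wait between consecutive requests
        results[address] = geocode(address)
        calls += 1
    return calls


# Demo with a stand-in geocoder and no real waiting.
results = {}
addresses = ["A St", "B St", "C St", "D St", "E St"]
fake = lambda addr: (len(addr), 0.0)
no_wait = lambda seconds: None
first_run = geocode_pending(addresses, results, fake, daily_quota=3, sleep=no_wait)
second_run = geocode_pending(addresses, results, fake, daily_quota=3, sleep=no_wait)
```

In this demo the first run handles three addresses and the second run picks up the remaining two, since anything already in `results` is skipped.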

I had a similar problem, but the accepted answer wouldn't work for me, because I wanted to make a batch of calls and then wait, rather than wait a second between individual calls. Say you want to send no more than 5 e-mails per minute. With the accepted answer's approach, you would divide 60 seconds by the number of calls allowed per minute and sleep for that many seconds between calls. In this case, your function would sleep for 12 seconds each time (60/5 = 12).

For my purposes (and possibly others'), I wanted to send all 5 requests and then wait a minute. In this case, you can follow your executable code with an if statement that checks whether the row's index modulo your interval is zero. For this to work, the index may need to start at 1, if it doesn't already: df.index = range(1, len(df) + 1)

def f(row, interval, time_delay):
    # execute your code
    if row.name % interval == 0:  # row.name is the row's index label
        sleep(time_delay)
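
A minimal, self-contained sketch of the same batch-then-wait idea, counting calls with `enumerate` so it doesn't depend on how the index is numbered. `run_in_batches`, the injected `sleep` parameter, and the doubling lambda are illustrative names only; the lambda stands in for whatever call is being rate limited.

```python
from time import sleep as real_sleep


def run_in_batches(items, func, interval, time_delay, sleep=real_sleep):
    """Call func on every item, pausing time_delay seconds after each full batch."""
    results = []
    for count, item in enumerate(items, start=1):
        results.append(func(item))
        # Pause after every `interval` calls, but not after the very last item.
        if count % interval == 0 and count < len(items):
            sleep(time_delay)
    return results


# Demo with a recording fake sleep so the example finishes instantly.
pauses = []
sent = run_in_batches(list(range(12)), lambda x: x * 2, interval=5,
                      time_delay=60, sleep=pauses.append)
```

For 12 items and an interval of 5, the function pauses after the 5th and 10th calls. With a DataFrame, the items could be a column converted to a list, for example `df['Address'].tolist()`.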
Victor Vulovic