2

This is part 2 of a problem I am trying to solve: geocoding a DataFrame of addresses while staying within the limits of the API provider. Yesterday I got help with the requests-per-second limitation (Is there a slower or more controlled alternative to .apply()?), but now I need to solve for the daily limit. My DataFrame has roughly 25K rows and the daily limit is 2,500, so I need to split it into roughly 10 chunks. Since I use up some of the daily requests on debugging and development, I think it's safer to split into chunks of 2K. Here is what I have so far:

from time import sleep

import numpy as np

# geolocator is set up as in the previous question

def f(x, delay=5):
    # Function for .apply() with speed limiter
    sleep(delay)
    return geolocator.geocode(x)

for g, df in df_clean.groupby(np.arange(len(df_clean)) // 2000):
    df['coord'] = df['Address'].apply(f).apply(lambda x: (x.latitude, x.longitude))
    sleep(untilNextDay)  # placeholder: seconds until the daily limit resets

What I don't know how to do is stitch those chunked DataFrames back together. I could write them out to CSV, I guess, but I'm sure there has to be a better way.

Graipher
  • Storing it in CSVs might be better than storing it in memory overnight: that way, if your computer crashes, you don't lose a day's work. If not, you can use [pandas concat](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) to put the dataframes back together. – Seth Rothschild Apr 10 '18 at 14:39
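A minimal sketch of that crash-safe variant, reusing f, df_clean and the untilNextDay placeholder from the question; the chunk_*.csv file names are made up for illustration:

import glob
import pandas as pd

# geocode one chunk per day and write it to disk immediately,
# so a crash only costs the current chunk
for g, chunk in df_clean.groupby(np.arange(len(df_clean)) // 2000):
    chunk = chunk.copy()
    chunk['coord'] = chunk['Address'].apply(f).apply(lambda x: (x.latitude, x.longitude))
    chunk.to_csv('chunk_{:02d}.csv'.format(g), index=False)
    sleep(untilNextDay)

# once all the chunk files exist, stitch them back together
result = pd.concat((pd.read_csv(path) for path in sorted(glob.glob('chunk_*.csv'))),
                   ignore_index=True)

Note that a tuple column round-trips through CSV as a string, so separate latitude/longitude columns may be more convenient if you go this route.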

2 Answers

2

Add the dataframes to a list and concatenate them back later.

df_list = []

# inside the daily loop:
df_list.append(df['coord'])

# after the loop; the pieces keep df_clean's index, so they align on assignment:
df_clean['coord'] = pd.concat(df_list)
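
Putting that together with the loop from the question (just a sketch; f, geolocator and the untilNextDay placeholder are as defined there):

import pandas as pd

df_list = []
for g, chunk in df_clean.groupby(np.arange(len(df_clean)) // 2000):
    coords = chunk['Address'].apply(f).apply(lambda x: (x.latitude, x.longitude))
    df_list.append(coords)
    sleep(untilNextDay)

# the pieces keep df_clean's index, so the assignment lines everything up
df_clean['coord'] = pd.concat(df_list)
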
A.Kot
1

Something I've done to limit RAM utilization: create a dictionary of DataFrames. You could change the 500000 to 2500, maybe.

df_dict = {n: df.iloc[n:n+500000, :] 
           for n in range(0, len(df), 500000)}

Then just use whichever entry you'd like for processing, or iterate over all the keys with a 24-hour sleep in between, as sketched below.
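
A rough sketch of that "hit all the keys" variant, reusing f from the question; the fixed 24-hour sleep and the results list are just illustrative:

import pandas as pd
from time import sleep

results = []
for n in sorted(df_dict):
    chunk = df_dict[n].copy()
    chunk['coord'] = chunk['Address'].apply(f).apply(lambda x: (x.latitude, x.longitude))
    results.append(chunk)
    sleep(24 * 60 * 60)  # wait for the daily quota to reset

geocoded = pd.concat(results)
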

jrjames83