I realize this question might have been asked before, but I couldn't find a solution that works specifically for strings and is relatively simple.

I have a data frame with a zip code column, and I use a remote API to fetch details about each zip code. I'm trying to parallelize the fetching so that it runs in multiple threads.

Simple example:

import requests

def get_cities_by_zip_code(zip):
    resp = requests.post(geo_svc_url, json={'query': """query GetZipCodeInformation($zip: Float!) {
  zipCode(zip: $zip) {
    ....
  }
}""", 'variables': {'zip': zip}})

    return resp.json()['data']['zipCode']


def location_options(df):
  resp = get_cities_by_zip_code(df['Zip code'])

  if resp is not None:
    df['City'] = resp['preferredName']
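    # next() takes a default, so a missing 'city' entry falls back to 'n/a' instead of raising StopIteration
    df['Population'] = next((x for x in resp['places'] if x['type'] == 'city'), {'population': 'n/a'})['population']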

  return df

def make_df():
  # A function that generates the initial dataframe
  ...


df = make_df()

I then need to apply location_options to df in parallel. I tried a couple of approaches:

  1. Via multiprocessing
import numpy as np
import pandas as pd
from multiprocessing import Pool

num_partitions = 20  # number of partitions to split the dataframe into
num_cores = 8  # number of cores on your machine

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df


df = parallelize_dataframe(df, location_options)

It doesn't work (partial stack trace):

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
TypeError: Object of type Series is not JSON serializable
  2. swifter: doesn't work with strings for some reason; it runs, but only on one thread.

Whereas a simple

df = df.apply(location_options, axis=1)

works just fine, but it's single-threaded.

blits
  • Does this answer your question? [How do you parallelize apply() on Pandas Dataframes making use of all cores on one machine?](https://stackoverflow.com/questions/45545110/how-do-you-parallelize-apply-on-pandas-dataframes-making-use-of-all-cores-on-o) – Danila Ganchar Apr 16 '20 at 16:33
  • @DanilaGanchar it doesn't. I also tried the first solution from that question, but it didn't work for me (crashing) – blits Apr 16 '20 at 16:50

1 Answer

I may have found a solution from the related post.

This one worked for me, while others didn't. I also had to do this: https://github.com/darkskyapp/forecast-ruby/issues/13
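
For what it's worth, the TypeError in the question happens because pool.map hands location_options a whole chunk of the data frame, so df['Zip code'] is a Series rather than a single value, and requests cannot serialize it to JSON. Since the work is mostly waiting on HTTP responses, plain threads may well be enough. Below is a rough sketch of that idea, not the code from the linked post: fetch_zip_info is a hypothetical stand-in for get_cities_by_zip_code that returns canned data instead of calling the service, and to_columns and fetch_all are names made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_zip_info(zip_code):
    """Hypothetical stand-in for get_cities_by_zip_code; no network call is made."""
    fake_service = {
        '10001': {'preferredName': 'New York',
                  'places': [{'type': 'city', 'population': 100}]},
        '94105': {'preferredName': 'San Francisco', 'places': []},
    }
    return fake_service.get(zip_code)


def to_columns(resp):
    """Turn one response into the two values the question stores per row."""
    if resp is None:
        return {'City': None, 'Population': None}
    # next() with a default avoids StopIteration when there is no 'city' entry.
    city = next((p for p in resp['places'] if p['type'] == 'city'),
                {'population': 'n/a'})
    return {'City': resp['preferredName'], 'Population': city['population']}


def fetch_all(zip_codes, max_workers=8):
    """Look up each distinct zip code once, in worker threads."""
    unique = list(dict.fromkeys(zip_codes))  # de-duplicate, keep order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map passes ONE zip code per call and preserves input order.
        responses = list(pool.map(fetch_zip_info, unique))
    return {z: to_columns(r) for z, r in zip(unique, responses)}


lookup = fetch_all(['10001', '94105', '10001', '00000'])
```

With a data frame, the results could then be attached with something like df['City'] = df['Zip code'].map(lambda z: lookup[z]['City']), and the same for Population. Looking up each distinct zip code only once also cuts down the number of requests when the column has repeats.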

blits