I realize this question might've been asked before, but I didn't find a solution that works specifically for strings and is relatively simple.
I have a data frame with a zip code column, and I use a remote API to fetch details about each zip code. I'm trying to parallelize the data fetching so that it runs in multiple threads.
Simple example:
import requests

def get_cities_by_zip_code(zip):
    # Query the geo service for one zip code (geo_svc_url is defined elsewhere)
    resp = requests.post(geo_svc_url, json={'query': """query GetZipCodeInformation($zip: Float!) {
        zipCode(zip: $zip) {
            ....
        }
    }""", 'variables': {'zip': zip}})
    return resp.json()['data']['zipCode']

def location_options(df):
    # With apply(axis=1), df is a single row here
    resp = get_cities_by_zip_code(df['Zip code'])
    if resp is not None:
        df['City'] = resp['preferredName']
        # Fall back to 'n/a' when no place of type 'city' is returned
        df['Population'] = next((x for x in resp['places'] if x['type'] == 'city'), {'population': 'n/a'})['population']
    return df

def make_df():
    # A function that generates the initial dataframe
    ...

df = make_df()
Then I have to apply location_options to df in parallel. I tried a couple of solutions to achieve that. For example:
- Via multiprocessing:
import numpy as np
import pandas as pd
from multiprocessing import Pool

num_partitions = 20  # number of partitions to split dataframe
num_cores = 8  # number of cores on your machine

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

df = parallelize_dataframe(df, location_options)
It doesn't work (partial stack trace):
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
TypeError: Object of type Series is not JSON serializable
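A likely reading of this error, offered as a sketch rather than a confirmed diagnosis: pool.map hands each worker a whole DataFrame chunk, so inside location_options the expression df['Zip code'] is a Series instead of a single value, and the JSON encoder rejects it. The example below reproduces that without any network call. The names lookup_stub, per_row and per_chunk are hypothetical stand-ins, and the sample zip codes are made up.

```python
import json
import pandas as pd

def lookup_stub(zip_code):
    # Hypothetical stand-in for get_cities_by_zip_code: it JSON-encodes its
    # argument (as a json= request body would be) but makes no network call.
    json.dumps({'variables': {'zip': zip_code}})
    return {'preferredName': f'City {zip_code}'}

def per_row(row):
    # Same shape as location_options: expects ONE row (a Series of scalars).
    row['City'] = lookup_stub(row['Zip code'])['preferredName']
    return row

def per_chunk(chunk):
    # Wrapper suited to pool.map: applies per_row to every row of a chunk.
    return chunk.apply(per_row, axis=1)

df = pd.DataFrame({'Zip code': ['10001', '94105', '60601']})
chunk = df.iloc[:2]  # each element passed by pool.map is a DataFrame like this

try:
    per_row(chunk)  # chunk['Zip code'] is a Series, so json.dumps fails
    error = None
except TypeError as exc:
    error = str(exc)

result = per_chunk(chunk)  # row-wise apply inside the chunk succeeds
```

Under this reading, passing a chunk-level wrapper such as per_chunk to pool.map instead of the row-level function should avoid the TypeError. The wrapper has to be a module-level function so that multiprocessing can pickle it.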
- swifter doesn't work with strings for some reason: it runs, but only in one thread.
Whereas a simple
df = df.apply(location_options, axis=1)
works just fine, but it's single-threaded.
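Since the work here is waiting on HTTP responses rather than computing, a thread pool from the standard library may be enough. Below is a minimal sketch of that idea, not a tested solution for this API. The helper fetch_all and the stub fake_fetch are hypothetical names, and the zip codes are made up; the stub replaces get_cities_by_zip_code so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(zip_codes, fetch, max_workers=8):
    """Call fetch(zip_code) once per distinct zip code using a pool of threads."""
    unique = list(dict.fromkeys(zip_codes))  # de-duplicate, keeping first-seen order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so they line up with `unique`
        results = list(pool.map(fetch, unique))
    return dict(zip(unique, results))

def fake_fetch(zip_code):
    # Hypothetical stand-in for get_cities_by_zip_code; no network access.
    return {'preferredName': f'City {zip_code}'}

details = fetch_all(['10001', '94105', '10001'], fake_fetch)
```

With the real function, something like details = fetch_all(df['Zip code'], get_cities_by_zip_code) followed by a df['Zip code'].map(...) over the resulting dictionary could fill the City and Population columns. This assumes the zip codes are plain Python strings and that the service tolerates several concurrent requests.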