I want to make ~5000 API calls, each of which returns JSON data, and write the results to a file. I'm expecting the combined responses to be quite large, so I want to avoid holding them all in memory at once. I'm attempting to use a multiprocessing.Pool
to distribute the tasks across 8 processes (the number of available CPU cores). I'm expecting this to take 20+ hours to complete, since each worker will be fetching paginated responses starting from these URLs.
The 8 workers spin up and retrieve data for 2-3 hours (each producing logs under a different PID), and then all of the worker processes somehow move to a "sleeping" status. No more logs are produced and no exceptions are caught.
I'm using imap_unordered to process the responses as they become available and writing the data to a file. Why are the processes sleeping after a certain number of tasks? Is there an optimization I can make to the following code to ensure each task completes? Note:
from multiprocessing import Pool

urls_to_query = [...]  # list of 5000 strings

def call_api(url):
    response_json = call_some_api(url)
    return response_json

with Pool() as pool:  # 8 CPU cores available
    total = len(urls_to_query)
    # iterates results as tasks are completed, in the order they are completed
    for i, result in enumerate(pool.imap_unordered(call_api, urls_to_query)):
        print(f"Response {i + 1}/{total} - {result}")
        if isinstance(result, Exception):
            handle_failure(result)
        else:
            write_to_file(result)
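For context, call_some_api is essentially a pagination loop and write_to_file appends each response as one line of JSON, roughly along these lines (a simplified sketch only; requests, the "results"/"next" field names, the timeout, and the output path are placeholders, not the real code):

import json
import requests

OUTPUT_PATH = "responses.jsonl"  # placeholder output path

def call_some_api(url):
    # Follow pagination links starting from the given URL and collect
    # every page's records into one list for this task.
    records = []
    while url:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        records.extend(payload["results"])  # placeholder field name
        url = payload.get("next")           # placeholder next-page field
    return records

def write_to_file(result):
    # Append one JSON line per completed task so the results never have
    # to be held in memory all at once.
    with open(OUTPUT_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")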
If there's a much more efficient way to accomplish this task, I'm more than happy to try that out instead.
I've attempted to use other tools in the multiprocessing module, but no luck thus far.
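For example, one variant submitted everything with apply_async and drained the results afterwards (a rough sketch, not the exact code; it stalled the same way):

with Pool() as pool:
    # Submit every URL up front, then block on each task in submission order.
    pending = [pool.apply_async(call_api, (url,)) for url in urls_to_query]
    for task in pending:
        result = task.get()  # blocks until this particular task finishes
        write_to_file(result)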
UPDATE:
Each worker is now logging the available memory, and I don't see any memory concerns: each worker is hovering between 60-70% memory available.
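For reference, the per-worker memory check is just a periodic log line along these lines (simplified sketch; psutil and the logging setup are placeholders):

import logging
import os

import psutil

logging.basicConfig(level=logging.INFO)

def log_memory():
    # Log the percentage of system memory still available, tagged with this worker's PID.
    mem = psutil.virtual_memory()
    logging.info("pid=%s available memory: %.0f%%", os.getpid(), 100.0 - mem.percent)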