
I want to invoke a list of ~5000 API calls that return JSON data and write the results to a file. I'm expecting the total response of all the API calls to be quite large, so I want to avoid reading them into memory all at once. I'm using a multiprocessing.Pool to distribute the tasks across 8 processes (the number of available CPU cores). I expect this to take 20+ hours to complete, since each worker fetches paginated responses starting from these URLs.

The 8 workers spin up and retrieve data for 2-3 hours (each producing logs from a different PID), and then at some point all of the workers/processes move to the "sleeping" status. No more logs are produced and no exceptions are caught.

I'm using imap_unordered to process the responses as they become available and write the data to a file. Why are the processes sleeping after a set number of tasks? Is there an optimization I can make to the following code to ensure each task is completed?


from multiprocessing import Pool

urls_to_query = [...]  # list of ~5000 strings

def call_api(url):
    try:
        return call_some_api(url)
    except Exception as exc:
        # Return the exception instead of raising, so one failed task
        # doesn't abort the whole imap_unordered loop in the parent
        return exc

with Pool() as pool:  # 8 CPU cores available
    total = len(urls_to_query)
    # yields results as tasks complete, in whatever order they finish
    for i, result in enumerate(pool.imap_unordered(call_api, urls_to_query)):
        print(f"Response {i + 1}/{total} - {result}")

        if isinstance(result, Exception):
            handle_failure(result)
        else:
            write_to_file(result)

If there's a much more optimized way to accomplish this task, I'm more than happy to try that out instead.
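One thing I'm considering is pulling results with an explicit per-result timeout, so a single stuck task can't hang the loop forever. This is only an untested sketch: the 600-second window is an arbitrary value I picked, and it relies on the next(timeout) method of the iterator that imap_unordered returns when chunksize is 1 (next(timeout) raises multiprocessing.TimeoutError if no result arrives in time).

from multiprocessing import Pool, TimeoutError

with Pool() as pool:
    it = pool.imap_unordered(call_api, urls_to_query)  # chunksize defaults to 1
    done = 0
    while True:
        try:
            # Blocks up to 600 seconds for the next completed task
            result = it.next(timeout=600)
        except TimeoutError:
            print("No result in 10 minutes - a worker may be stuck")
            continue
        except StopIteration:
            break  # all tasks consumed
        done += 1
        print(f"Response {done}/{len(urls_to_query)} - {result}")
        if isinstance(result, Exception):
            handle_failure(result)
        else:
            write_to_file(result)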

I've also attempted to use other facilities in the multiprocessing module, but with no luck thus far.
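For example, here's roughly the shape of a Queue-based variant I tried (a sketch from memory, not my exact code; the sentinel handling and worker count are placeholders, and call_some_api/handle_failure/write_to_file are the same helpers as above):

from multiprocessing import Process, Queue

NUM_WORKERS = 8
SENTINEL = None  # placeholder end-of-work marker

def queue_worker(task_queue, result_queue):
    # Pull URLs until the sentinel arrives; push each result (or the
    # exception itself) back so the parent can log progress per task
    while True:
        url = task_queue.get()
        if url is SENTINEL:
            break
        try:
            result_queue.put(call_some_api(url))
        except Exception as exc:
            result_queue.put(exc)

task_queue, result_queue = Queue(), Queue()
workers = [Process(target=queue_worker, args=(task_queue, result_queue))
           for _ in range(NUM_WORKERS)]
for p in workers:
    p.start()
for url in urls_to_query:
    task_queue.put(url)
for _ in range(NUM_WORKERS):
    task_queue.put(SENTINEL)

# Drain all results before joining, so workers aren't blocked on a full pipe
for _ in range(len(urls_to_query)):
    result = result_queue.get()
    if isinstance(result, Exception):
        handle_failure(result)
    else:
        write_to_file(result)

for p in workers:
    p.join()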

UPDATE:

Each worker is now logging the available memory, and I do not see any memory concerns: each worker is hovering between 60-70% memory available.
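A minimal sketch of the per-worker logging, assuming psutil (illustration only; the exact call I use differs):

import os
import psutil  # assumed here purely for illustration

def log_memory():
    # virtual_memory().available is the RAM usable without swapping
    vm = psutil.virtual_memory()
    print(f"[pid {os.getpid()}] memory available: {vm.available / vm.total:.0%}")

# log_memory() is called at the top of call_api, so every worker reports
# its view of system memory before each request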

  • If you are on Linux you can use `top` or `htop` to see if the processes are still running. Also, have you tried using Queues instead of Pools? Check this question https://stackoverflow.com/questions/32065465/python-multiprocessing-processes-sleep-after-a-while – ccolin Feb 04 '23 at 04:53
  • Maybe you are running out of memory, in which case, make sure the call to `write_to_file` closes the file after any write operations. Have you logged or tracked the memory metrics in your system when running this, to see if memory use grows while the program is executing? – ccolin Feb 04 '23 at 04:55
  • @ccolin yes, I'm monitoring `top` as I'm running this. I see the processes running for a couple of hours, then they all go to a sleeping state with 0% CPU usage. – Karanvir Ghumaan Feb 04 '23 at 06:38
  • @KaranvirGhumaan How did you solve it? – chaostheory Jul 07 '23 at 21:01

0 Answers