I have a script that loops over an array of numbers; each number is passed to a function that calls an API. The API returns JSON data, which is then written to a CSV.

for label_number in label_array:
    call_api(domain, api_call_1, api_call_2, label_number, api_key)

The list can be up to 7000 elements long, and since the API takes a few seconds to respond, the entire script can take hours to run. Multiprocessing seems the way to go with this, but I can't quite work out how to do it with the above loop. The documentation I am looking at is

https://docs.python.org/3.5/library/multiprocessing.html

I found a similar article at

Python Multiprocessing a for loop

But adapting it doesn't seem to work; I think I am buggering it up when it comes to passing all the variables into the function.

Any help would be appreciated.

  • I am using "17.2.1.1. The Process class" and it seems to go through the loop correctly, but I am getting ValueError: I/O operation on closed file. So it seems like the writer is closing the file. – LOFast Oct 16 '15 at 05:04
  • Ok, I seem to have it working (not writing to CSV), but it still seems quite slow. Possibly this isn't the right tool to use. – LOFast Oct 16 '15 at 05:10
  • Open a `multiprocessing.Pool`, then `.map` it. Can't be easier than that (see the sketch just below these comments). – JBernardo Oct 16 '15 at 05:12
  • Post the multiprocessing version that is failing. Is call_api doing the file I/O? Code that accesses external resources such as file systems can be difficult to parallelize. – tdelaney Oct 16 '15 at 05:35
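
JBernardo's `Pool`/`.map` suggestion might look roughly like this. This is only a sketch, assuming `call_api` and its arguments are the ones from the question, and the worker count of 8 is an arbitrary starting point. Doing the CSV writing once in the parent process also sidesteps the closed-file error mentioned in the comments:

from multiprocessing import Pool

def fetch_one(label_number):
    # Bind the fixed arguments from the question; only label_number varies.
    return call_api(domain, api_call_1, api_call_2, label_number, api_key)

if __name__ == "__main__":
    with Pool(processes=8) as pool:  # tune the worker count to the API
        results = pool.map(fetch_one, label_array)
    # Write results to the CSV here, in the parent, so no worker ever
    # touches a file handle owned by another process.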

1 Answer

Multiprocessing could help, but this sounds more like a threading problem: the work is I/O-bound, so it benefits from being made asynchronous, which is exactly what threading gives you. Better still, from Python 3.4 onwards you can use asyncio. https://docs.python.org/3.4/library/asyncio.html
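
A minimal threading sketch, again assuming `call_api` and its arguments from the question; concurrent.futures wraps the thread management for you:

from concurrent.futures import ThreadPoolExecutor

def fetch_one(label_number):
    # The call blocks on the network, so the GIL is released while
    # waiting and other threads can issue their own requests.
    return call_api(domain, api_call_1, api_call_2, label_number, api_key)

# 20 threads is an arbitrary starting point; tune it to the API's limits.
with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(fetch_one, label_array))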

If you have python3.5, this will be useful: https://docs.python.org/3.5/library/asyncio-task.html#example-hello-world-coroutine
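
A rough asyncio version in Python 3.5 syntax (still assuming `call_api` from the question): the blocking calls are handed to the event loop's default thread-pool executor, so many of them can be waiting at once:

import asyncio
from functools import partial

async def fetch_all(loop):
    # Each blocking call_api call runs in the default executor;
    # gather collects the results in label_array order.
    tasks = [loop.run_in_executor(
                 None,
                 partial(call_api, domain, api_call_1, api_call_2,
                         label_number, api_key))
             for label_number in label_array]
    return await asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
results = loop.run_until_complete(fetch_all(loop))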

You can mix asyncio with multiprocessing to get the best of both. In addition, I use joblib. The version below shards label_array so that each worker handles every n-th element:

import multiprocessing
from joblib import Parallel, delayed

def parallel_process(worker_id, num_workers):
    # Shard the array: worker k handles every num_workers-th element,
    # starting at offset k, so each label_number is processed exactly once.
    for index, label_number in enumerate(label_array):
        if index % num_workers == worker_id:
            # call_api (or an asyncio-driven variant of it) from the question
            call_api(domain, api_call_1, api_call_2, label_number, api_key)

if __name__ == "__main__":
    num_cores_to_use = multiprocessing.cpu_count()
    Parallel(n_jobs=num_cores_to_use)(
        delayed(parallel_process)(i, num_cores_to_use)
        for i in range(num_cores_to_use))
lingxiao
  • Agreed. The API query will block your thread anyway, so other userspace threads can pop in and do work in the meantime. – Dacav Oct 16 '15 at 08:24