Multiprocessing a function execution with Python

Question

I have a pandas DataFrame of approximately 1M rows containing information entered by users. I wrote a function that validates whether the number entered by each user is correct. I am trying to run this function on multiple processors so that the heavy computation is not confined to a single one. To do this, I split the DataFrame into chunks of 50K rows each and used Python's `multiprocessing` module to process each chunk separately. The problem is that only the first process starts, and it still uses a single processor instead of distributing the load across all of them. Here is the code I wrote:

 import multiprocessing

 pool = multiprocessing.Pool(processes=16)
 # only the first two 50K-row chunks are submitted in this snippet
 r7 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[0], fields, dictionary))
 r8 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[1], fields, dictionary))
 print(r7.get())  # blocks until the first chunk has been processed
 print(r8.get())
 pool.close()
 pool.join()

I have attached a screenshot that shows the CPU usage while the above code is running.

Any advice on how I can overcome this issue?

You could use `Pool.map()`. That would distribute the items of the iterable across the worker processes in the pool. — 2pichar, Feb 07 '22 at 17:00
Just to clarify my understanding, are you saying that *has_phone_num_list* has 20 elements, i.e., 20 * 50_000 == 1_000_000? — DarkKnight, Feb 07 '22 at 17:12
It looks like most of the work is still done in the main process (only one CPU core is loaded). Sending large amounts of data (arguments and return values) to each child process is rather inefficient, and that transfer is handled by the main process. — Aaron, Feb 07 '22 at 21:16
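
One possible way to reduce the transfer cost raised in the last comment is to send the arguments that never change to each worker only once, through the pool's `initializer`, and to return small results. The following is a minimal sketch under assumptions: `validate_chunk` is a hypothetical stand-in for `validate.validate_phone_number`, and the shapes of `fields`, `dictionary` and the chunks are invented for illustration, since the question does not show them.

```python
import multiprocessing

# Per-worker globals, filled in once by the initializer below.
_fields = None
_dictionary = None


def init_worker(fields, dictionary):
    """Store the shared, read-only arguments once in each worker process."""
    global _fields, _dictionary
    _fields = fields
    _dictionary = dictionary


def validate_chunk(chunk):
    """Hypothetical validator: one boolean per row.

    A row passes if every checked field is all digits and has the length
    listed for it in the (assumed) dictionary of rules.
    """
    return [
        all(
            str(row[name]).isdigit() and len(str(row[name])) == _dictionary[name]
            for name in _fields
        )
        for row in chunk
    ]


def main():
    fields = ["phone"]            # assumed: names of the columns to check
    dictionary = {"phone": 10}    # assumed: required length per column
    chunks = [
        [{"phone": "0123456789"}, {"phone": "12ab"}],
        [{"phone": "9876543210"}],
    ]
    # Only the chunk itself is sent with each task; fields and dictionary
    # are handed to every worker once when it starts.
    with multiprocessing.Pool(
        processes=2, initializer=init_worker, initargs=(fields, dictionary)
    ) as pool:
        for flags in pool.imap(validate_chunk, chunks):
            print(flags)


if __name__ == "__main__":
    main()
```

Whether this helps depends on where the time actually goes. If the chunks themselves dominate the transfer, returning only compact results (such as the booleans here) rather than whole DataFrames is likely to matter more.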

score 0 · Answer 1 · answered Feb 07 '22 at 17:20

I suggest you try this:

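from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

with ProcessPoolExecutor() as executor:
    # map() takes one item from each iterable per call, so repeat() supplies
    # the same fields and dictionary alongside every chunk
    results = executor.map(validate.validate_phone_number,
                           has_phone_num_list, repeat(fields), repeat(dictionary))
    for result in results:
        pass  # process results here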

Constructing the ProcessPoolExecutor with no parameters lets it choose the number of worker processes, which by default matches the number of CPUs the machine reports, so all of them can be kept busy. This is a portable approach because the code makes no explicit assumption about the number of CPUs available. You could, of course, construct it with max_workers=N, where N is a low number, to limit how many CPUs are used concurrently. You might do that if you're not too concerned about how long the overall job takes.
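
To illustrate the choice of worker count, here is a minimal sketch. The `count_valid` function and the sample chunks are made up for illustration and stand in for the real validation function and data. The `if __name__ == "__main__":` guard is there because platforms that start workers by spawning a fresh interpreter re-import the main module.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def count_valid(chunk):
    """Hypothetical stand-in for the real validation: count 10-digit numbers."""
    return sum(1 for number in chunk if number.isdigit() and len(number) == 10)


def run(chunks, max_workers=None):
    """Process all chunks in parallel; max_workers=None lets the executor decide."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(count_valid, chunks))


if __name__ == "__main__":
    chunks = [["0123456789", "abc"], ["9876543210", "5555555555"]]
    print("CPUs reported:", os.cpu_count())
    print(run(chunks))                 # default worker count
    print(run(chunks, max_workers=2))  # at most two worker processes
```

Both calls should return the same results; only the number of processes working on the chunks at the same time differs.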

score 0 · Answer 2 · answered Feb 07 '22 at 17:20

As suggested in this answer, you can use pandarallel to run Pandas' apply function in parallel. Unfortunately, as I cannot run your code, I am not able to find the problem. Did you try using fewer processes (for example 8 instead of 16)?

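Note that parallelization does not always give a speedup, for example when the cost of sending data to the worker processes outweighs the computation itself.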
