I have a pandas DataFrame of approximately 1M rows containing information entered by users. I wrote a function that validates whether the phone number entered by the user is correct. What I'm trying to do is run that function on multiple processors, to avoid doing all of the heavy computation on a single one. I split the DataFrame into chunks of 50K rows each and then used Python's multiprocessing module to process each chunk separately. The issue is that only the first process starts, and it still uses one processor instead of distributing the load across all of them. Here is the code I wrote:

 pool = multiprocessing.Pool(processes=16)
 r7 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[0],fields ,dictionary))
 r8 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[1],fields ,dictionary))
 print(r7.get())
 print(r8.get())
 pool.close()
 pool.join()

I have attached a screenshot that shows the CPU usage when executing the above code.

Any advice on how I can overcome this issue?

2 Answers

I suggest you try this:

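from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

with ProcessPoolExecutor() as executor:
    # map() takes one iterable per positional argument, so the two shared
    # arguments are repeated alongside every chunk
    results = executor.map(validate.validate_phone_number,
                           has_phone_num_list, repeat(fields), repeat(dictionary))
    for result in results:
        pass  # process results here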

Constructing ProcessPoolExecutor with no arguments sizes the pool to the number of CPUs the machine reports, so all of them can be kept busy as long as there are at least as many chunks as workers. This is a portable approach because there's no explicit assumption about the number of CPUs available. You could, of course, construct it with max_workers=N, where N is a small number, to limit how many CPUs are used concurrently. You might do that if you're not too concerned about how long the overall run is going to take.
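To make the pattern concrete, here is a minimal, self-contained sketch of the same idea using only the standard library. The names split_into_chunks, validate_chunk and validate_in_parallel, and the "at least min_digits digits" rule, are made up for illustration; they stand in for your own chunking and for validate.validate_phone_number, whose real logic I can't see. Plain lists are used instead of DataFrame chunks so the example runs anywhere.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat


def split_into_chunks(rows, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def validate_chunk(chunk, min_digits):
    """Hypothetical validator: True where an entry has at least min_digits digits."""
    return [sum(ch.isdigit() for ch in number) >= min_digits for number in chunk]


def validate_in_parallel(rows, chunk_size, min_digits, max_workers=None):
    """Validate each chunk in a worker process and flatten the results in order."""
    chunks = split_into_chunks(rows, chunk_size)
    # max_workers=None lets the executor pick its default pool size
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # One task per chunk; repeat() supplies the shared argument to each call
        per_chunk = executor.map(validate_chunk, chunks, repeat(min_digits))
        return [flag for chunk_flags in per_chunk for flag in chunk_flags]


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this module
    numbers = ["+1 202 555 0143", "12345", "020 7946 0958", "n/a"]
    print(validate_in_parallel(numbers, chunk_size=2, min_digits=7, max_workers=2))
```

The worker function has to live at module level so it can be pickled and sent to the worker processes. Note that every chunk is submitted in one go; if only one or two chunks are ever submitted, at most that many processes will have work to do, whatever the pool size.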

DarkKnight

As suggested in this answer, you can use pandarallel to run pandas' apply function in parallel. Unfortunately, as I cannot run your code, I am not able to find the problem. Did you try using fewer processes (for example 8 instead of 16)?

Note that in some cases the parallelization doesn't work.

Flavio