I have a large pandas DataFrame named "mj_flt". I want to compute a result for each row from some of its columns and append that result to an initially empty list. Because the DataFrame is too large to handle at once, I use a for loop to process it in batches. The code I am trying to parallelise is the following:

import numpy as np

# Batch boundaries: 23 batches of 300,000 rows each
start = np.arange(0, 6900000, 300000)
end = np.arange(300000, 7200000, 300000)
tim = []
for i, j in zip(start, end):
    for index, row in mj_flt[i:j].iterrows():
        ## do some stuff with row['a'], row['b'], row['c'], row['d']
        ## get a result based on the operation
        tim.append(result)

How can I use the multiprocessing module and its Pool class to parallelise this nested for loop?

Thx a lot!

Tommy Lee

1 Answer

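A few things need to be in place for this to work as expected. A thread pool is a set of idle threads, each waiting for a function and a parameter to execute. It usually also has a waiting list (a queue) of N elements (adjustable) that holds the upcoming work. For the task you're doing, use as many threads as your processor has cores; more would not speed up the job. (In standard CPython, CPU-bound work only runs in parallel across processes, so multiprocessing.Pool is the matching tool; everything below applies equally with worker processes in place of threads.)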

Now to the code: you'll need a function that takes a single parameter containing all the data the function needs to do its work. Depending on how you manipulate shared data, you may also need some form of locking, such as mutex locks or semaphores.
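As a rough sketch of such a function: the worker below takes one batch as its single parameter and returns its results instead of appending to a shared list, which sidesteps locking entirely. The name process_batch and the arithmetic on 'a' to 'd' are placeholders, not the asker's actual operation. Rows are plain dicts here so the snippet runs on its own; with pandas, something like mj_flt.iloc[i:j].to_dict("records") would produce the same shape.

```python
def process_batch(rows):
    """Process one batch and return its results as a new list.

    `rows` is any iterable of mappings with keys 'a', 'b', 'c', 'd'.
    Nothing outside the function is modified, so no lock is needed.
    """
    results = []
    for row in rows:
        # Placeholder for "do some stuff": replace with the real operation.
        result = row["a"] + row["b"] + row["c"] + row["d"]
        results.append(result)
    return results


# Tiny hand-made batch standing in for a slice of the DataFrame.
sample = [
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 10, "b": 20, "c": 30, "d": 40},
]
print(process_batch(sample))
```

Returning a fresh list per batch means the caller can simply concatenate the per-batch lists once all workers are done.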

Before entering your for loop, allocate the thread pool with cpu_cores threads and a waiting list as long as the maximum number of function calls you want to pass to it. Alternatively, the add_work_to_thread_pool call should block until threads finishing their jobs have made some room.

Inside the for loop, you add function(parameter) to the waiting list. The waiting list is consumed allocated_threads items at a time.

After the for loop, you have to wait until every thread is back in a waiting state and the waiting list is empty, to be sure all the work is done.
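The steps above can be sketched with multiprocessing.Pool, which is what the question asks about. This is a minimal sketch, not a drop-in solution: process_batch, make_batches, run_parallel and the toy data are hypothetical stand-ins, and the per-row operation is invented for illustration.

```python
from multiprocessing import Pool


def process_batch(rows):
    # Hypothetical per-row operation; stands in for the real work.
    return [row["a"] * row["b"] for row in rows]


def make_batches(rows, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def run_parallel(rows, batch_size, workers=None):
    batches = make_batches(rows, batch_size)
    # workers=None lets Pool choose one worker process per available CPU.
    with Pool(processes=workers) as pool:
        # map() queues every batch, blocks until all of them are done,
        # and returns the per-batch results in input order.
        per_batch = pool.map(process_batch, batches)
    # Flatten the list of lists into one result list (the original `tim`).
    return [result for batch in per_batch for result in batch]


if __name__ == "__main__":
    # Toy data standing in for mj_flt; the guard keeps worker processes
    # from re-running this section when they import the module.
    data = [{"a": n, "b": 2} for n in range(10)]
    tim = run_parallel(data, batch_size=3)
    print(tim)
```

Because pool.map only returns once every batch has been processed, it covers both the "add work inside the loop" and the "wait after the loop" steps in one call, and the results come back in batch order.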

With the help of the Python documentation on threads and queues and a bit of googling, I think you can now code it yourself.

Otherwise, feel free to ask for clarification on specific points, then come back with the code you tried that is not working as expected. I mean code that uses threads, not just the snippet you pasted.

Have a nice time, multitasking is fun :-)

Gull_Code