
I have a pandas DataFrame consisting of 1 million rows. The data is the transaction history of customers, so one customer may have multiple rows. For each customer I need to run a function, so I use apply with a lambda after a groupby on customer ID. How do I speed up the process with multithreading? My machine has 8 CPU cores and I wish to use all of them; currently I am only able to use one. Say those 1 million rows contain 100k unique customers: if I could process 12.5k customers on each CPU core, the whole thing would be 8 times faster!

Thanks!!

ardy

1 Answer


Take a look at the concurrent.futures module from the Python standard library. It is a wrapper around the threading and multiprocessing libraries with (in my opinion) a simpler interface. I would propose using multiple processes rather than multiple threads for the speed-up: because of the Global Interpreter Lock (GIL), CPU-bound Python code does not run in parallel across threads, only across processes.

After grouping the customers by ID, you can pass each customer's data to a function that runs in a worker process. You do this with submit(<function>, <args>) on an executor object, which schedules the call in a worker and returns a Future object. Calling .result() on the Future blocks until the function has finished in the worker and then returns its result.

Example:

import concurrent.futures

def fancy_function(data):
    # Do something with one customer's transactions
    return 42

futures = []
# let DF be the dataframe
with concurrent.futures.ProcessPoolExecutor() as executor:
    for name, group in DF.groupby('customerId'):
        # Each submit() returns a Future for one customer's computation
        futures.append(executor.submit(fancy_function, group))
    # .result() blocks until the corresponding worker task is done
    results = [future.result() for future in futures]


M. Sch.