multithreading for data from dataframe pandas

Question

I'm struggling to use multithreading to calculate the relatedness between customers who have different items in their shopping baskets. I have a pandas DataFrame of 1,000 customers, which means I have to calculate the relatedness 1 million times, and this takes too long to process.

An example of the data frame looks like this:

    ID   Item
    1    Banana
    1    Apple
    2    Orange
    2    Banana
    2    Tomato
    3    Apple
    3    Tomato
    3    Orange

Here is a simplified version of the code:

import pandas as pd

def relatedness(customer1, customer2):
    # do some calculations to measure the relation between the customers
    ...

data = pd.read_csv(data_file)  # data_file: path to the CSV shown above
customers_list = list(set(data['ID']))

# Pass the list itself (not a nested list) so the labels form a plain index
relatedness_matrix = pd.DataFrame(index=customers_list, columns=customers_list)
for i in customers_list:
    for j in customers_list:
        relatedness_matrix.loc[i, j] = relatedness(i, j)

It's kind of unclear what you're asking. Do you think multithreading will make it sufficiently faster that it won't "take too long" any more? How much faster do you need? – Warren Dew, May 19 '16 at 04:47
I'm not sure I have used the correct term. What I need is to run as many iterations of the for loop as possible at the same time in order to reduce the processing time. Thanks. – goodX, May 19 '16 at 11:22
You can get some pointers on Python threading here: http://stackoverflow.com/questions/2846653/how-to-use-threading-in-python. However, threading in Python doesn't generally improve efficiency for this kind of work because of the global interpreter lock. Your best bet for a speedup is to rewrite your time-consuming functions in C or C++ and compile them into a Python module, which will run much faster than native Python code. – Warren Dew, May 19 '16 at 13:58
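
As a rough sketch of the process-based route (not something posted in this thread): because of the global interpreter lock, CPU-bound Python code usually gains more from worker processes than from threads. The block below uses only the standard library's concurrent.futures. The BASKETS dictionary, the pairwise helper, and the Jaccard similarity used as the metric are all assumptions made for illustration, since the question does not say how relatedness is actually computed.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations_with_replacement

# Hypothetical baskets, mirroring the example table in the question.
BASKETS = {
    1: frozenset({"Banana", "Apple"}),
    2: frozenset({"Orange", "Banana", "Tomato"}),
    3: frozenset({"Apple", "Tomato", "Orange"}),
}


def relatedness(pair):
    """Placeholder metric (Jaccard similarity); the real one is not given."""
    a, b = pair
    items_a, items_b = BASKETS[a], BASKETS[b]
    return a, b, len(items_a & items_b) / len(items_a | items_b)


def pairwise(customers, workers=2):
    """Score each unordered pair once, spread across worker processes."""
    pairs = list(combinations_with_replacement(customers, 2))
    result = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches pairs so that inter-process overhead stays small
        for a, b, score in pool.map(relatedness, pairs, chunksize=256):
            # Mirror the score; this assumes the metric is symmetric
            result[(a, b)] = result[(b, a)] = score
    return result
```

Calling pairwise(sorted(BASKETS)) from under an `if __name__ == "__main__":` guard returns a dictionary keyed by customer pairs. If the metric is symmetric, scoring only unordered pairs roughly halves the work before any parallelism is applied.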

score 17 · Answer 1 · edited Apr 13 '17 at 12:50

I was looking into the same problem (heavy calculations on a pandas DataFrame) and found Dask: http://dask.pydata.org/en/latest/

(via this Data Science Stack Exchange question: https://datascience.stackexchange.com/questions/172/is-there-a-straightforward-way-to-run-pandas-dataframe-isin-in-parallel)

Hope this helps.

answered Aug 16 '16 at 05:29

GBrian

score 5 · Answer 2 · answered Nov 19 '19 at 06:24

Check out Modin: "Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical." https://modin.readthedocs.io/en/latest/

CyberPlayerOne

Numba, a JIT compiler, would be my go-to after trying Modin. You may need to convert to a NumPy array, which isn't too expensive. – Ali Pardhan Oct 23 '20 at 17:21
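
To illustrate the "convert to a NumPy array" step, here is a minimal sketch that computes every pairwise score at once with array operations, with no JIT and no explicit loop. It assumes relatedness means Jaccard similarity between baskets, which the question does not confirm; the data is the example table from the question, and the names table, basket, shared and union are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Example data from the question.
data = pd.DataFrame({
    "ID":   [1, 1, 2, 2, 2, 3, 3, 3],
    "Item": ["Banana", "Apple", "Orange", "Banana", "Tomato",
             "Apple", "Tomato", "Orange"],
})

# Customer-by-item indicator table: 1 if the item is in the basket, else 0.
table = pd.crosstab(data["ID"], data["Item"]).clip(upper=1)
basket = table.to_numpy(dtype=np.int64)

# shared[i, j] = number of items customers i and j have in common.
shared = basket @ basket.T
sizes = basket.sum(axis=1)
union = sizes[:, None] + sizes[None, :] - shared

# Assumed metric: Jaccard similarity, computed for all pairs in one step.
relatedness_matrix = pd.DataFrame(
    shared / union, index=table.index, columns=table.index
)
```

If the real metric can be written in terms of counts like these, this form may already be fast enough for 1,000 customers. A metric that needs per-pair Python logic is where Numba or separate processes become more relevant.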
