I'm struggling to use multithreading to calculate the relatedness between customers who have different items in their shopping baskets. I have a pandas DataFrame covering 1,000 customers, which means I have to calculate the relatedness 1 million times (once for every ordered pair of customers), and this takes too long to process.

An example of the data frame looks like this:

  ID     Item       
    1    Banana    
    1    Apple     
    2    Orange    
    2    Banana    
    2    Tomato    
    3    Apple     
    3    Tomato    
    3    Orange    

Here is a simplified version of the code:

import pandas as pd

def relatedness(customer1, customer2):
    # do some calculations to measure the relation between the customers
    pass  # placeholder: the actual calculation is omitted

# data_file is the path to the CSV file (defined elsewhere)
data = pd.read_csv(data_file)
customers_list = list(set(data['ID']))

# one row and one column per customer
relatedness_matrix = pd.DataFrame(index=customers_list, columns=customers_list)
for i in customers_list:
    for j in customers_list:
        relatedness_matrix.loc[i, j] = relatedness(i, j)
goodX
  • It's kind of unclear what you're asking. Do you think multithreading will make it sufficiently faster that it won't "take too long" any more? How much faster do you need? – Warren Dew May 19 '16 at 04:47
  • I'm not sure I used the correct term. What I need is to run as many iterations of the for loop as possible at the same time, in order to reduce the processing time. Thanks – goodX May 19 '16 at 11:22
  • You can get some pointers on Python threading here: http://stackoverflow.com/questions/2846653/how-to-use-threading-in-python. However, threading in Python generally doesn't speed up CPU-bound code because of the global interpreter lock. Your best bet for a speedup is to rewrite your time-consuming functions in C or C++ and compile them into a Python module, which will run much faster than native Python code. – Warren Dew May 19 '16 at 13:58
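
Since the global interpreter lock limits threads for this kind of work, one common workaround is to spread the rows of the matrix over worker processes instead. The following is only a minimal sketch of that idea, not the asker's actual code. It assumes the baskets have already been collected into a plain dict mapping customer ID to a set of items, and it uses Jaccard similarity purely as a stand-in for the unspecified relatedness calculation. The names `relatedness_row` and `relatedness_matrix` are hypothetical helpers introduced for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def relatedness(basket1, basket2):
    """Stand-in metric (an assumption): Jaccard similarity of two item sets."""
    union = basket1 | basket2
    return len(basket1 & basket2) / len(union) if union else 0.0


def relatedness_row(customer, baskets):
    """Compute one matrix row: `customer` compared against every customer."""
    own = baskets[customer]
    row = {other: relatedness(own, items) for other, items in baskets.items()}
    return customer, row


def relatedness_matrix(baskets, max_workers=None):
    """Fan the rows out over worker processes; returns a dict of dicts."""
    worker = partial(relatedness_row, baskets=baskets)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # Iterating a dict yields its keys, so each task handles one customer.
        return dict(pool.map(worker, baskets, chunksize=16))


if __name__ == "__main__":
    # The sample data from the question, as customer ID -> set of items.
    baskets = {
        1: {"Banana", "Apple"},
        2: {"Orange", "Banana", "Tomato"},
        3: {"Apple", "Tomato", "Orange"},
    }
    matrix = relatedness_matrix(baskets, max_workers=2)
    print(matrix[1][2])  # 1 shared item / 4 distinct items = 0.25
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the script. If the result is needed as a DataFrame, `pd.DataFrame.from_dict(matrix, orient="index")` should convert the nested dict. Whether processes actually help depends on how expensive each relatedness call is compared with the cost of shipping the data to the workers.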

2 Answers

I was looking into the same problem of running heavy calculations on a pandas DataFrame and found Dask:

http://dask.pydata.org/en/latest/

(from this Stack Exchange question: https://datascience.stackexchange.com/questions/172/is-there-a-straightforward-way-to-run-pandas-dataframe-isin-in-parallel)

Hope this helps

GBrian
Check out Modin: "Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical." https://modin.readthedocs.io/en/latest/

CyberPlayerOne
  • Numba, a JIT compiler, would be my go-to after trying Modin. You may need to convert the data to a NumPy array, which isn't too expensive. – Ali Pardhan Oct 23 '20 at 17:21