I want to update pandas

Hello, I want to compare the speed of single-core and multi-core calculations on a pandas DataFrame. Consider the following case: column 'c' in row i is the average of the values of 'a' from row i-9 to row i.

from multiprocessing import Process, Value, Array, Manager
import pandas as pd
import numpy as np
import time 

total_num = 1000
df = pd.DataFrame(np.arange(1,total_num*2+1).reshape(total_num,2),
              columns=['a','b'])
df['c']=0


df2 = pd.DataFrame(np.arange(1,total_num*2+1).reshape(total_num,2),
              columns=['a','b'])
df2['c']=0


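def Cal(start, end):
    # Row r of 'c' is the mean of 'a' over rows r-9..r (fewer rows near the top).
    for i in range(end-start-1):
        row = i+start
        if row < 10:
            df.loc[row,'c']=df.loc[:row,'a'].mean()
        else :
            df.loc[row,'c']=df.loc[row-9:row,'a'].mean()

def Cal2(my_df,start, end):
    # Same calculation on the DataFrame stored in the Manager namespace.
    # The callers pass half-open ranges, so every row in [start, end) is visited.
    for i in range(end-start):
        row = i+start
        if row < 10:
            my_df.df.loc[row,'c']=my_df.df.loc[:row,'a'].mean()
        else :
            my_df.df.loc[row,'c']=my_df.df.loc[row-9:row,'a'].mean()
    print(my_df)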

print('Single core : --->')
start_t = time.time()

Cal(0,total_num+1)

end_t = time.time()
print(end_t-start_t)

print('Multiprocess ---->')

if __name__=='__main__':
    num=len(df2)
    num_core=4 
    between=num//num_core

    mgr=Manager()
    ns = mgr.Namespace()
    ns.df=df2
    procs=[]

    start_t =time.time()

    for index in range(num_core):
        proc=Process(target=Cal2,args=(ns,index*between,(index+1)*between))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()

    end_t = time.time()
    print(end_t-start_t)

At first I realized that child processes do not share global variables with the parent, so I used a Manager. However, the 'c' column of df2 did not change.

How do I do what I want to do? :p

  • This does not look like a good idea. Just take the mean of multiple columns with `df.loc[:10, 'd'].mean()` and `df.loc[10:, 'd'].mean()` - JE_Muc Jan 30 '19 at 13:53

1 Answer

You may also want to look at swifter; it applies functions using multiprocessing only if that actually makes the code run faster.

In your case it is a terrible idea: each mean covers only 10 values, which is a really small amount of data, so distributing the work between cores will not help, and the cost of starting the processes will be much higher than the cost of the operations themselves.
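To illustrate how cheap the calculation is, here is a minimal single-process sketch of the average described in the question, using pandas' built-in `rolling`. It assumes the intended result is the mean of 'a' over the current row and up to nine previous rows; this is a swapped-in vectorized approach, not the loop from the question.

```python
import numpy as np
import pandas as pd

total_num = 1000
df = pd.DataFrame(np.arange(1, total_num * 2 + 1).reshape(total_num, 2),
                  columns=['a', 'b'])

# Mean of 'a' over the current row and up to 9 previous rows.
# min_periods=1 lets the first rows average over however many rows exist so far.
df['c'] = df['a'].rolling(window=10, min_periods=1).mean()

print(df.head(12))
```

This replaces the row-by-row `.loc` loop with a single call, which is usually where most of the time goes in code like this.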

Furthermore, sharing memory between processes is not a good idea (it is really costly), and that is what you are trying to do here. Usually you split the data beforehand and push the chunks to a multiprocessing function such as `Pool.map`, but once again, the data chunks should be much bigger.
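A rough sketch of the "split beforehand" idea, in case it helps. It assumes `multiprocessing.Pool.map` as the dispatch function and a 10-row window; the helper names `rolling_chunk`, `make_chunks` and `parallel_rolling_mean` are made up for this example. Each chunk carries a few extra leading rows so that its first windows are complete, and the results are returned to the parent instead of being written into shared state.

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd

WINDOW = 10  # rows averaged per output value, as in the question


def rolling_chunk(args):
    """Rolling mean for one chunk (hypothetical helper)."""
    values, n_overlap = args
    result = values.rolling(window=WINDOW, min_periods=1).mean()
    # Drop the rows that were only included to fill the first windows.
    return result.iloc[n_overlap:]


def make_chunks(series, n_chunks):
    """Split into contiguous chunks, each prefixed with up to WINDOW-1 earlier rows."""
    bounds = [len(series) * k // n_chunks for k in range(n_chunks + 1)]
    chunks = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        overlap = min(WINDOW - 1, start)
        chunks.append((series.iloc[start - overlap:end], overlap))
    return chunks


def parallel_rolling_mean(series, n_workers=4):
    """Send each chunk to a worker and stitch the returned pieces together."""
    with Pool(n_workers) as pool:
        parts = pool.map(rolling_chunk, make_chunks(series, n_workers))
    return pd.concat(parts)


if __name__ == '__main__':
    total_num = 1000
    df = pd.DataFrame(np.arange(1, total_num * 2 + 1).reshape(total_num, 2),
                      columns=['a', 'b'])
    df['c'] = parallel_rolling_mean(df['a'])
    print(df.tail())
```

For 1000 rows this will most likely be slower than the single-process version because of process start-up and pickling; it only starts to make sense when each chunk is much larger.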

You could use threads instead, as those may be what you are after, but keep Python's GIL in mind (you can read about threads, processes and the GIL in other answers, e.g. here).

Szymon Maszke