
Because my data frame is big (100k rows), I tried a multiprocessing approach to speed up my calculation. However, this code keeps running in my Jupyter notebook forever and I cannot terminate it at all. My CPU usage (8 cores) sits at only 5%.

import multiprocessing
import timeit

import numpy as np
import pandas as pd

def func(df):
    # fill_simulated_data and df1_before_last_30days are defined earlier in the notebook
    df_result = df.apply(lambda row: fill_simulated_data(row, df1_before_last_30days), axis=1)
    return df_result

def parallelize_dataframe(df, func):
    num_cores = multiprocessing.cpu_count() - 1  # leave one core free to not freeze the machine
    num_partitions = num_cores  # number of partitions to split the dataframe into
    df_split = np.array_split(df, num_partitions)
    pool = multiprocessing.Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

start = timeit.default_timer()

df1_last_30days_test = df1_last_30days.iloc[0:1000]

result = parallelize_dataframe(df1_last_30days_test, func)

stop = timeit.default_timer()
print('Process was done in: ' + str(stop - start) + ' seconds')

Without multiprocessing, my function takes around 5.9 s for a small dataframe (100 rows). What am I doing wrong here?

Duy H.L
  • Possible duplicate of [Jupyter notebook never finishes processing using multiprocessing (Python 3)](https://stackoverflow.com/questions/47313732/jupyter-notebook-never-finishes-processing-using-multiprocessing-python-3) – Milan Velebit Aug 29 '18 at 08:08
  • @MilanVelebit thanks – Duy H.L Aug 29 '18 at 12:00
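
For context on the suggested duplicate: when multiprocessing starts workers with the spawn method (the default on Windows, which is the typical setting for this Jupyter hang), the worker processes must be able to import the mapped function by name. A func defined in a notebook cell cannot be imported that way, so pool.map blocks forever at near-zero CPU. A minimal sketch of the common workaround, using a hypothetical module name my_workers.py (the worker body below is only a stand-in for the real per-row computation):

# my_workers.py -- hypothetical helper module; the mapped function must live
# in an importable file rather than in a notebook cell
import pandas as pd

def func(df):
    # stand-in for the real fill_simulated_data work from the question
    return df.apply(lambda row: row, axis=1)

# In the notebook: import the worker instead of defining it inline
import multiprocessing
import numpy as np
import pandas as pd
from my_workers import func

def parallelize_dataframe(df, func):
    num_cores = multiprocessing.cpu_count() - 1  # leave one core free
    df_split = np.array_split(df, num_cores)     # one chunk per worker
    with multiprocessing.Pool(num_cores) as pool:  # context manager cleans up the pool
        result = pd.concat(pool.map(func, df_split))
    return result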

0 Answers