
I have 10 pandas DataFrames that I loop over to apply a function and store the results into CSV and .npz files. Since the 10 DataFrames are independent, I want to parallelize the for loop using multiprocessing, but I am unable to get any output.

import os

import pandas as pd
from scipy import sparse

dfs = [df_1, df_2, df_3, df_4, df_5, df_6, df_7, df_8, df_9, df_10]

# enumerate replaces the manual counter; start=1 keeps the original file names
for i, df in enumerate(dfs, start=1):
    X = df_to_sparse(df, Q, features)
    sparse.save_npz(os.path.join(data_path, f"X-{i}"), X)
    X1 = pd.DataFrame(X.todense())
    X1.to_csv(f'Features_{i}.csv', index=False)
    You need `threading` - https://docs.python.org/3/library/threading.html – Danail Petrov Dec 27 '20 at 18:21
  • Based on this link, I observed that threading doesn't help in my case. – Sunny Dec 27 '20 at 18:26
  • Multiprocessing is indeed the right solution for your problem. Are you using it on Windows in Jupyter? Because there is a known issue in that case (no output). Multithreading is not always parallel in Python because of the Global Interpreter Lock (aka the GIL). – Ismael EL ATIFI Dec 27 '20 at 20:06
  • @IsmaelELATIFI yes, I am on Windows and running this in a Jupyter notebook. I am also trying to dynamically name the CSV file in multiprocessing. – Sunny Dec 27 '20 at 20:33
  • https://medium.com/@grvsinghal/speed-up-your-python-code-using-multiprocessing-on-windows-and-jupyter-or-ipython-2714b49d6fac – Ismael EL ATIFI Dec 27 '20 at 20:44
  • @Sunny According to [this](https://scipy-cookbook.readthedocs.io/items/ParallelProgramming.html#Threads), most numpy operations do not hold the GIL. Since Pandas spends a lot of time inside numpy, threading is useful for some CPU-bound workloads. It depends. – Nick ODell Dec 28 '20 at 04:40

0 Answers