
I have 10 pandas DataFrames that I loop over to apply a function and store the results into CSV and .npz files. Since the 10 DataFrames are independent, I want to parallelize the for loop using multiprocessing, but I am unable to get any output.

import os

import pandas as pd
from scipy import sparse

dfs = [df_1, df_2, df_3, df_4, df_5, df_6, df_7, df_8, df_9, df_10]

# enumerate replaces the manual counter; start=1 keeps the original file names
for i, df in enumerate(dfs, start=1):
    X = df_to_sparse(df, Q, features)
    sparse.save_npz(os.path.join(data_path, f"X-{i}"), X)
    X1 = pd.DataFrame(X.todense())
    X1.to_csv(f'Features_{i}.csv', index=False)
    You need `threading` - https://docs.python.org/3/library/threading.html – Danail Petrov Dec 27 '20 at 18:21
  • Based on this link, I observed that threading doesn't help in my case. – Sunny Dec 27 '20 at 18:26
  • Multiprocessing is indeed the right solution for your problem. Are you using it on Windows in Jupyter? Because there is a known issue in that case (no output). Multithreading is not always parallel in Python because of the Global Interpreter Lock (aka the GIL). – Ismael EL ATIFI Dec 27 '20 at 20:06
  • @IsmaelELATIFI yes, I am on Windows and running this in a Jupyter notebook. I am also trying to dynamically name the CSV file in multiprocessing. – Sunny Dec 27 '20 at 20:33
  • https://medium.com/@grvsinghal/speed-up-your-python-code-using-multiprocessing-on-windows-and-jupyter-or-ipython-2714b49d6fac – Ismael EL ATIFI Dec 27 '20 at 20:44
  • @Sunny According to [this](https://scipy-cookbook.readthedocs.io/items/ParallelProgramming.html#Threads), most numpy operations do not hold the GIL. Since Pandas spends a lot of time inside numpy, threading is useful for some CPU-bound workloads. It depends. – Nick ODell Dec 28 '20 at 04:40

0 Answers