I'm looking to process a loop in a multi-threaded way. The work for each key is done independently, and the results are collected in a final result dictionary.
My attempt so far, inspired by How do I parallelize a simple Python loop?:
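For context, the serial version I'm trying to speed up looks roughly like this (the real per-key work is much heavier; the body here is just a placeholder):

# serial baseline: one result per key, collected in a dict
res_dict = {}
for k in ["k1", "k2", "k3"]:
    # stand-in for the real, heavier per-key computation
    res_dict[k] = k + str(5 * 5)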
import time
from joblib import Parallel, delayed

# global dict to capture all results
res_dict = {}

# list of keys to be processed independently and in parallel
keys = ["k1", "k2", "k3"]
for k in keys:
    res_dict[k] = None

def process(key, i):
    # modifying the global dict, for that specific key
    global res_dict
    print("Processing", key, i)
    res_dict[key] = key + str(i * i)
    # return i * i

Parallel(n_jobs=4)(delayed(process)(k, 5) for k in keys)
The result is {"k1": None, "k2": None, "k3": None}, i.e. still the initialization. Nothing is changed after processing.
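For clarity, since process(k, 5) builds key + str(5 * 5), the result I expected in res_dict was:

{"k1": "k125", "k2": "k225", "k3": "k325"}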
Any thoughts? Any recommendations on how to do this better? Thanks.
UPDATE: ATTEMPT 2
Learned that a READ works but a MUTATE does not:
res_dict = {}
keys = list(range(0, 5))
for k in keys:
    res_dict[k] = k

def process(k, i):
    global res_dict
    # try reading and re-assigning back
    print(k, i)
    # assigning back is not working here, but reading DOES
    res_dict[k] = res_dict[k] + 2
    return res_dict[k]

tmp_res = Parallel(n_jobs=2)(delayed(process)(k, 5) for k in keys)
print(tmp_res)  # [2, 3, 4, 5, 6]

# assign results back to overwrite
for idx, k in enumerate(keys):
    res_dict[k] = tmp_res[idx]
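A variation I'm also considering, to avoid relying on the result order at all: return the key together with the value and rebuild the dict from the pairs. A sketch, untested; process_pair is just a renamed variant of process above:

def process_pair(k, i):
    # same computation as process(), but return (key, value)
    # so the mapping does not depend on the order of the results
    return k, res_dict[k] + 2

pairs = Parallel(n_jobs=2)(delayed(process_pair)(k, 5) for k in keys)
res_dict = dict(pairs)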
Concerns:
- tmp_res is a list that SEEMINGLY collects the loop results IN ORDER. I'm not sure whether that ordering is guaranteed, given the parallel nature.
- In reality, each res_dict[k] is a DataFrame, and I have ~100 of them to process as fast as possible, hence the intention to multithread. With the reassignment on every run,
res_dict[k] = tmp_res[idx]
will there be a memory issue? Each time, the older DataFrame version stays in memory but is no longer used. (A stripped-down sketch of the real workload follows below.)
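To make the second concern concrete, here is a stripped-down sketch of the real workload; process_df and the "value" column are made up for illustration, and the real per-key transformation is much heavier:

import pandas as pd
from joblib import Parallel, delayed

keys = ["k" + str(i) for i in range(100)]
# ~100 DataFrames, one per key
res_dict = {k: pd.DataFrame({"value": range(3)}) for k in keys}

def process_df(df):
    # placeholder for the real per-key transformation
    return df.assign(value=df["value"] * 2)

tmp_res = Parallel(n_jobs=4)(delayed(process_df)(res_dict[k]) for k in keys)

for idx, k in enumerate(keys):
    # the old DataFrame for k is replaced here; my worry is whether
    # it lingers in memory after being overwritten
    res_dict[k] = tmp_res[idx]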