Forgive me for my ignorance. I have a very long list of data that I want to process. My script works, but it is very slow, so I want to switch to multiprocessing. In total the script would have to run for about 30 days, and I want to speed that up. My original script appends results to a list while it loops, which is very useful because it allows me to turn off my computer when I need to and resume later. Below is a minimal working example to reproduce the problem. Be careful with the following piece of code, as it might slow down your computer; the idea is to stop it somewhere in the middle of a run.
from joblib import Parallel, delayed
import multiprocessing
from tqdm import tqdm

num_cores = multiprocessing.cpu_count()
Crazy_long_list = list(range(0, 10000000, 1))

def get_data(i):
    return ((i * 2), (i / 2))

normal_output = []

# Without parallel processing
for i in Crazy_long_list:
    normal_output.append(get_data(i))
When this script is stopped at any point, whatever has been appended to the list so far remains. That way I can save normal_output to a .csv file, load it back in the next day, and continue the script from where I stopped. You can test this by running that part of the code and stopping it at a random moment; the data collected up to that point is preserved. A rough sketch of this save/resume step is shown after the example output below.
print(normal_output[0:10])
[(0, 0.0),
(2, 0.5),
(4, 1.0),
(6, 1.5),
(8, 2.0),
(10, 2.5),
(12, 3.0),
(14, 3.5),
(16, 4.0),
(18, 4.5)]
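For completeness, that save/resume step looks roughly like this for me (the file name progress.csv is only a placeholder):

import csv

# save what has been computed so far (progress.csv is just an example name)
with open("progress.csv", "w", newline="") as f:
    csv.writer(f).writerows(normal_output)

# the next day: load the saved rows back and continue where the loop stopped
with open("progress.csv", newline="") as f:
    normal_output = [(int(a), float(b)) for a, b in csv.reader(f)]

for i in Crazy_long_list[len(normal_output):]:
    normal_output.append(get_data(i))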
My approach to doing this in parallel looks something like this:
parallel_data = []
parallel_data = Parallel(n_jobs=num_cores, verbose=50)(
    delayed(get_data)(i) for i in tqdm(Crazy_long_list)
)
Unfortunately, when I stop this script after it has been running for a while, there is no data in the parallel_data list; it only seems to get filled once the entire list has been processed. Who can help me here?
print(parallel_data)
[]
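The only workaround I can think of is to run Parallel over chunks of the list and append each finished chunk, so that at least completed chunks would survive an interruption. A rough sketch of that idea (chunk_size is an arbitrary value, not something I have tuned), though I do not know whether this is the right way to use joblib:

chunk_size = 100000  # arbitrary chunk size, just for illustration
parallel_data = []
for start in range(0, len(Crazy_long_list), chunk_size):
    chunk = Crazy_long_list[start:start + chunk_size]
    # results of a finished chunk are appended before the next chunk starts
    parallel_data.extend(
        Parallel(n_jobs=num_cores)(delayed(get_data)(i) for i in chunk)
    )
    # here parallel_data could be written to .csv, as in the sequential version

Is something along these lines sensible, or is there a better pattern for this?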