I have a function that requests a server, retrieves some data, processes it, and saves a CSV file. This function needs to be launched 20k times, and each execution lasts a different amount of time: sometimes more than 20 minutes, other times less than a second. I decided to go with multiprocessing.Pool.map
to parallelize the execution. My code looks like this:
from multiprocessing import Pool

def get_data_and_process_it(filename):
    print('getting', filename)
    ...
    print(filename, 'has been processed')

with Pool(8) as p:
    p.map(get_data_and_process_it, long_list_of_filenames)
Looking at how the prints are generated, it seems that long_list_of_filenames
is being split into 8 chunks and each chunk assigned to a worker up front, because the pool sometimes gets blocked on a single 20-minute execution with no other element of long_list_of_filenames
being processed during those 20 minutes. What I was expecting is for map
to schedule each element to a CPU core in a FIFO style.
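For example, something like the sketch below is the dispatching behaviour I had in mind. The chunksize=1 argument is just my guess at forcing one-element-at-a-time dispatch; I have not verified that it avoids the blocking:

from multiprocessing import Pool

# get_data_and_process_it and long_list_of_filenames as defined above
if __name__ == '__main__':
    with Pool(8) as p:
        # chunksize=1: a free worker picks up the next single filename
        # instead of receiving a large pre-split chunk of the list
        p.map(get_data_and_process_it, long_list_of_filenames, chunksize=1)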
Is there a better approach for my case?