I have some code that I'm trying to run in parallel in order to speed it up. In short, the script iterates over many files; from each file it pulls a data frame, does some calculations on it, and writes the file name and the extracted / calculated data to a temporary list. Since this temp list is overwritten as each new file is processed, it is appended to a master list, which should contain the results for every processed file once the script finishes.
I have the main file-processing code as a function. If I run the code normally, the master list is populated correctly, but when I run it using Pool and map it is always empty.
For example:
from multiprocessing import Pool

# some code to generate the list of files as file_list

master_list = []

def myfunc(fle):
    temp_list = []
    with open(fle) as f:
        ...  # long set of data extraction instructions producing filename, data1 and data2
    temp_list.insert(0, filename)
    temp_list.insert(1, data1)
    temp_list.insert(2, data2)
    print(temp_list)               # check that temp_list inside the function is correct, and it is
    master_list.append(temp_list)
    print(master_list)             # master_list inside the function correctly contains the temp_list data
If I call this function normally, everything works fine.
for i in file_list:
    myfunc(i)

print(master_list)  # master_list is populated with data from all files
But if I try to parallelise the function with pool.map, the resulting master_list is empty, even though all the correct data is present in temp_list and that data is appended to master_list (as I can see from the print statements inside myfunc).
pool = Pool(4)
pool.map(myfunc, file_list)
pool.close()
pool.join()
print(master_list) # master_list is empty
The odd thing is that this happens even when I limit the pool to a single worker with pool = Pool(1).
Am I missing something about how Pool and map work together? I thought it might be a queue problem, but then surely limiting it to a single process would fix the empty master_list, which it doesn't.
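I suspect the fix is to have myfunc return temp_list and collect the list that pool.map returns. Below is a minimal, self-contained sketch of what I mean (the file names and calculations are just placeholders for my real extraction code), but I'd still like to understand why appending to the global list from inside the workers doesn't work:

from multiprocessing import Pool

def myfunc(fle):
    # placeholder extraction: the real code builds a data frame and calculates from it
    with open(fle) as f:
        data1 = len(f.read())   # placeholder calculated value
    data2 = data1 * 2           # placeholder calculated value
    return [fle, data1, data2]  # return the per-file result instead of appending to a global

if __name__ == "__main__":
    file_list = ["file_a.txt", "file_b.txt"]  # placeholder file list
    with Pool(4) as pool:
        results = pool.map(myfunc, file_list)  # pool.map collects the returned lists, in input order
    print(results)  # one [filename, data1, data2] entry per file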
Any advice welcome.