
I have some code that I'm trying to run in parallel in order to speed it up. In short, the script iterates through many files. From each file it pulls a data frame, does some calculations on it, and writes the file name and the extracted/calculated data to a temporary list. Since that temporary list is overwritten as each new file is processed, it is appended to a master list, which should contain the results for all processed files once the script finishes.

I have the main file-processing code as a function. If I run the code normally, the master list is populated as expected, but when I run it using Pool and map it is always empty.

For example:

from multiprocessing import Pool

# some code to generate the file list as file_list

master_list = []

def myfunc(fle):
    temp_list = []
    with open(fle) as f:
        # long set of data extraction instructions
        ...

    temp_list.insert(0, filename)
    temp_list.insert(1, data1)
    temp_list.insert(2, data2)

    print(temp_list) # check that temp_list works inside the function, and it does

    master_list.append(temp_list)

    print(master_list) # master_list inside the function correctly contains the temp_list data

If I call this function normally, everything works fine.

for i in file_list:
    myfunc(i)

print(master_list) # master_list is populated with data from all files

But if I try to parallelise the function with pool.map, the resulting master_list is empty, even though all the correct data is present in temp_list and is appended to master_list (as I can see from the print statements inside myfunc).

pool = Pool(4) 
pool.map(myfunc, file_list)
pool.close()
pool.join() 

print(master_list) # master_list is empty

The odd thing is that this happens even when I limit the pool to a single process with pool = Pool(1).

Am I missing something about how Pool and map work together? I thought it might be a queue problem, but surely limiting the pool to a single process would then fix the empty master_list, which it doesn't.

Any advice welcome.

its_broke_again
  • Possible duplicate of [Python:Appending to the same list from different processes using multiprocessing](https://stackoverflow.com/questions/42490368/pythonappending-to-the-same-list-from-different-processes-using-multiprocessing) – naivepredictor Apr 01 '19 at 10:15
  • You are trying to access the same object (the same memory address) at the same time from each process, when each process should manage its own memory area. Try to use multiprocessing.Manager.list – naivepredictor Apr 01 '19 at 10:17
  • https://jeffknupp.com/blog/2013/06/30/pythons-hardest-problem-revisited/ – naivepredictor Apr 01 '19 at 10:20

1 Answer


Try returning the list from myfunc without appending it to the master list, and then do:

master_list = pool.map(myfunc, file_list)

In short, appending to a global list does not work when the function runs in worker processes; each process appends to its own copy, so the parent's master_list never changes. Instead, return the individual list from the function and let pool.map collect the results into one list for you.
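A minimal sketch of that pattern, assuming the file_list from the question and a placeholder extract_data helper standing in for the parsing step:

from multiprocessing import Pool

def myfunc(fle):
    with open(fle) as f:
        data1, data2 = extract_data(f)  # placeholder for your extraction code
    return [fle, data1, data2]

if __name__ == '__main__':
    with Pool(4) as pool:
        master_list = pool.map(myfunc, file_list)
    print(master_list)  # one [filename, data1, data2] entry per input file

pool.map returns the workers' return values in the same order as file_list, so no shared state is needed.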

Dipan Ghosh
  • There is still a possibility that at least two processes might want to access file_list at the same time. – naivepredictor Apr 01 '19 at 10:22
  • Additionally, you mix the concepts of multithreading and multiprocessing. – naivepredictor Apr 01 '19 at 10:23
  • I will try this, but will it not also be a problem that each process running myfunc will return a list with the same name (i.e. temp_list), so the last one to complete will overwrite the previous ones? – its_broke_again Apr 01 '19 at 11:07
  • Thanks, and for others: I couldn't find good examples of Manager with the Pool function (most examples use Process(target=...)). What worked was declaring manager = Manager() and then master_list = manager.list() before defining myfunc. – its_broke_again Apr 01 '19 at 11:59
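For anyone looking for the Manager-with-Pool pattern mentioned in the last comment, a rough sketch of one way to wire it up; extract_data and file_list are stand-ins from the question, and the shared list is passed to the workers explicitly rather than used as a global:

from functools import partial
from multiprocessing import Manager, Pool

def myfunc(master_list, fle):
    with open(fle) as f:
        data1, data2 = extract_data(f)  # placeholder for your extraction code
    master_list.append([fle, data1, data2])  # the Manager proxy is shared across processes

if __name__ == '__main__':
    manager = Manager()
    master_list = manager.list()
    with Pool(4) as pool:
        pool.map(partial(myfunc, master_list), file_list)
    print(list(master_list))  # convert the proxy back into a plain list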