
I have some code that I'm trying to run in parallel in order to speed it up. In short, the script iterates through many files. From each file it pulls a data frame, does some calculations on it, and writes the file name and the extracted/calculated data to a temporary list. Since that temporary list is overwritten as each new file is processed, it is appended to a master list, which should contain the results for all processed files once the script finishes.

I have the main file-processing code as a function. If I run the code normally, the master list is populated as expected, but when I run it using Pool and map it is always empty.

For example:

from multiprocessing import Pool

# some code to generate the file list as file_list

master_list = []

def myfunc(fle):
    temp_list = []
    with open(fle) as f:
        # long set of data extraction instructions
        ...

    temp_list.insert(0, filename)
    temp_list.insert(1, data1)
    temp_list.insert(2, data2)

    print(temp_list) # check that temp_list works inside the function, and it does

    master_list.append(temp_list)

    print(master_list) # master_list inside the function correctly contains the temp_list data

If I call this function normally, everything works fine.

for i in file_list:
    myfunc(i)

print(master_list) # master_list is populated with data from all files

But if I try to parallelise the function with pool.map, the resulting master_list is empty, even though all the correct data is present in temp_list and is appended to master_list (as I can see from the print statements inside myfunc).

pool = Pool(4) 
pool.map(myfunc, file_list)
pool.close()
pool.join() 

print(master_list) # master_list is empty

The odd thing is that this happens even when I limit the pool to a single process with pool = Pool(1).

Am I missing something about how Pool and map work together? I thought it might be a queue problem, but surely limiting the pool to a single process would then fix the empty master_list, which it doesn't.

Any advice welcome.

its_broke_again
  • Possible duplicate of [Python:Appending to the same list from different processes using multiprocessing](https://stackoverflow.com/questions/42490368/pythonappending-to-the-same-list-from-different-processes-using-multiprocessing) – naivepredictor Apr 01 '19 at 10:15
  • You are trying to access the same object (the same memory address) at the same time from each process, when each process should manage its own memory area. Try to use multiprocessing.Manager.list – naivepredictor Apr 01 '19 at 10:17
  • https://jeffknupp.com/blog/2013/06/30/pythons-hardest-problem-revisited/ – naivepredictor Apr 01 '19 at 10:20

1 Answer


Try returning the list from myfunc without appending it to the master list, and then do:

master_list = pool.map(myfunc, file_list)

In short, appending to a global list does not work when the function runs in worker processes; each process appends to its own copy, so the parent's master_list never changes. Instead, return the individual list from the function and let pool.map collect the results into one list for you.
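A minimal sketch of that pattern, assuming the file_list from the question and a placeholder extract_data helper standing in for the parsing step:

from multiprocessing import Pool

def myfunc(fle):
    with open(fle) as f:
        data1, data2 = extract_data(f)  # placeholder for your extraction code
    return [fle, data1, data2]

if __name__ == '__main__':
    with Pool(4) as pool:
        master_list = pool.map(myfunc, file_list)
    print(master_list)  # one [filename, data1, data2] entry per input file

pool.map returns the workers' return values in the same order as file_list, so no shared state is needed.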

Dipan Ghosh
  • There is still a possibility that at least two processes might want to access file_list at the same time. – naivepredictor Apr 01 '19 at 10:22
  • Additionally, you mix the concepts of multithreading and multiprocessing. – naivepredictor Apr 01 '19 at 10:23
  • I will try this, but will it not also be a problem that each process running myfunc will return a list with the same name (i.e. temp_list), so the last one to complete will overwrite the previous ones? – its_broke_again Apr 01 '19 at 11:07
  • Thanks, and for others: I couldn't find good examples of Manager with the Pool function (most examples use Process(target=...)). What worked was declaring manager = Manager() and then master_list = manager.list() before defining myfunc. – its_broke_again Apr 01 '19 at 11:59
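For anyone looking for the Manager-with-Pool pattern mentioned in the last comment, a rough sketch of one way to wire it up; extract_data and file_list are stand-ins from the question, and the shared list is passed to the workers explicitly rather than used as a global:

from functools import partial
from multiprocessing import Manager, Pool

def myfunc(master_list, fle):
    with open(fle) as f:
        data1, data2 = extract_data(f)  # placeholder for your extraction code
    master_list.append([fle, data1, data2])  # the Manager proxy is shared across processes

if __name__ == '__main__':
    manager = Manager()
    master_list = manager.list()
    with Pool(4) as pool:
        pool.map(partial(myfunc, master_list), file_list)
    print(list(master_list))  # convert the proxy back into a plain list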