
My code (part of a genetic optimization algorithm) runs a few processes in parallel, waits for all of them to finish, reads the output, and then repeats with a different input. Everything was working fine when I tested with 60 repetitions. Since it worked, I decided to use a more realistic number of repetitions, 200. I received this error:

File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
 self.run()
File "/usr/lib/python2.7/threading.py", line 504, in run
 self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 302, in _handle_workers
 pool._maintain_pool()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 206, in _maintain_pool
 self._repopulate_pool()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 199, in _repopulate_pool
 w.start()
File "/usr/lib/python2.7/multiprocessing/process.py", line 130, in start
 self._popen = Popen(self)
File "/usr/lib/python2.7/multiprocessing/forking.py", line 120, in __init__
 self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Here is a snippet of my code that uses pool:

def RunMany(inputs):
    from multiprocessing import cpu_count, Pool
    proc=inputs[0]
    pool=Pool(processes = proc)
    results=[]
    for arg1 in inputs[1]:
        for arg2 in inputs[2]:
            for arg3 in inputs[3]:
                results.append(pool.apply_async(RunOne, args=(arg1, arg2, arg3)))
    casenum=0
    datadict=dict()
    for p in results:
        #get results of simulation once it has finished
        datadict[casenum]=p.get()
        casenum+=1
    return datadict

The RunOne function creates an object of a class I created, uses a computationally heavy Python package to solve a chemistry problem that takes about 30 seconds, and returns the object containing the output of the chemistry solver.
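
Roughly, RunOne has this shape (a simplified sketch with placeholder names; ChemistryCase and solve() stand in for my real class and solver):

def RunOne(arg1, arg2, arg3):
    case = ChemistryCase(arg1, arg2, arg3)  # placeholder for my custom class
    case.solve()                            # heavy chemistry solver, ~30 seconds
    return case                             # the returned object must be picklable,
                                            # since apply_async ships it back from
                                            # the worker process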

So my code calls RunMany serially, and RunMany then calls RunOne in parallel. In my testing, I've used a pool of 10 processes (the computer has 16) and 20 calls to RunOne per pool; in other words, len(inputs[1])*len(inputs[2])*len(inputs[3])=20. Everything worked fine when my code called RunMany 60 times, but I ran out of memory when I called it 200 times.

Does this mean some process isn't correctly cleaning up after itself? Do I have a memory leak? How can I determine whether I have a memory leak, and how do I find its cause? The only item that grows in my 200-repetition loop is a list of numbers that grows from length 0 to length 200. I also have a dictionary of objects of a custom class I've built, but it is capped at 50 entries: each time the loop executes, it deletes one item from the dictionary and replaces it with another.
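
One way I could check, I think, is to log the parent process's peak memory and the number of live child processes after each call to RunMany (a rough sketch using only the standard library; report_usage is just a helper name for illustration):

import resource
import multiprocessing

def report_usage(label):
    # ru_maxrss is this process's peak resident set size (kilobytes on Linux);
    # active_children() lists child processes that are still alive.
    # If either number keeps climbing between calls to RunMany,
    # something is not being cleaned up.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    workers = multiprocessing.active_children()
    print('%s: peak RSS %d kB, %d live child processes' % (label, peak_kb, len(workers)))

Calling something like report_usage('run %d' % run) at the end of each loop iteration (see the loop in the edit below) would show whether worker processes or memory keep accumulating.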

Edit: Here is a snippet of the code that calls RunMany

for run in range(nruns):
    #create inputs object for RunMany using genetic methods. 
    #Either use starting "population" or create "child" inputs from successful previous runs
    datadict = RunMany(inputs)

    sumsquare=0
    for i in range(len(datadict)): #input condition
        sumsquare+=Compare(datadict[i],Target[i]) #compare result to target

    with open(os.path.join(mainpath,'Outputs','output.txt'),'a') as f:
        f.write('\t'.join([str(x) for x in [inputs.name, sumsquare]])+'\n')

    Objective.append(sumsquare) #add sum of squares to list, to be plotted outside of loop
    population[inputs]=sumsquare #add/update the model in the "population", using the inputs object as a key, and its objective function as the value
    if len(population)>initialpopulation:
        population = PopulationReduction(population) #reduce the "population" by "killing" unfit "genes"
    avgtime=(datetime.datetime.now()-starttime2)//(run+1)
    remaining=(nruns-run-1)*avgtime
    print(' Finished '+str(run+1)+' / ' +str(nruns)+'. Elapsed: '+str(datetime.datetime.now().replace(microsecond=0)-starttime)+' Remaining: '+str(remaining)+' Finish at '+str((datetime.datetime.now()+remaining).replace(microsecond=0))+'~~~', end="\r")
  • As it is now the "results" is going to grow out of proportion very quickly, and when that happens - you will run out of memory as you never close the opened pool of processes. – Tymoteusz Paul Nov 03 '14 at 15:19
  • Puciek: "results" only has at most 20 items in it. The RunMany function is called by my main function, and "results" is local to the RunMany function. As a local variable, shouldn't it be deleted when RunMany is finished? Or do pools not work that way? – Jeff Nov 03 '14 at 15:24
  • It is supposed to work that way, but sometimes python has issues cleaning up after itself. Have a look here for a similar issue: http://stackoverflow.com/questions/24564782/ways-to-free-memory-back-to-os-from-python/24564983#24564983 – Tymoteusz Paul Nov 03 '14 at 15:25
  • Can you include the code you're using to call `RunMany` in a loop? – dano Nov 03 '14 at 15:28
  • Also, you could try `import gc ; gc.collect()` after each call to `RunMany` to force garbage collection, in case you're just dealing with Python not cleaning up garbage quickly enough. – dano Nov 03 '14 at 15:36
  • Minor point but I think you should use `xrange` instead of `range` here because you only need the indexing and not a list. – shuttle87 Nov 03 '14 at 16:01
  • @Puciek: I added `pool.close()` to RunMany, and now I get this error `Exception RuntimeError: RuntimeError('cannot join current thread',) in ignored` – Jeff Nov 03 '14 at 17:51
  • @Jeff I am not sure where you added it, but it seems like you are trying to join a thread from inside of it, which is not a valid operation. – Tymoteusz Paul Nov 03 '14 at 17:56
  • @Puciek: You were correct. Now I've added `pool.close()` and `pool.join()` in the correct places in RunMany (in between the two `for` loops), and everything seems to work great. Thanks! – Jeff Nov 03 '14 at 19:53
  • @Jeff you are welcome. Please gather all this troubleshooting and new code into an answer and accept it when you can, this way future readers will be able to learn from it. – Tymoteusz Paul Nov 03 '14 at 20:25

1 Answer

As shown in the comments to my question, the answer came from Puciek.

The solution was to close the pool of processes once all the work has been submitted. I thought the pool would be cleaned up automatically because the results variable is local to RunMany and would be deleted when RunMany returned, but Python doesn't always work that way: without an explicit pool.close(), each call to RunMany left its worker processes alive, so they accumulated over the repetitions until os.fork() failed with "Cannot allocate memory".

The fixed code is:

def RunMany(inputs):
    from multiprocessing import cpu_count, Pool
    proc=inputs[0]
    pool=Pool(processes = proc)
    results=[]
    for arg1 in inputs[1]:
        for arg2 in inputs[2]:
            for arg3 in inputs[3]:
                results.append(pool.apply_async(RunOne, args=(arg1, arg2, arg3)))
    #new section
    pool.close()
    pool.join()
    #end new section
    casenum=0
    datadict=dict()
    for p in results:
        #get results of simulation once it has finished
        datadict[casenum]=p.get()
        casenum+=1
    return datadict
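
An equivalent way to guarantee the cleanup even if a worker raises an exception is to wrap the pool in try/finally (a sketch of the same function, not the code I actually ran; Pool only became usable as a context manager in Python 3.3, so a with block isn't available here):

def RunMany(inputs):
    from multiprocessing import Pool
    pool = Pool(processes=inputs[0])
    try:
        results = []
        for arg1 in inputs[1]:
            for arg2 in inputs[2]:
                for arg3 in inputs[3]:
                    results.append(pool.apply_async(RunOne, args=(arg1, arg2, arg3)))
        # get() blocks until each result is ready
        return dict(enumerate(p.get() for p in results))
    finally:
        pool.close()   # no more work will be submitted to the pool
        pool.join()    # wait for the worker processes to exit

Calling close() and join() before get(), as in the fixed code above, also works: get() simply retrieves results the workers have already computed.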