
I have an iterable object Z in Python which is too large to fit into memory. I would like to perform a parallel calculation over this object and write the results, in the order they appear in Z, to a file. Consider this silly example:

import numpy as np
import multiprocessing as mp
import itertools as itr

FOUT = open("test",'w')

def f(x):
    val = hash(np.random.random())
    FOUT.write("%s\n"%val)

N = 10**9
Z = itr.repeat(0,N)

P = mp.Pool()
P.map(f,Z,chunksize=50)
P.close()
P.join()

FOUT.close()

There are two major problems with this:

  1. multiple results can be written to the same line
  2. map returns a result with N objects in it - this will be too big to hold in memory (and we don't need it!).

What I've tried:

  • Using a global lock mp.Lock() to share the FOUT resource: doesn't help, because I think each worker creates its own namespace (a sketch of roughly what I tried is below, after the imap snippet).
  • Using apply_async instead of map: while the callback fixes 1) and 2), apply_async doesn't accept an iterable object.
  • Using imap instead of map and iterating over the results:

Something like:

def f(x):
    val = hash(np.random.random())
    return val

P = mp.Pool()
C = P.imap(f,Z,chunksize=50)
for x in C: 
    FOUT.write("%s\n"%x)

This still uses inordinate amounts of memory, though I'm not sure why.
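For reference, the lock attempt from the first bullet looked roughly like the sketch below (reconstructed from memory, so details may differ; the init_worker/initargs part is just how I passed the lock to the workers, and I opened the file inside the worker instead of sharing FOUT). Even when the lock is shared this way it only stops two workers from interleaving on one line - the lines still come out in completion order rather than the order of Z, and map still builds the giant result list:

import numpy as np
import multiprocessing as mp
import itertools as itr

def init_worker(lock_):
    # stash the shared lock in each worker's globals
    global LOCK
    LOCK = lock_

def f(x):
    val = hash(np.random.random())
    # the lock serialises writes so two workers can't collide on one line,
    # but it says nothing about the *order* in which lines appear
    with LOCK:
        with open("test", "a") as fout:
            fout.write("%s\n" % val)

if __name__ == "__main__":
    N = 10**9
    Z = itr.repeat(0, N)
    lock = mp.Lock()
    P = mp.Pool(initializer=init_worker, initargs=(lock,))
    P.map(f, Z, chunksize=50)   # map still materialises all N tasks/results (problem 2 remains)
    P.close()
    P.join()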
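The only workaround I can come up with is to pull bounded batches off Z myself with itertools.islice and run imap over one batch at a time, roughly like the sketch below (untested; the batch size of 10**5 is just a guess). That keeps the output in the order of Z and never holds more than one batch of results in memory, but it feels like I'm hand-rolling something the Pool should be able to do, so I'd still like to know the proper way:

import numpy as np
import multiprocessing as mp
import itertools as itr

def f(x):
    return hash(np.random.random())

if __name__ == "__main__":
    N = 10**9
    BATCH = 10**5   # guess: tune to trade memory for dispatch overhead
    Z = itr.repeat(0, N)

    P = mp.Pool()
    with open("test", "w") as FOUT:
        while True:
            # pull at most BATCH items off the (lazy) iterable
            batch = list(itr.islice(Z, BATCH))
            if not batch:
                break
            # imap preserves input order, so results are written in order
            for val in P.imap(f, batch, chunksize=50):
                FOUT.write("%s\n" % val)
    P.close()
    P.join()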

  • I answered this one the other day about processing a file with multiple processes: http://stackoverflow.com/a/11196615/496445 , though it doesn't fully answer the memory issue. Would this other answer + writing to separate files in each process fix your issue? – jdi Jun 26 '12 at 23:18
  • concerning that last code snippet, maybe calling `FOUT.flush()` from time to time will help reduce the memory usage? – dmytro Jun 27 '12 at 00:24
  • @dmytro I tried it and there was no change in the memory usage. The problem isn't that the file buffer is taking up space, but that the intermediate results are being stored in memory. – Hooked Jun 27 '12 at 14:03
