I have an iterable object Z in Python which is too large to fit into memory. I would like to perform a parallel calculation over this object and write the results, in the order they appear in Z, to a file. Consider this silly example:
import numpy as np
import multiprocessing as mp
import itertools as itr

FOUT = open("test", 'w')

def f(x):
    # each worker computes a value and writes it straight to the shared file
    val = hash(np.random.random())
    FOUT.write("%s\n" % val)

N = 10**9
Z = itr.repeat(0, N)

P = mp.Pool()
P.map(f, Z, chunksize=50)
P.close()
P.join()
FOUT.close()
There are two major problems with this:

1. Multiple results can be written to the same line.
2. A result with N objects in it is returned - this will be too big to hold in memory (and we don't need it!).
What I've tried:

- Using a global lock mp.Lock() to share the FOUT resource: doesn't help, because I think each worker creates its own namespace (a rough sketch of what I tried is at the end of the question).
- Using apply_async instead of map: while the callback fixes problems 1 and 2, it doesn't accept an iterable object (see the second sketch at the end).
- Using imap instead of map and iterating over the results:
Something like:
def f(x):
    val = hash(np.random.random())
    return val

P = mp.Pool()
C = P.imap(f, Z, chunksize=50)
for x in C:
    FOUT.write("%s\n" % x)
This still uses inordinate amounts of memory, though I'm not sure why.
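
For reference, here is roughly what the global-lock attempt looked like (reconstructed from memory, so treat it as a sketch of the idea rather than the exact code):

import numpy as np
import multiprocessing as mp
import itertools as itr

FOUT = open("test", 'w')
LOCK = mp.Lock()  # global lock, meant to serialize writes to FOUT

def f(x):
    val = hash(np.random.random())
    with LOCK:  # only one worker should be writing at a time
        FOUT.write("%s\n" % val)

N = 10**9
Z = itr.repeat(0, N)

P = mp.Pool()
P.map(f, Z, chunksize=50)
P.close()
P.join()
FOUT.close()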
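
And a sketch of the apply_async direction (write_result is just an illustrative name): the callback runs in the parent process, which is why it fixes 1 and 2, but I have to loop over Z myself and submit one task per element because apply_async doesn't take an iterable:

import numpy as np
import multiprocessing as mp
import itertools as itr

FOUT = open("test", 'w')
N = 10**9
Z = itr.repeat(0, N)

def f(x):
    return hash(np.random.random())

def write_result(val):
    # the callback is invoked in the parent process, so only one writer touches FOUT
    FOUT.write("%s\n" % val)

P = mp.Pool()
for z in Z:
    # apply_async takes a single argument tuple, not an iterable of tasks
    P.apply_async(f, (z,), callback=write_result)
P.close()
P.join()
FOUT.close()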