I have an embarrassingly parallel problem: the functions to parallelize share no memory state, but each needs to append lines to a CSV file. Lines can be written to the file in any order, and the whole run can take a long time, so we need to be able to read progress from the CSV file while it is still running.
Is it safe (and/or better) to use a Pool with a Lock shared through the initializer, rather than, as described in [1], a Queue fed by the worker processes with a single process writing to the CSV file? A sketch of that Queue-based alternative is included at the end for comparison.
[1] Python multiprocessing safely writing to a file
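Just to make the "read progress" requirement concrete: while the pool is running, another interpreter simply counts the lines already written to the CSV (a hypothetical check, not part of the job itself)::

    # run in a separate interpreter while the pool is working
    with open('/tmp/a.csv') as csvfile:
        done = sum(1 for _ in csvfile)
    print '{} tasks done so far'.format(done)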
Here is the Lock-based version::
    from random import random
    from time import sleep, time
    from multiprocessing import Pool, Lock
    import os


    def add_to_csv(line, fd='/tmp/a.csv'):
        """Append a line to the CSV file, serialized by the shared lock."""
        pid = os.getpid()
        with lock:
            with open(fd, 'a') as csvfile:
                sleep(1)  # simulate a slow write
                csvfile.write(line)
                print ' line added by {}'.format(pid)


    def f(x):
        start = time()
        pid = os.getpid()
        print '=> pi: {} started'.format(pid)
        sleep(6*random())  # simulate a long computation
        res = 2*x
        print 'pi: {} res {} in {:2.2}s'.format(pid, res, time() - start)
        add_to_csv(str(res) + '\n')
        return res


    def init(l):
        # expose the lock created in the parent as a global in each worker
        global lock
        lock = l


    if __name__ == '__main__':
        sleep(2)
        lock = Lock()
        pool = Pool(initializer=init, initargs=(lock,))
        out = pool.map(f, [1, 2, 3, 4])
        print out
Execution gives this::
    => pi: 521 started
    => pi: 522 started
    => pi: 523 started
    => pi: 524 started
    pi: 521 res 2 in 1.3s
     line added by 521
    pi: 523 res 6 in 3.4s
     line added by 523
    pi: 524 res 8 in 5.2s
    pi: 522 res 4 in 5.4s
     line added by 524
     line added by 522
    [2, 4, 6, 8]
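For comparison, here is a minimal sketch of the Queue-based alternative described in [1], in the same Python 2 style as above: a Manager().Queue() is handed to every worker, and a single listener task (the names listener and worker are just illustrative) is the only one that touches the file. This is only a sketch of the pattern, not the exact code from [1]::

    from multiprocessing import Pool, Manager


    def listener(q, fd='/tmp/a.csv'):
        # single writer: drain the queue until the sentinel (None) arrives
        with open(fd, 'a') as csvfile:
            while True:
                line = q.get()
                if line is None:
                    break
                csvfile.write(line)
                csvfile.flush()  # so progress stays readable while running


    def worker(args):
        x, q = args
        res = 2*x
        q.put(str(res) + '\n')  # hand the line off to the single writer
        return res


    if __name__ == '__main__':
        manager = Manager()
        q = manager.Queue()
        pool = Pool(processes=5)          # one slot is taken by the listener
        pool.apply_async(listener, (q,))
        out = pool.map(worker, [(x, q) for x in [1, 2, 3, 4]])
        q.put(None)                       # tell the listener to stop
        pool.close()
        pool.join()
        print out

The appeal of the Lock version is that it avoids the extra writer process and the Manager, at the cost of each worker blocking on the file while it holds the lock.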