I am processing a lot of data and writing it to a file (a large JSON). I've implemented multiprocessing
to use all of my cores. I would like to dump the data to the file while it is being processed, but that seems to be tricky with parallel processing. When I run the script without any storing (everything stays in memory), the resource monitor shows that all CPU cores are busy. That's fine.
When I use another process that reads from a Manager().Queue()
(taken from here), or when I append the result to the file in each process (protected with a Manager().Lock()
acquired in blocking mode), CPU usage drops to that of a single process. Changing the number of workers doesn't make any difference.
import multiprocessing as mp

pool = mp.Pool(processes=8)
manager = mp.Manager()
queue = manager.Queue()
lock = manager.Lock()

# writer_job = pool.apply_async(writer, (queue,))

# Alternative way of creating the processes:
# jobs = []
# for s in sensors:
#     job = pool.apply_async(parseSensor, ([s, queue],))
#     jobs.append(job)
#
# for job in jobs:
#     job.get()
# (the lock is passed to the processes in the same way)

pool.map(parseSensor, zip(sensors, [queue] * len(sensors)))
# queue.put('kill')
pool.close()
pool.join()
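
A stripped-down, self-contained version of the queue-plus-dedicated-writer variant I'm describing looks roughly like this (parseSensor, the sensor list and the output path are simplified placeholders for my real code):

import json
import multiprocessing as mp


def parseSensor(args):
    # Placeholder for the real per-sensor work; puts its result on the queue.
    sensor, queue = args
    result = {'sensor': sensor, 'value': len(sensor)}  # dummy "parsing"
    queue.put(result)


def writer(queue):
    # Single process that owns the file and drains the queue until 'kill'.
    with open('output.json', 'w') as f:
        while True:
            item = queue.get()
            if item == 'kill':
                break
            f.write(json.dumps(item) + '\n')


if __name__ == '__main__':
    sensors = ['a', 'b', 'c', 'd']

    manager = mp.Manager()
    queue = manager.Queue()

    pool = mp.Pool(processes=8)
    writer_job = pool.apply_async(writer, (queue,))

    pool.map(parseSensor, zip(sensors, [queue] * len(sensors)))

    queue.put('kill')   # tell the writer to stop
    writer_job.get()    # wait for the writer to drain the queue
    pool.close()
    pool.join()

The idea is that only the writer process ever touches the file, so the workers should only block briefly while putting results on the queue rather than waiting on file I/O or a lock.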
Is that expected behaviour?