
Can this be done?

What I have in mind is the following:

I'll have a dict, and each child process will add a new key:value pair to it.

Can this be done with multiprocessing? Are there any limitations?

Thanks!

Gábor Erdős
  • Yes, this can be done. It's a bad idea unless you make use of a suitable mutex to make sure your dictionary is consistent across different reads, but if all you want is a bunch of processes updating something in shared memory, any mainstream language can do that for you. – Akshat Mahajan Jul 14 '16 at 15:33
  • I just want to read in huge data. None of the data already in the `dict` will be edited, or deleted. – Gábor Erdős Jul 14 '16 at 15:35

3 Answers


If you just want to read in the data in the child processes, and each child will add a single key-value pair, you can use Pool:

import multiprocessing

def worker(x):
    # Each child returns a (key, value) tuple.
    return x, x ** 2

if __name__ == '__main__':
    multiprocessing.freeze_support()

    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    # The parent collects the tuples from map() and builds the dict itself.
    d = dict(pool.map(worker, range(10)))
    print(d)

Output:

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}
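
Since the goal is to read in large data, the same pattern extends naturally: each worker returns a (key, value) tuple built from whatever it read, and the parent assembles the dict. A minimal sketch along those lines, where load_one and the list of paths are made-up placeholders:

import multiprocessing

def load_one(path):
    # Placeholder parser: the key is the path, the value is the file's contents.
    with open(path) as f:
        return path, f.read()

if __name__ == '__main__':
    paths = ['a.txt', 'b.txt', 'c.txt']  # placeholder inputs
    with multiprocessing.Pool() as pool:
        big_dict = dict(pool.map(load_one, paths))
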
niemmi

Yes, Python supports multiprocessing.

Since you intend to work with the same dict for each "process", however, I would suggest multithreading rather than multiprocessing. That way every thread can use the same dict directly, instead of having to send data from the child processes back into the parent's dict.

Obviously, you'll have issues if your input is user-dependent or coming from stdin, but if you are reading from a file it should work fine.

I suggest this blog to assist you in using a thread pool. It also explains (somewhat) the use of `multiprocessing.dummy`, which the docs do not. A minimal sketch of that approach follows below.
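
As an illustration, here is a minimal thread-pool sketch using multiprocessing.dummy; the worker is a placeholder that just squares its input:

from multiprocessing.dummy import Pool as ThreadPool

shared = {}

def worker(x):
    # Threads live in the same process, so they can write into `shared` directly.
    # Plain inserts of distinct keys are generally safe in CPython; reach for a
    # threading.Lock if the updates get any more complicated than this.
    shared[x] = x ** 2

pool = ThreadPool(4)
pool.map(worker, range(10))
pool.close()
pool.join()
print(shared)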

kirkpatt
  • @GáborErdős Note that you may encounter recommendations not to use multi-threading because of the global interpreter lock in CPython, which makes only one thread run at a time. This should not discourage you - only CPU-bound computations are really affected by this. For I/O bound processes, like reading from something, the GIL does not come into play and `multithreading` is good to use. – Akshat Mahajan Jul 14 '16 at 15:38

If you use multiprocessing, the entries need to be propagated to the parent process's dictionary, but there is a solution for this:

Using multiprocessing is helpful because of that guy called the GIL ... so yes, I used it without thinking much, as it puts the cores to good use. But I use a manager, like:

a_manager = multiprocessing.Manager()

Then, as the shared structure, I use:

shared_map = a_manager.dict()

and in the calls to start the process workers:

worker_seq = []
for n in range(multiprocessing.cpu_count()):
    worker_seq.append(multiprocessing.Process(target=my_work_function, args=(shared_map,)))
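
Putting those pieces together, a minimal self-contained sketch (my_work_function is a placeholder here; each worker also receives an index so it has something to store):

import multiprocessing

def my_work_function(shared_map, n):
    # Writes go through the manager process and are visible to the parent.
    shared_map[n] = n ** 2

if __name__ == '__main__':
    a_manager = multiprocessing.Manager()
    shared_map = a_manager.dict()

    worker_seq = []
    for n in range(multiprocessing.cpu_count()):
        worker_seq.append(multiprocessing.Process(
            target=my_work_function, args=(shared_map, n)))

    for p in worker_seq:
        p.start()
    for p in worker_seq:
        p.join()

    print(dict(shared_map))  # copy the proxy into an ordinary dict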

There is quite some previous art to it.

Dilettant