
As discussed here: Python: Multiprocessing on Windows -> Shared Readonly Memory, I have a heavily parallelized task.

Multiple workers do some work and in the end need to access a few keys of a dictionary that contains several million key:value pairs. Which keys will be accessed is only known inside the worker after some further processing that also involves handling files (the example below is simplified for demonstration purposes).

Previously, my solution was to keep this big dictionary in memory, pass it once into shared memory and have the individual workers access it there. But that consumes a lot of RAM... So I wanted to use shelve instead (because the values of that dictionary are again dicts or lists).
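
For context, a stripped-down sketch of roughly what that previous approach looked like (the names and dummy data are placeholders; the real dictionary is built from files). The dictionary is handed to each worker process once via the Pool initializer, so every worker ends up holding its own copy, which is why the RAM usage gets so high:

import multiprocessing

bigDict = None  # set once in every worker process by the initializer

def initWorker(d):
    # runs once when a worker process starts; keeps a copy of the dictionary
    global bigDict
    bigDict = d

def dictWorker(key):
    # only the key travels to the worker; the lookup uses the worker's own copy
    return bigDict[key]

if __name__ == '__main__':
    bigDict = {str(i): [i, i * 2] for i in range(1000000)}  # dummy data
    p = multiprocessing.Pool(initializer=initWorker, initargs=(bigDict,))
    for returnValue in p.imap_unordered(dictWorker, (str(i) for i in range(10000))):
        pass
    p.close()
    p.join()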

So a simplified example of what I tried was:

import multiprocessing
import shelve

def shelveWorker(tupArgs):
    # each job receives a (key, shelve) tuple; the lookup happens in the worker
    id, DB = tupArgs
    return DB[id]

if __name__ == '__main__':
    DB = shelve.open('file.db', flag='r', protocol=2)
    joblist = []
    for id in range(10000):
        joblist.append((str(id), DB))

    p = multiprocessing.Pool()
    for returnValue in p.imap_unordered(shelveWorker, joblist):
        # do something with returnValue
        pass
    p.close()
    p.join()

Unfortunately I get:

"TypeError: can't pickle DB objects"

But IMHO it does not make sense for each worker to open the shelve itself (DB = shelve.open('file.db', flag='r', protocol=2)) on its own, because of the slower runtime (I have several thousand workers).
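
One thing I considered, although I am not sure whether it is the right pattern, is to open the shelve once per worker process rather than once per job, using the Pool's initializer. A rough sketch of what I mean (same placeholder file name as above):

import multiprocessing
import shelve

DB = None  # one read-only shelve handle per worker process

def initWorker(dbPath):
    # opens the shelve once when the worker process starts
    global DB
    DB = shelve.open(dbPath, flag='r', protocol=2)

def shelveWorker(id):
    # only the key is pickled and sent here, not the shelve object
    return DB[id]

if __name__ == '__main__':
    p = multiprocessing.Pool(initializer=initWorker, initargs=('file.db',))
    for returnValue in p.imap_unordered(shelveWorker, (str(id) for id in range(10000))):
        # do something with returnValue
        pass
    p.close()
    p.join()

That way the shelve would only be opened as many times as there are worker processes, not several thousand times, but I do not know whether this is the intended way or whether there is a better pattern.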

How to go about it?

Thanks!

tim
  • Hey. "I have several thousand workers" - are you sure? According to the defaults of `multiprocessing.Pool()` you'll have as many workers as there are CPUs. Workers are then reused to process multiple of the `joblist` items each. At least that's how I get it (my experience with Python is quite limited though) – Ilya Denisov Mar 15 '22 at 11:05
