
I am trying to analyse a large shelve with multiprocessing.Pool. Since it is opened in read-only mode it should be safe for concurrent reads, but it seems that a large object is read up front and then slowly dispatched through the pool. Can this be done more efficiently?

Here is a minimal example of what I'm doing. Assume that test_file.shelf already exists and is large (14 GB+). I can see this snippet hogging 20 GB of RAM, even though only a small part of the shelve needs to be read at a time (there are many more items than processes).

from multiprocessing import Pool
import shelve

def func(key_val):
    print(key_val)

with shelve.open('test_file.shelf', flag='r') as shelf,\
     Pool(4) as pool:
    list(pool.imap_unordered(func, iter(shelf.items())))
Janne Karila
Zenon

1 Answer

Shelves are inherently not fast to work with because they behave like a dictionary-like object backed by an on-disk database: every item you access has to be fetched and unpickled, which takes a while for large shelves. Note also that if the shelf is opened with writeback=True, each retrieved item is additionally kept in an in-memory cache dictionary, which grows as you iterate. lib reference
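To make the access pattern concrete, here is a minimal sketch (the file name tiny.shelf is illustrative) showing that a value is unpickled from disk only when you index into the shelf:

```python
import shelve

# build a tiny shelf (stand-in for a real, large one)
with shelve.open('tiny.shelf') as s:
    s['a'] = [1, 2, 3]

# flag='r' opens read-only; the value is unpickled from disk on access
with shelve.open('tiny.shelf', flag='r') as s:
    value = s['a']
    print(value)  # → [1, 2, 3]
```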

Also from the docs:

(imap) For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.

You are using the default chunksize of 1, which causes it to take a long time to work through your very large shelf file. The docs suggest passing a larger chunksize, rather than sending one item at a time, to speed it up.

On concurrency, the shelve docs also say:

The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing. Unix file locking can be used to solve this, but this differs across Unix versions and requires knowledge about the database implementation used.

Last, just as a note: I'm not sure your assumption about it being safe for multiprocessing holds out of the box; it depends on the underlying database implementation.

Edit: as pointed out by juanpa.arrivillaga, this answer describes what is happening in the backend: your entire iterable may be consumed up front, which causes the large memory usage.

MyNameIsCaleb
  • The whole idea of using an iterable is that the dictionary objects are loaded into memory only when the pool asks for the next one. They are big, so I want chunksize to be 1. I am fine with waiting a bit longer, and I do see that behaviour, but only after 20 GB of RAM have been used does anything happen. As I only use the shelve for reading (with the flag 'r') it should be safe; it would indeed be a terrible idea to write to it concurrently :) – Zenon Apr 16 '19 at 15:00
  • When you open a shelf, depending on the implementation, it can take a long time and load a lot into memory. Your items are loaded by the iterator, but memory can also grow as the shelf cache is added to while each item is retrieved. Your question was how to make it faster; according to the docs you should use chunk sizing to do this. Overall, shelves are very inefficient at large sizes. – MyNameIsCaleb Apr 16 '19 at 15:38