0

I want to store a large dictionary using shelve. More specifically, what I have is rather a dictionary of dictionaries, like a matrix:

dict[key1][key2] = value

The number of keys is around 5.000, which makes a matrix of 5.000x5.000 elements, impossible to store in my 4Gb of memory. For that reason I thought shelve may be a good solution.

Nevertheless, I haven't found any documentation about how to build a dictionary of dictionaries efficiently.

So far, what I have done is to use shelve to create a dictionary that contains regular dictionaries:

def create_entry(key1,key2,value,shelf): # shelf is the opened shelve file
   if key1 not in shelf: # first time we see this key, initialise it
    shelf[key1]={}
   # add the contents
   shelf[key1][key2]=value

That works, but it seems to be that there should be a better way to do that taking full advantage of shelve. Any ideas?

Muntsa
  • 33
  • 7

1 Answers1

3

The only real problem with your solution is that a shelf stores values by pickling them, so each of your second-level dicts has to be pickled and unpickled every time it's paged off to disk, which could have a significant performance cost. (Also, of course, each second-level dict has to fit in memory, but that's probably not an issue.)

If that doesn't matter, what you're doing is fine. But if it does, there are two other options.

You could use some more complex database than shelve and build your own wrapper around it.

Or, far more simple: Just use a single-level dict with pairs as keys:

def create_entry(key1, key2, value, shelf):
    shelf[(key1, key2)] = value

The downside with this is that if you need to, e.g., iterate all of the keys or values in the key1 subdictionary, you'll have to iterate the entire shelf. It's not exactly complicated ((key2 for key1, key2 in shelf if key1 == key)), but it will be a lot slower.


The code above doesn't actually work directly for shelve, because the keys have to be strings. But there are pretty easy ways to wrap that up; just as shelve pickles and unpickles values, you can write a wrapper than stringifies and parses keys—e.g.:

def _mapkey(self, key):
    return ','.join(map(str, key))
def _unmapkey(self, key):
    return tuple(map(int, key.split(',')))
def __getitem__(self, key):
    return self.shelf[self._mapkey(key)]
def __iter__(self):
    return map(self._unmapkey, self.shelf)
# etc.

For efficiency, it might be better to fork or subclass shelve so you can just generate a bytes for the underlying dbm key, instead of generating a str so that shelve can encode it. Then you could do this:

def _mapkey(self, key):
    return struct.pack('>II', key)
def _unmapkey(self, key):
    return struct.unpack('<II', key)
abarnert
  • 354,177
  • 51
  • 601
  • 671
  • Thank you very much. So far I am not having memory problems with my implementation, and I do need to iterate over both keys, so I will keep it like I had it. In the future I will take into account the option of creating a data base... – Muntsa May 29 '13 at 22:03
  • @Muntsa: The pair-of-keys-as-key implementation has no problem iterating over _both_ keys. In fact, it gets _simpler_ (and probably more efficient)—instead of `for key1 in shelf: for key2 in shelf[key1]:`, you can just do `for key1, key2 in shelf:`. The issue is when you need to iterate over just `key1`, or iterate over `key2` for a specific `key1`. – abarnert May 29 '13 at 22:27
  • @abarnert Looks like you are everywhere on "shelve".Following up from my last question & your comments on my use-case.This is similar to my use case, dict[str"(100.0,200.0)"][str(an_integer)]=some_floating_value. After storing it to shelve, I need to access the keys back, and sort them by some_floating_value and str(an_integer) [see above], which is why I REALLY need my "secondary key" as int. Would shelve be a good use-case for this ? also bumped into this http://stackoverflow.com/questions/17972098/storing-a-7millions-keys-python-dictionary-in-a-database?rq=1 - would this fit my use-case ? – ekta Aug 10 '14 at 12:41
  • @abarnert My use-case has about 10^4 outer keys(primary, like key1 above) , and 10^4 inner keys(secondary,like key2 above). On a close review I noticed that I need not store all keys, but that means a lot more heavy-lifting and pop'ing keys after the whole of shelf(key1,key2) is written for one key1(I need to store only top 100 keys of key2, using some distance metric), so I can really delete them after having computed over ALL key2's, for a particular key1. Besides, I don't know how to Pop keys in shelve. – ekta Aug 10 '14 at 12:46
  • 1
    @ekta: If you have 10^8 keys, and need to do things like sorting them or treating them as two values and selecting everything based on just one of the values, then you definitely want a more powerful database than `shelve`. And since `sqlite3` comes in the stdlib and requires no setup, yes, it's a good choice. But it does mean you have to write all your queries in SQL rather than Python; if that's a stumbling block, you may want to consider an ORM or mini-ORM like `sqlalchemy`. – abarnert Aug 10 '14 at 13:50
  • @ekta: Also, thanks for reminding me of this answer, because there's a pretty obvious misleading bit (which I just fixed). – abarnert Aug 10 '14 at 13:56
  • @ekta: Finally, removing keys from a shelf is exactly the same as from a dict: `shelf.pop('a')` if you need the value, `del shelf['a']` if you don't. – abarnert Aug 10 '14 at 13:57
  • @abarnert I will try the sqllite, and let you know. Prelim note -: I tried the OP's code, and now did the insert happen in 70 secs, even after creating the table with integers & PrimaryKey. Probably I need to read and Understand more & do less. Thank you again for your inputs. – ekta Aug 10 '14 at 18:01
  • @ekta: Good luck! Learning about relational databases (to the point where you understand what the poster in that other thread is going on about with indexes and why) is a pretty big topic. Hopefully there are good free tutorials nowadays; I learned by reading a textbook on relational theory, another textbook on how that was mapped to fancy new RDBMSs like DB2 (which came out in 1983). :) – abarnert Aug 10 '14 at 19:20
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/59066/discussion-between-ekta-and-abarnert). – ekta Aug 10 '14 at 19:57
  • @abarnert I still keep finding better ways to do this, and more. Landed here, http://stackoverflow.com/questions/20786895/python-populate-a-shelve-object-dictionary-with-multiple-keys - if possible, would like your take on this. Would this approach avoid the "tuple" object for multiple keys ? Shelf[primary-Key][Secondary-key] // Not possible - so we used Shelf[(primary-Key, Secondary-key)] as tuple of keys. Works slower for iteration, while reading back from db. – ekta Aug 30 '14 at 18:54
  • @ekta: I don't understand your question. Both my answer and the answer you linked to are using some kind of aggregate key (whether an encoded tuple or otherwise), so I don't see how either one avoids aggregate keys, or even attempts to. As I said in one of the other questions, you really should create a question and lay out exactly what you want to know (with all the links and background that's relevant) instead of posting followup comments. – abarnert Sep 02 '14 at 17:57
  • @abarnert Posted this here, http://stackoverflow.com/questions/25660288/optimizing-reads-writes-in-a-shelve-object-to-scale-up-for-larger-dataset, would aprreciate your take on this. I still haven't posted my original question on the tuple with floating point as keys, but that's for another time. – ekta Sep 04 '14 at 08:07