45

I was running some dynamic programming code (trying to brute-force disprove the Collatz conjecture =P) and I was using a dict to store the lengths of the chains I had already computed. Obviously, it ran out of memory at some point. Is there any easy way to use some variant of a dict which will page parts of itself out to disk when it runs out of room? Obviously it will be slower than an in-memory dict, and it will probably end up eating my hard drive space, but this could apply to other problems that are not so futile.

I realized that a disk-based dictionary is pretty much a database, so I manually implemented one using sqlite3, but I didn't do it in any smart way and had it look up every element in the DB one at a time... it was about 300x slower.

Is the smartest way to just create my own set of dicts, keeping only one in memory at a time, and paging them out in some efficient manner?

Claudiu
  • Related question: [memoize to disk - python - persistent memoization - Stack Overflow](https://stackoverflow.com/questions/16463582/memoize-to-disk-python-persistent-memoization) -- covers `diskcache`, `joblib.Memory`, etc. – user202729 Apr 18 '23 at 03:14

10 Answers

59

The third-party shove module is also worth taking a look at. It's very similar to shelve in that it is a simple dict-like object, but it can store to various backends (such as file, SVN, and S3), provides optional compression, and is even thread-safe. It's a very handy module:

from shove import Shove

mem_store = Shove()                   # defaults to an in-memory store
file_store = Shove('file://mystore')  # file-backed store in ./mystore

value = 42                            # any picklable object
file_store['key'] = value
Matthew Trevor
  • 2
    This deserves more attention than it gets. It can also be used with SQLite if you don't want to run a separate server, or with Berkeley DB if you don't want to use SQLite. – Alan Plum May 24 '11 at 10:06
  • Yes, I've long been looking for this kind of module. I plan to add Redis support, since that's what we're using for our KV store. – k4ml Jul 14 '11 at 09:26
  • 2
    Missing from the snippet is `file_store.sync()` to persist the data to the file (I had a hard time with the docs) and `file_store.close()` at the end of the program. – OrangePot Nov 26 '20 at 22:17
22

Hash-on-disk is generally addressed with Berkeley DB or something similar - several options are listed in the Python Data Persistence documentation. You can front it with an in-memory cache, but I'd test against native performance first; with operating system caching in place it might come out about the same.
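For instance, a minimal sketch using the standard library's dbm module (one of the options on that documentation page); the filename is illustrative, and note that dbm stores keys and values as strings/bytes:

import dbm

# 'c' opens the database for read/write, creating the file if needed.
with dbm.open('chains.db', 'c') as db:
    db[str(27)] = str(111)    # keys and values must be str/bytes
    print(int(db[str(27)]))   # values come back as bytes; int() parses them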

Parand
8

The shelve module may do it; at any rate, it should be simple to test. Instead of:

self.lengths = {}

do:

import shelve
self.lengths = shelve.open('lengths.shelf')

The only catch is that keys to shelves must be strings, so you'll have to replace

self.lengths[indx]

with

self.lengths[str(indx)]

(I'm assuming your keys are just integers, as per your comment to Charles Duffy's post)

There's no built-in caching in memory, but your operating system may do that for you anyway.

[Actually, that's not quite true: you can pass `writeback=True` on creation. The intent of this is to make sure that storing lists and other mutable things in the shelf works correctly. But a side effect is that the whole dictionary is cached in memory. Since this caused problems for you, it's probably not a good idea. :-)]
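Pulling those pieces together, here's a hedged end-to-end sketch (the filename and helper function are illustrative, not from the original answer) that memoizes Collatz chain lengths in a shelf using string keys:

import shelve

lengths = shelve.open('lengths.shelf')   # writeback=False by default

def chain_length(n):
    """Number of terms in the Collatz sequence from n down to 1."""
    if n == 1:
        return 1
    key = str(n)                         # shelve keys must be strings
    if key not in lengths:
        nxt = n // 2 if n % 2 == 0 else 3 * n + 1
        lengths[key] = 1 + chain_length(nxt)
    return lengths[key]

print(chain_length(27))   # -> 112
lengths.close()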

John Fouhy
  • 1
    I have actually tried this, but it was way too slow... I think I need some kind of manual paging solution to get any reasonable speed. – Claudiu Oct 23 '08 at 11:14
6

Last time I was facing a problem like this, I rewrote my code to use SQLite rather than a dict, and had a massive performance increase. That performance increase was at least partially due to the database's indexing capabilities; depending on your algorithms, YMMV.

A thin wrapper that does SQLite queries in `__getitem__` and `__setitem__` isn't much code to write.
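A hedged sketch of such a wrapper (the class and table names are illustrative; it folds in the PRIMARY KEY and parameterized-query advice from the comments below):

import sqlite3

class SqliteDict:
    """Minimal dict-like wrapper over a single sqlite3 table."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS vals '
            '(indx INTEGER PRIMARY KEY, chainlen INTEGER)')

    def __getitem__(self, indx):
        row = self.conn.execute(
            'SELECT chainlen FROM vals WHERE indx = ?', (indx,)).fetchone()
        if row is None:
            raise KeyError(indx)
        return row[0]

    def __setitem__(self, indx, chainlen):
        # Commit is deferred to close() for speed in a tight loop.
        self.conn.execute(
            'INSERT OR REPLACE INTO vals VALUES (?, ?)', (indx, chainlen))

    def __contains__(self, indx):
        return self.conn.execute(
            'SELECT 1 FROM vals WHERE indx = ?', (indx,)).fetchone() is not None

    def close(self):
        self.conn.commit()
        self.conn.close()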

Charles Duffy
  • How exactly would you use sqlite's indexing? the way I did it here was to create a table like this: "cur.execute('create table vals (indx INTEGER, chainlen INTEGER)')", then I "cur.execute('SELECT * from vals where indx=%d' % i)" for a lookup. – Claudiu Oct 22 '08 at 17:50
  • 2
    `create table vals (indx INTEGER PRIMARY KEY, chainlen INTEGER)` – Vinko Vrsalovic Oct 22 '08 at 17:52
  • @Claudiu - my program was such that I could implement some logic in the database layer, so I could let the DB do filtering and such; it was more than just a dumb store. – Charles Duffy Oct 23 '08 at 01:58
  • You could also use a cache-size pragma to tell sqlite to use more memory for its cache: http://sqlite.org/pragma.html – John Fouhy Oct 24 '08 at 05:47
  • 4
    @Claudiu ...by the way, the practice you're using there (using `%` to substitute values into your SQL statements via string formatting) is basically insecure; it's not a big deal for decimals, but if you do it for strings it leaves you open to SQL injection attacks. `cur.execute('SELECT * from vals where indx=?', (i,))` is safer, and can also run faster when invoked multiple times due to the sqlite module caching prepared statements. – Charles Duffy Mar 21 '12 at 11:32
  • @CharlesDuffy: Aye, you're right. I generally do what you suggest, but thanks for bringing it up. Never hurts to have that point driven home more to people. – Claudiu Mar 21 '12 at 13:06
3

With a little bit of thought it seems like you could get the shelve module to do what you want.

Dustin Wyatt
  • 4,046
  • 5
  • 31
  • 60
1

I've read that you think shelve is too slow and that you tried to hack your own dict using sqlite.

Someone else did this too:

http://sebsauvage.net/python/snyppets/index.html#dbdict

It seems pretty efficient (and sebsauvage is a pretty good coder). Maybe you could give it a try?

Bite code
0

You should bring in more than one item at a time if there's some heuristic for knowing which items are most likely to be retrieved next, and don't forget the indexes, as Charles mentions.
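For a sequential access pattern like a Collatz sweep, one hedged sketch of that idea (assuming a table like the one in Charles's answer; names are illustrative) is to pull a contiguous range of keys in a single query and warm an in-memory cache with it:

import sqlite3

def prefetch_range(conn, lo, hi):
    """Fetch all stored chain lengths with lo <= indx < hi in one query."""
    rows = conn.execute(
        'SELECT indx, chainlen FROM vals WHERE indx >= ? AND indx < ?',
        (lo, hi))
    return dict(rows)

# e.g., warm a plain dict cache before sweeping 0..999999:
conn = sqlite3.connect('chains.db')
cache = prefetch_range(conn, 0, 1_000_000)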

Vinko Vrsalovic
0

For simple use cases, sqlitedict can help. However, when you have much more complex databases, you might want to try one of the more upvoted answers.
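A minimal usage sketch, assuming the sqlitedict package is installed (`pip install sqlitedict`); it exposes a dict interface over a SQLite file, with string keys and picklable values (the filename here is illustrative):

from sqlitedict import SqliteDict

with SqliteDict('chains.sqlite', autocommit=True) as lengths:
    lengths['27'] = 111     # keys are strings, values any picklable object
    print(lengths['27'])    # -> 111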

Raymond
0

It isn't exactly a dictionary, but the vaex module provides incredibly fast dataframe loading and lookup. It is lazy-loading, so it keeps everything on disk until needed and loads only the required slices into memory.

https://vaex.io/docs/tutorial.html#Getting-your-data-in
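A hedged sketch, assuming the data has already been written to a vaex-supported format such as HDF5 (the file and column names are illustrative); vaex memory-maps the file rather than reading it all in:

import vaex

df = vaex.open('chains.hdf5')       # lazy: data stays on disk
subset = df[df.indx < 1_000_000]    # filtered view, still lazy
print(subset.head(5))               # only now are slices actually read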

Danny
0

I recently wrote a very simple library for this:

freeze_dried_data

It's a simple 100-line Python file that keeps the keys and the index in memory until the file is closed. Values are kept on disk and unpickled when read. It's installable with pip.
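A usage sketch based on the project's README as I understand it (treat the `FDD` class name and semantics as assumptions and check the package docs):

from freeze_dried_data import FDD   # assumed API, per the project README

fdd = FDD('chains.fdd')             # illustrative filename
fdd['key1'] = 'value1'              # value is pickled out to disk
fdd.close()                         # finalizes the file and the key index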

Trevor