
I'm working on a program in Python that needs to store a persistent "set" data structure containing many fixed-size hash values (SHA256, but that's not important). The critical operations are insert and lookup. Delete is not needed for regular operation. The set will grow over time and eventually may not all fit in memory.

I have considered:

  • a set stored on disk using pickle (slow [several seconds] to write new file to disk, eventually won't fit in memory)
  • a SQLite database (additional dependency not available by default)
  • custom disk-based balanced tree structure, such as B-tree or similar

Ideally, there would be a built-in Python module that provides something that can support these operations. What's a good option here?

After I composed this I found Fast disk-based hashtables? which has some good ideas. I like the mmap/bucket accepted answer there.

(This is for a rewrite of shaback if you're curious.)

Greg Hewgill
  • In fact, the mmap/bucket implementation is exactly what I want. In the absence of any better ideas, I'll code it up and post it here. – Greg Hewgill May 19 '11 at 10:35
  • Out of interest, what's the order of magnitude of how many values you may eventually need to store? – NPE May 19 '11 at 10:48
  • @aix: It's for a backup program, one hash per file. How many files on a filesystem? :) – Greg Hewgill May 19 '11 at 10:53

4 Answers


Another option is to use shelve. I know it's pickle under the hood, but I think it's a good option (and one I didn't see in your list :-)). Or, if you don't mind using a third-party library, you can take a look at shove (it's like a shelve++).
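A minimal sketch of the shelve approach, treating the keys as the set members (the filename and digest below are made up for illustration):

```python
import shelve

# Persistent "set" on top of shelve: store each hash as a key with a
# throwaway value, and test membership with `in`.
digest = "e3b0c44298fc1c14"  # hypothetical hex-encoded hash

with shelve.open("hashes.db") as db:
    db[digest] = 1            # value is irrelevant; only the key matters

# Reopen later -- the data persists on disk.
with shelve.open("hashes.db") as db:
    print(digest in db)       # True, just like `x in some_set`
```

Note that shelve pickles the *values* but backs onto a dbm file for the keys, so lookups don't require loading the whole set into memory the way a pickled `set` does.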

mouad

You could use a DBM-style database. I'm doing a similar thing with dbm, just storing all the keys with a value of '1'. The dbhash module (backed by Berkeley DB) should work, but it's deprecated and gone in Python 3, so it's not a great idea for long-term use. Otherwise, use gdbm (dbm.gnu in Python 3) or ndbm (dbm.ndbm in Python 3). There's also dumbdbm (dbm.dumb in Python 3), which is pure Python and always available, but a bit slower. If you are going to have multiple simultaneous readers and writers, definitely do not use dumbdbm.

The various dbm modules all work just like a Python dictionary, except that the keys and values must be strings (bytes in Python 3). You can use the "in" keyword just like you would for a set or a dict.
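A short sketch of this approach, using the generic dbm.open, which picks the best available backend (dbm.gnu, dbm.ndbm, or the pure-Python dbm.dumb fallback); the filename and digest are made up for illustration:

```python
import dbm

# Store each hash as a key with a dummy value; only membership matters.
digest = bytes.fromhex("e3b0c44298fc1c14")   # example digest; keys are bytes/str

with dbm.open("seen_hashes", "c") as db:     # "c": create the file if missing
    db[digest] = b"1"                        # arbitrary value, per the answer
    print(digest in db)                      # True -- membership test like a set
```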

Brian Minton

dbm with an arbitrary value of 1 for each key, as Brian Minton suggested, is a convenient solution. cPickle is good too.

However, you should also consider using json. Check Google, but AFAIK the json parser is faster than pickle/cPickle (e.g., http://kovshenin.com/2010/pickle-vs-json-which-is-faster/).
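A minimal sketch of the JSON variant, persisting the set as an array of hex digests (the filename and digests are made up). Note that, like pickle, this rewrites the whole file on every save, so it only helps if serialization speed itself is the bottleneck:

```python
import json

hashes = {"e3b0c44298fc1c14", "9f86d081884c7d65"}  # hypothetical digests

# Save: JSON can't encode sets directly, so dump a sorted list.
with open("hashes.json", "w") as f:
    json.dump(sorted(hashes), f)

# Load: rebuild the set from the list.
with open("hashes.json") as f:
    restored = set(json.load(f))

print(restored == hashes)  # True
```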

ThinkBonobo

I think this is what databases like SQLite are made for. Note that the sqlite3 module is part of Python's standard library (since 2.5), so it isn't really an additional dependency. Is there a reason you can't use it?
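A sketch of a persistent set with only the standard-library sqlite3 module (table and column names are made up); a PRIMARY KEY on the hash column gives indexed lookups and deduplication for free:

```python
import sqlite3

con = sqlite3.connect("hashes.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS hashes (h BLOB PRIMARY KEY)")

def insert(digest: bytes) -> None:
    with con:  # commits on success, rolls back on error
        con.execute("INSERT OR IGNORE INTO hashes VALUES (?)", (digest,))

def contains(digest: bytes) -> bool:
    cur = con.execute("SELECT 1 FROM hashes WHERE h = ?", (digest,))
    return cur.fetchone() is not None

insert(b"\x01" * 32)           # a dummy 32-byte "SHA256" digest
print(contains(b"\x01" * 32))  # True
print(contains(b"\x02" * 32))  # False
```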

lassej