
I'm working on a program in Python that needs to store a persistent "set" data structure containing many fixed-size hash values (SHA256, but that's not important). The critical operations are insert and lookup. Delete is not needed for regular operation. The set will grow over time and eventually may not all fit in memory.

I have considered:

  • a set stored on disk using pickle (slow [several seconds] to write new file to disk, eventually won't fit in memory)
  • a SQLite database (additional dependency not available by default)
  • custom disk-based balanced tree structure, such as B-tree or similar

Ideally, there would be a built-in Python module that provides something that can support these operations. What's a good option here?

After I composed this I found Fast disk-based hashtables? which has some good ideas. I like the mmap/bucket accepted answer there.

(This is for a rewrite of shaback if you're curious.)

Greg Hewgill
  • In fact, the mmap/bucket implementation is exactly what I want. In the absence of any better ideas, I'll code it up and post it here. – Greg Hewgill May 19 '11 at 10:35
  • Out of interest, what's the order of magnitude of how many values you may eventually need to store? – NPE May 19 '11 at 10:48
  • @aix: It's for a backup program, one hash per file. How many files on a filesystem? :) – Greg Hewgill May 19 '11 at 10:53

4 Answers


Another option is to use shelve. I know it's pickle under the hood, but I think it's a good option (and one I didn't see in your list :-)). Or, if you don't mind using a third-party library, you can take a look at shove (it's like a shelve++).
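A minimal sketch of the shelve approach, treating the keys as the set members (the filename and digest below are made up for illustration):

```python
import shelve

# Persistent "set" on top of shelve: store each hash as a key with a
# throwaway value, and test membership with `in`.
digest = "e3b0c44298fc1c14"  # hypothetical hex-encoded hash

with shelve.open("hashes.db") as db:
    db[digest] = 1            # value is irrelevant; only the key matters

# Reopen later -- the data persists on disk.
with shelve.open("hashes.db") as db:
    print(digest in db)       # True, just like `x in some_set`
```

Note that shelve pickles the *values* but backs onto a dbm file for the keys, so lookups don't require loading the whole set into memory the way a pickled `set` does.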

mouad

You could use a DBM-style database. I'm doing a similar thing with dbm, just storing all the keys with a value of '1'. The dbhash module (backed by Berkeley DB) should work, but it's deprecated and gone in Python 3, so it's not a great idea for long-term use. Otherwise, use gdbm (dbm.gnu in Python 3) or ndbm (dbm.ndbm in Python 3). There's also dumbdbm (dbm.dumb in Python 3), which is pure Python and always available, but a bit slower. If you are going to have multiple simultaneous readers and writers, definitely do not use dumbdbm.

The various dbm modules all work just like a Python dictionary, except that the keys and values must be strings (bytes in Python 3). You can use the "in" keyword just like you would for a set or a dict.
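A short sketch of this approach, using the generic dbm.open, which picks the best available backend (dbm.gnu, dbm.ndbm, or the pure-Python dbm.dumb fallback); the filename and digest are made up for illustration:

```python
import dbm

# Store each hash as a key with a dummy value; only membership matters.
digest = bytes.fromhex("e3b0c44298fc1c14")   # example digest; keys are bytes/str

with dbm.open("seen_hashes", "c") as db:     # "c": create the file if missing
    db[digest] = b"1"                        # arbitrary value, per the answer
    print(digest in db)                      # True -- membership test like a set
```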

Brian Minton

dbm with an arbitrary value of 1 for each key, as Brian Minton suggested, is a convenient solution. cPickle is good too.

However, you should also consider using json. Check Google, but AFAIK the json parser is faster than pickle/cPickle (e.g., http://kovshenin.com/2010/pickle-vs-json-which-is-faster/).
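A minimal sketch of the JSON variant, persisting the set as an array of hex digests (the filename and digests are made up). Note that, like pickle, this rewrites the whole file on every save, so it only helps if serialization speed itself is the bottleneck:

```python
import json

hashes = {"e3b0c44298fc1c14", "9f86d081884c7d65"}  # hypothetical digests

# Save: JSON can't encode sets directly, so dump a sorted list.
with open("hashes.json", "w") as f:
    json.dump(sorted(hashes), f)

# Load: rebuild the set from the list.
with open("hashes.json") as f:
    restored = set(json.load(f))

print(restored == hashes)  # True
```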

ThinkBonobo

I think this is what databases like SQLite are made for. Note that the sqlite3 module is part of Python's standard library (since 2.5), so it isn't really an additional dependency. Is there a reason you can't use it?
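A sketch of a persistent set with only the standard-library sqlite3 module (table and column names are made up); a PRIMARY KEY on the hash column gives indexed lookups and deduplication for free:

```python
import sqlite3

con = sqlite3.connect("hashes.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS hashes (h BLOB PRIMARY KEY)")

def insert(digest: bytes) -> None:
    with con:  # commits on success, rolls back on error
        con.execute("INSERT OR IGNORE INTO hashes VALUES (?)", (digest,))

def contains(digest: bytes) -> bool:
    cur = con.execute("SELECT 1 FROM hashes WHERE h = ?", (digest,))
    return cur.fetchone() is not None

insert(b"\x01" * 32)           # a dummy 32-byte "SHA256" digest
print(contains(b"\x01" * 32))  # True
print(contains(b"\x02" * 32))  # False
```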

lassej