I'm working on a program in Python that needs to store a persistent "set" data structure containing many fixed-size hash values (SHA256, but that's not important). The critical operations are insert and lookup. Delete is not needed for regular operation. The set will grow over time and eventually may not all fit in memory.
I have considered:
- a
set
stored on disk usingpickle
(slow [several seconds] to write new file to disk, eventually won't fit in memory) - a SQLite database (additional dependency not available by default)
- custom disk-based balanced tree structure, such as B-tree or similar
Ideally, there would be a built-in Python module that provides something that can support these operations. What's a good option here?
After I composed this I found Fast disk-based hashtables? which has some good ideas. I like the mmap/bucket accepted answer there.
(This is for a rewrite of shaback if you're curious.)