28

I have sets of hashes (first 64 bits of MD5, so they're distributed very randomly) and I want to be able to see if a new hash is in a set, and to add it to a set.

Sets aren't too big, the largest will be millions of elements, but there are hundreds of sets, so I cannot hold them all in memory.

Some ideas I had so far:

  • I tried just keeping it all in an sqlite table, but it becomes really, really slow once it cannot fit everything in memory.
  • Bloom filters sound like they would have too high an error rate. I don't mind a tiny error rate (a 64-bit hash already gives about one collision on a 4G-element set), but error rates like 1% are far too high. (A quick estimate follows after this list.)
  • Keep a sorted list of hashes with gaps in a file, and resize when I run out of gaps. Hashes are uniformly distributed, so even a very simple scheme like that should work.
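
To put numbers on that comparison, a quick back-of-the-envelope check in Python (the 4G set size and 1% rate are just the figures quoted above):

    # Back-of-the-envelope error rates for the numbers above.
    n = 2 ** 32                # "4G elements", the worst case mentioned

    # Birthday bound: expected number of colliding pairs among n random
    # 64-bit values is roughly n*(n-1) / (2 * 2**64).
    expected_collisions = n * (n - 1) / (2.0 * 2 ** 64)
    print("expected 64-bit collisions at 4G elements: %.2f" % expected_collisions)  # ~0.5

    # A Bloom filter tuned for 1% false positives instead gives one false
    # hit per ~100 negative lookups, many orders of magnitude worse.
    print("false positives per million lookups at 1%%: %d" % (0.01 * 10 ** 6))      # 10000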

Am I missing something really obvious? Any hints on how to implement a good disk-based hashtable?

taw
  • is the data in the sets somehow related? can most of the data be found in most other sets too, or is it in exactly one? – martinus Jan 30 '09 at 12:55
  • If you have a very high miss rate (i.e. most lookups are negative), a Bloom filter in front of a B-tree or cdb still helps you, as you only need a disk lookup for the 1% of errors. – eckes May 27 '14 at 11:12

6 Answers

19

Here's the solution I eventually used:

  • One file per set
  • The file contains 2^k buckets, each 256 bytes, i.e. 32 entries of 8 bytes
  • Empty entries are just zeroed out (000... is a valid hash, but I don't care about the 2^-64 chance of a collision with it, since everything can already collide with everything else by the nature of hashing).
  • Every hash resides in the bucket selected by its first k bits
  • If any bucket overflows, double the file size and split every bucket
  • Everything is accessed via mmap(), not read()/write()

It's just unbelievably faster than sqlite, even though it's low-level Perl code, and Perl really isn't meant for high-performance databases. It will not work with anything that's less uniformly distributed than MD5; it assumes everything will be extremely uniform to keep the implementation simple.

I tried it with seek()/sysread()/syswrite() at first, and it was very slow; the mmap() version is a lot faster.
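
For anyone curious what that looks like in code, here is a minimal sketch of the scheme (in Python rather than the original Perl; the class name and the initial k are just placeholders, and hash value 0 is reserved as the empty-slot marker, as noted above):

    import mmap, os, struct

    BUCKET_ENTRIES = 32                    # 32 entries of 8 bytes = 256-byte buckets
    ENTRY = struct.Struct(">Q")            # one 64-bit hash per entry; 0 means "empty"

    class DiskHashSet:
        def __init__(self, path, k=10):    # path and initial k are illustrative
            self.path = path
            if not os.path.exists(path):
                self._create(k)
            self._open()

        def _create(self, k):
            with open(self.path, "wb") as f:
                f.truncate((1 << k) * BUCKET_ENTRIES * 8)   # 2^k zero-filled buckets

        def _open(self):
            self.f = open(self.path, "r+b")
            size = os.path.getsize(self.path)
            self.k = (size // (BUCKET_ENTRIES * 8)).bit_length() - 1
            self.map = mmap.mmap(self.f.fileno(), 0)        # whole file, read/write

        def _bucket_range(self, h):
            b = h >> (64 - self.k)                          # first k bits pick the bucket
            start = b * BUCKET_ENTRIES * 8
            return start, start + BUCKET_ENTRIES * 8

        def contains(self, h):
            start, end = self._bucket_range(h)
            for off in range(start, end, 8):
                (v,) = ENTRY.unpack_from(self.map, off)
                if v == h:
                    return True
                if v == 0:                                  # hit an empty slot: not present
                    return False
            return False

        def add(self, h):
            start, end = self._bucket_range(h)
            for off in range(start, end, 8):
                (v,) = ENTRY.unpack_from(self.map, off)
                if v == h:
                    return
                if v == 0:
                    ENTRY.pack_into(self.map, off, h)
                    return
            self._split()                                   # bucket full: double and rehash
            self.add(h)

        def _split(self):
            # Simplest possible split: read every hash, recreate the file at
            # twice the size, and re-add them one by one.
            old = [ENTRY.unpack_from(self.map, off)[0]
                   for off in range(0, len(self.map), 8)]
            self.map.close()
            self.f.close()
            os.remove(self.path)
            self._create(self.k + 1)
            self._open()
            for v in old:
                if v:
                    self.add(v)

Each lookup touches a single 256-byte bucket, so once the file is warm in the page cache a membership test is just a few memory reads.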

taw
  • >> "If any bucket overflows, double file size and split every bucket" - I've heard of this technique, but how exactly does it work? – Nick Aug 08 '13 at 15:36
  • 1
    Nick: The simplest way is to just create a new empty file, iterate through every hash from old file and add them one by one. – taw May 27 '14 at 15:36
  • How did you fill buckets in the middle of the file? Did you recreate the file after insertion? – user12384512 Jul 29 '17 at 17:59
  • user12384512: What do you mean? It fills the whole file with zeroes when it initializes, so buckets in the middle are empty. – taw Aug 03 '17 at 15:26
  • For a large file, wouldn't mmap take a lot of memory to load the whole file into RAM? – TomSawyer Apr 22 '20 at 15:30
12

I had some trouble picturing your exact problem/need, but it still got me thinking about Git and how it stores SHA1-references on disk:

Take the hexadecimal string representation of a given hash, say, "abfab0da6f4ebc23cb15e04ff500ed54". Chop off the first two characters of the hash ("ab", in our case) and make that into a directory. Then use the rest ("fab0da6f4ebc23cb15e04ff500ed54") as the file name, create the file, and put your data in it.

This way, you get pretty decent on-disk performance (depending on your FS, naturally) with automatic indexing. Additionally, you get direct access to any known hash just by wedging a directory delimiter after the first two characters ("./ab/fab0da[..]").
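
As a rough illustration in Python (the hashstore root directory is arbitrary, and since you only need set membership, an empty file per hash is enough):

    import os

    ROOT = "hashstore"                       # illustrative storage root

    def path_for(hex_hash):
        # First two hex characters become the directory, the rest the file name,
        # e.g. "abfab0da..." -> "hashstore/ab/fab0da..."
        return os.path.join(ROOT, hex_hash[:2], hex_hash[2:])

    def add(hex_hash):
        p = path_for(hex_hash)
        os.makedirs(os.path.dirname(p), exist_ok=True)
        open(p, "a").close()                 # an empty file marks membership

    def contains(hex_hash):
        return os.path.exists(path_for(hex_hash))

    add("abfab0da6f4ebc23cb15e04ff500ed54")
    print(contains("abfab0da6f4ebc23cb15e04ff500ed54"))   # True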

I'm sorry if I missed the ball entirely, but with any luck, this might give you an idea.

Henrik Paul
6

Sounds like a job for Berkeley DB.
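
For example, via the third-party bsddb3 Python bindings, a rough sketch might look like this (the file name and the choice to store empty values are arbitrary; the keys themselves are the set):

    from bsddb3 import db                          # Python bindings for Berkeley DB

    d = db.DB()
    d.open("hashes.db", dbtype=db.DB_HASH,         # on-disk hash table,
           flags=db.DB_CREATE)                     # created if missing

    def add(h):                                    # h: the 8-byte truncated MD5, as bytes
        d.put(h, b"")                              # value is unused

    def contains(h):
        return d.get(h) is not None

    add(b"\x01\x23\x45\x67\x89\xab\xcd\xef")
    print(contains(b"\x01\x23\x45\x67\x89\xab\xcd\xef"))   # True
    d.close()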

David Schmitt
  • 1
    Yes, I thought about it. It has nasty licensing, but first of all I wonder if it would really be faster than sqlite once it goes over available memory. – taw Jan 30 '09 at 11:13
  • BDB gives you many more knobs to tune than sqlite. Since you know your data very well, this can give you the needed edge. In the end though, running out of memory *is* a hard barrier and only trying it out will answer your question. – David Schmitt Jan 30 '09 at 12:16
3

Other disk-based hashing algorithms/data structures include linear hashing and extendible hashing.

Sid Anand
1

Two algorithms come to my mind at first:

  • Use a b-tree.
  • Separate-chain the hashes themselves by doing something like using the first 10 bits of your hash to index into one of 1024 individual files, each of which contains a sorted list of all the hashes starting with those 10 bits. That gives you a constant-time jump into a block that ought to fit into memory, and a log(n) search once you've loaded that block. (Or you could use 8 bits to hash into 256 files, etc.) A rough sketch follows below.
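
A rough sketch of the second idea in Python, assuming the hashes are handled as 64-bit integers (the hashes/ directory is arbitrary, and rewriting a whole block on insert keeps the sketch simple):

    import os, struct
    from bisect import bisect_left, insort

    DIR = "hashes"                                   # illustrative storage directory
    os.makedirs(DIR, exist_ok=True)

    def _file_for(h):
        return os.path.join(DIR, "%03x" % (h >> 54)) # first 10 bits pick one of 1024 files

    def _load(path):
        if not os.path.exists(path):
            return []
        with open(path, "rb") as f:
            data = f.read()
        return list(struct.unpack(">%dQ" % (len(data) // 8), data))  # sorted 64-bit hashes

    def contains(h):
        hashes = _load(_file_for(h))
        i = bisect_left(hashes, h)                   # log(n) search within the block
        return i < len(hashes) and hashes[i] == h

    def add(h):
        path = _file_for(h)
        hashes = _load(path)
        i = bisect_left(hashes, h)
        if i < len(hashes) and hashes[i] == h:
            return
        insort(hashes, h)                            # keep the block sorted
        with open(path, "wb") as f:
            f.write(struct.pack(">%dQ" % len(hashes), *hashes))
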
Crashworks
0

Since a hash lookup requires random access, I doubt any database will give you decent performance. Your best bet might be to increase the disk cache (more RAM) and to get hard disks with very high random-access speed (maybe solid-state disks).

martinus
  • 2
    I did it with a very simple hashtable stored on disk and accessed via mmap() - 32 entries per bucket, doubling file size on overflow of any bucket - it's just unbelievably faster than sqlite, even though I implemented it in Perl, which really isn't meant for stuff like that. – taw Feb 03 '09 at 22:17