On-disk structure for storing a large set of 128-bit integers?

Question

I have about 500 million 128-bit integers, adding about 100M per year. Nothing is ever deleted. The numbers come at a uniform distribution, scale-wise and time-wise.

Basically, all I need is an add operation that also returns whether the number already exists in the DB. Also, I do not want to use too much RAM for this system, so just storing everything in memory is not what I'm looking for.

So far we've been using several MyISAM tables on MySQL, using two bigints as a primary key. This give us OK performance, but I suspect it is not the right tool for this job. We've had some performance problems before splitting tables, and we've had corruptions on power outages. Also, a DB gives us many more feature that we do not need.

I'm using Python on Linux, but I'm open to suggestions.

Similar question in C++.

UPDATE: Marcelo's comment mentioned Bloom Filter, which seems really promising to me. Since I'm working with hashes, I've already given up on complete accuracy, so this might be a great precision/performance trade off.

What can you tell us about the distribution of the numbers ? About the additions each year ? — High Performance Mark, Nov 18 '10 at 09:18
It (should be) uniform, the numbers are hashes. Steady pace, so about 3 add operations per second. — itsadok, Nov 18 '10 at 09:28

Marcelo Cantos · Accepted Answer · 2010-11-19T04:56:11.990

3

Insert each integer into one of a pool of 2ⁿ SQLite databases (2⁸ is probably a good number) chosen by computing an n-bit hash of the integer. Make the one column of the one table a primary key, so that attempting to insert an existing number fails.

Assuming the integers are already random enough, you can probably just pick, say, the least significant byte as the "hash".

EDIT: I've done some testing. I've inserted 25 million entries in about two hours, but it has gobbled up over 1 GB in the process. This is done by generating random numbers and distributing them to 32 subprocesses, each with one SQLite database under its control and commits once every 100,000 inserts. Insertions are currently chugging along at about 1350 Hz, far beyond the 3 Hz your problem demands, but the entire database still fits in cache (I have 8 GB RAM). I won't know the steady-state performance unless I get close to your current database size. At that point, every insertion will induce at least four disk-head movements (read and write the index and table, probably more to drill down into the B+ tree), and then you'll know how much pain you're really in for.

I'm starting to think that this is a genuinely interesting problem that could be solved much more efficiently with a custom-built solution. However, I suspect it will take a reasonable amount of effort to significantly outperform a database engine.

edited Nov 19 '10 at 04:56

answered Nov 18 '10 at 09:54

Marcelo Cantos

181,030
38
327
365

This sounds very similar to what I'm already doing (with n=4). Any reason to prefer SQLite over MySQL? I suspect that the numbers will be stored twice - once for the index and once for the "data". Any idea if that's true? – itsadok Nov 18 '10 at 10:30
Accepting this as a "no answer". Sometimes it feels like B+ trees are all CS has to offer... – itsadok Nov 18 '10 at 21:08
I like SQLite, vs a client-server setup, because it doesn't require a server and has zero maintenance overhead. Also, I almost always use SQLite databases in lieu of simple files. They are more reliable, often faster (depending on the problem) for the same amount of coding effort, and much more flexible. I don't know whether SQLite stores data twice for indexing; I'd guess that it does, to keep things simple when you `DROP INDEX`, but even then, you would only be adding 1 GB/year. Your hard disk should cope just fine. – Marcelo Cantos Nov 18 '10 at 22:57
RE: B+ trees: No they aren't all it has to offer. You could enlist the help of Bloom filters, front-coding, distributed hash tables, etc. But it's a question of whether it's worth the effort for your particular problem. Most of the time, "stick it in a database," is the best answer from an engineering perspective. – Marcelo Cantos Nov 18 '10 at 23:01

score 0 · Answer 2 · answered Nov 18 '10 at 13:20

0

Hash your hashes ?

answered Nov 18 '10 at 13:20

High Performance Mark

77,191
7
105
161

I'm not sure I understand your answer – itsadok Nov 18 '10 at 21:06

On-disk structure for storing a large set of 128-bit integers?

2 Answers2