
I need to store a big hash set, able to contain up to approx. 200 million 40-bit values. Storing them as 200 million 64-bit values would be acceptable, despite the 200 million * 24 bits of overhead (roughly 1.6 GB raw at 64 bits per value, versus about 1 GB if packed at 40 bits).

The requirements are:

  • tiny memory footprint (disk space ain't an issue, memory is)

  • fast contains(long l) and add(long l) methods (much faster than SQL); see the interface sketch after this list

  • embedded

  • free and without nasty licensing (no Berkeley DB). LGPL fine.

  • no false positive and no false negative, so things like disk-based Bloom Filters are not what I'm after
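
To pin down the contract, something like the following sketch is what I mean (LongSet is a made-up name for illustration, not an existing API):

    // Sketch of the desired contract only; LongSet is a made-up name, not an existing API.
    public interface LongSet extends java.io.Closeable {
        /** Adds a 40-bit value; returns true if it was not already present. */
        boolean add(long value) throws java.io.IOException;

        /** Exact membership test: no false positives, no false negatives. */
        boolean contains(long value) throws java.io.IOException;
    }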

SQL is not what I'm after here, because I really think I'm after something fast like the approach in this question (notice how much faster the accepted solution there is than a SQL solution):

Fast disk-based hashtables?

Does Google have such a Java API?

Would a fast disk-based key/value pair implementation where I'd only use the 'key' work?

Or something else?

I'd rather not reinvent the wheel.

cocotwo
  • 200 million is about 28 bits. – Thorbjørn Ravn Andersen Feb 27 '10 at 09:06
  • @Thorbjorn: I'm talking about storing 200 million values, each of which is 40 bits. I never said anything about the values themselves only going up to 200 million :) – cocotwo Feb 27 '10 at 09:09
  • Out of curiosity, have you tried it with SQLite, etc. ? I didn't see any evidence in the question that you linked that the OP turned off journaling, or even created indexes... – Steven Huwig Feb 27 '10 at 09:10
  • @Steven Huwig: no, and I specified SQL was not what I was after... When I was talking about something *fast* I had more in mind answers like the one Peter Lawrey made :) Do you honestly think that, in the question I linked, spending time tuning SQLite would come anywhere close to the accepted answer, which specifically states that it runs circles around SQL? We don't care about relational properties, nor ACID guarantees, etc. SQL for such a simple problem is massive overkill. A golden hammer, in other words. – cocotwo Feb 27 '10 at 11:38
  • "Running circles" (the actual phrase in the post is "unbelievably faster") is hardly quantitative, nor is the methodology of "tuning" attempted with SQLite adequately described in the post. My question remains -- have you already implemented a solution that is too slow, if so what is the solution and how much faster does it need to be? Premature optimization is the root of all evil. – Steven Huwig Feb 27 '10 at 13:20
  • @Steven Huwig: We're doing high-performance computing using Java (and Java is actually good at that). The name of the game is optimization. I asked a very specific, not too noobish, question and specifically said that SQL wasn't what I was after. I'm not after "SQL" and "XML" type answers. I'm after answers like the one Peter Lawrey made. – cocotwo Mar 02 '10 at 13:01
  • You might be ruling out SQL for the wrong reasons; using one of the embedded databases (which happen to have a SQL interface) is not the worst idea, especially if you need some consistency protection. – eckes May 27 '14 at 11:06

3 Answers

2

If you can afford 128 GB of disk, you could store one bit per possible 40-bit value (2^40 bits = 2^37 bytes = 128 GiB). You could then use a random access file to check whether a bit is set or to set it. You wouldn't have to insert any values or maintain an index.
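
A minimal sketch of this idea over a plain RandomAccessFile (the class name, the lack of caching and the absence of error handling are additions for illustration, not part of the answer):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Minimal sketch: one bit per possible 40-bit value, kept in a
    // 2^37-byte (128 GiB) file. Not a tuned implementation.
    public class DiskBitSet implements AutoCloseable {
        private final RandomAccessFile file;

        public DiskBitSet(String path) throws IOException {
            file = new RandomAccessFile(path, "rw");
            file.setLength(1L << 37);   // 2^40 bits = 2^37 bytes = 128 GiB
        }

        public void add(long value) throws IOException {
            long byteIndex = value >>> 3;           // which byte holds this value's bit
            int mask = 1 << (int) (value & 7);      // which bit inside that byte
            file.seek(byteIndex);
            int current = file.read();
            file.seek(byteIndex);
            file.write(current | mask);
        }

        public boolean contains(long value) throws IOException {
            file.seek(value >>> 3);
            return (file.read() & (1 << (int) (value & 7))) != 0;
        }

        @Override
        public void close() throws IOException {
            file.close();
        }
    }

Each add/contains is one or two small reads/writes at a computed offset, so the OS page cache does most of the work; on most filesystems the file can be sparse, so disk is only consumed as pages are actually written.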

Peter Lawrey
  • @Peter Lawrey: interesting... I had another question, before this one, where I asked what the fastest way in Java to do random lookups in big files was :) Sadly it is not feasible in my case: this has to be installed on several computers. A few gigabytes are OK, but 128 is too much :( I like the approach, that said... – cocotwo Feb 27 '10 at 11:42
  • You could also use a Bloom filter; it takes roughly 10 bits per key to detect whether a key is present with a 1% false positive rate, if that is the operation you are looking for. If the data is static, a tree with a high fan-out is also possible. – eckes May 27 '14 at 11:08
  • @Peter Lawrey: can you explain in a bit more detail please? (I know this is old, but data structures are never old.) – Laukik Nov 22 '22 at 06:38
  • @Laukik: create an array of (1 << 40) bits, i.e. 128 << 30 bytes (128 GiB), as a fixed-size bit set. – Peter Lawrey Nov 22 '22 at 14:01
2

Try sg-cdb (the Strange Gizmo port of djb's cdb), and then swap out the RandomAccessFile for a BufferedRandomAccessFile (there's a good one in the jai-imageio code).

I'm getting mind-blowing write performance. Through the roof, because it's all buffered and then committed in one fell swoop. Read performance with the buffered RAF is unchanged, though.

I could take the time to compare with H2 bulk import; it might be comparable, though I'm not sure.

The database is simple: key => value, with lookup supported only on the key. In my case the keys are (a base32-encoded random int + a base32-encoded unique int), so locality shouldn't really be helping much. The values are arrays of 120 random bytes.
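
The write/read path looks roughly like the following sketch (not the actual benchmark code; the CdbMake/Cdb method names are from memory of the com.strangegizmo.cdb API, so double-check them against the javadoc of the version you use):

    import com.strangegizmo.cdb.Cdb;
    import com.strangegizmo.cdb.CdbMake;

    // Rough sketch of an sg-cdb write/read path. Method names (start/add/finish,
    // find) are quoted from memory; verify against your version's javadoc.
    public class CdbSketch {
        public static void main(String[] args) throws Exception {
            CdbMake maker = new CdbMake();
            maker.start("test.cdb");                     // create the constant database
            maker.add("someKey".getBytes("UTF-8"),       // lookup is on the key only
                      new byte[120]);                    // 120-byte value, as in the benchmark
            maker.finish();                              // write the hash tables and close

            Cdb cdb = new Cdb("test.cdb");               // open for reads
            byte[] value = cdb.find("someKey".getBytes("UTF-8"));
            System.out.println(value == null ? "absent" : "present");
            cdb.close();
        }
    }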

Loads (SQL insert)

H2, with a 131 MB cache (timing includes flush, not startup):

May 4, 2011 11:45:10 PM test.TestH2Simple main : inserts performed added in:101625 ms

Database size: about 140 MB

Batch size 2000: inserts performed, added in: 116875 ms

Batch size 10000: inserts performed, added in: 70234 ms

Compare with sg-cdb (Strange Gizmo port of cdb):

With RandomAccessFile: insert without flush: 21688 ms, flush: 30359 ms, total: 52047 ms; size of file on disk: 66.1 MB (69,315,632 bytes)

With BufferedRandomAccessFile: about 6.5 seconds

Of course, it's not really a fair comparison, since H2 is flushing data continuously during the insert, whereas sg-cdb is not. This has to be borne in mind when comparing the numbers. A fairer comparison would probably be sg-cdb insert versus H2 bulk insert.
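
For context, the H2 "batch size" runs are plain JDBC batch inserts along these lines (a sketch, not the actual TestH2Simple code; table/column names, row count and JDBC URL are illustrative):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    // Sketch of a batched JDBC insert into embedded H2. Table/column names,
    // row count and batch size are illustrative, not the actual test code.
    public class H2BatchInsertSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./testdb")) {
                conn.setAutoCommit(false);
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS kv (k VARCHAR(32) PRIMARY KEY, v VARBINARY(120))");
                }
                int batchSize = 10000;                      // e.g. the fastest run above used batches of 10000
                try (PreparedStatement ps = conn.prepareStatement("INSERT INTO kv (k, v) VALUES (?, ?)")) {
                    for (int i = 0; i < 500000; i++) {      // row count is illustrative
                        ps.setString(1, "key" + i);
                        ps.setBytes(2, new byte[120]);      // 120-byte value, as in the benchmark
                        ps.addBatch();
                        if ((i + 1) % batchSize == 0) {
                            ps.executeBatch();              // push the accumulated batch to H2
                        }
                    }
                    ps.executeBatch();                      // flush any remainder
                }
                conn.commit();
            }
        }
    }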

Reads (SQL select)

sg-cdb: searched:490000 finished on:1304544550439 took:17547 ms, counter:0

H2: selects performed in: 19579 ms

Concerning memory-mapped files: they don't seem to be what you're looking for. The great performance with memory-mapped files comes when you map about 100 MB or more into memory (in my experience).

0

I believe you will need to use a B-tree rather than a hash table. Hash tables do not have good locality for secondary storage, so you will lose too much time to disk I/O.

Most databases, relational or not, implement their indexes as a B-tree, so you are talking about the equivalent of storing an index with no other data attached to it.
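
As a concrete illustration of "an index with no other data attached to it", here is a sketch of a single-column table whose primary key is the value itself, using embedded H2 purely as an example of a B-tree-backed store (the table name and the use of H2's MERGE for duplicate-safe inserts are illustrative choices, not the only option):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // "An index with no other data attached": a single-column table whose
    // primary key (a B-tree in most engines) is the 40-bit value itself.
    // H2 is just one example of an embedded store; names are made up.
    public class BTreeLongSetSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./longset")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS longset (v BIGINT PRIMARY KEY)");
                }
                long value = 123456789L;

                // add(long): H2's MERGE upserts on the primary key, so duplicates are harmless
                try (PreparedStatement add = conn.prepareStatement("MERGE INTO longset (v) VALUES (?)")) {
                    add.setLong(1, value);
                    add.executeUpdate();
                }

                // contains(long): a primary-key lookup, i.e. one B-tree probe
                try (PreparedStatement q = conn.prepareStatement("SELECT 1 FROM longset WHERE v = ?")) {
                    q.setLong(1, value);
                    try (ResultSet rs = q.executeQuery()) {
                        System.out.println("contains(" + value + ") = " + rs.next());
                    }
                }
            }
        }
    }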

Will you have multiple processes concurrently updating this data store?

Steven Huwig