
Why was DJB's CDB (constant database) designed to use 256 hash tables?

Why not a single, bigger 252 * 256 hash table?

Is it only to save space, or is there some other reason?

Nick

1 Answer


DJB CDB uses two levels of hash tables. The first table is a fixed-size 2K block at the beginning of the file. The second set of tables sits at the end of the file and is built in memory as data is streamed into the cdb. Once all data has been streamed in, the second set of hash tables is written out to disk, and then the first table (at the beginning of the file) is populated with the offset of each table in the second set.

In other words, the multi-level hash tables allow streaming creation of the cdb with the simple exception of writing the beginning 2K of the file at the end of the cdb creation.
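As a rough sketch of that streaming creation, here is hypothetical Python (not DJB's C implementation), following the cdb layout described above: a 2K header of 256 (position, slot count) pairs, then the records, then the 256 second-level tables of (hash, record offset) slots:

```python
import io
import struct

def cdb_hash(key: bytes) -> int:
    # DJB's cdb hash: h starts at 5381; h = ((h << 5) + h) ^ c per byte.
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

def cdb_create(records, out):
    # Reserve the fixed 2K first-level table: 256 (pos, nslots) pairs.
    out.write(b"\0" * 2048)

    # Stream the records out, remembering (hash, offset) per bucket.
    buckets = [[] for _ in range(256)]
    for key, value in records:
        pos = out.tell()
        out.write(struct.pack("<LL", len(key), len(value)) + key + value)
        h = cdb_hash(key)
        buckets[h & 255].append((h, pos))

    # Stream the 256 second-level tables (2x slots, open addressing).
    pointers = []
    for entries in buckets:
        nslots = 2 * len(entries)
        pointers.append((out.tell(), nslots))
        slots = [(0, 0)] * nslots
        for h, pos in entries:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:      # linear probe past occupied slots
                i = (i + 1) % nslots
            slots[i] = (h, pos)
        for slot in slots:
            out.write(struct.pack("<LL", *slot))

    # Only now is the first-level table known; seek back and fill it in.
    out.seek(0)
    for pos, nslots in pointers:
        out.write(struct.pack("<LL", pos, nslots))
```

Note that the only non-sequential write is the final `seek(0)`: everything else is appended in order, which is what makes one-pass streaming creation possible.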

Access to the cdb is fast: a lookup hits the first table (the 2K at the beginning of the file) to find the offset of the appropriate second-level table at the end of the file, and that second-level table in turn provides the location of the data in the cdb.
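That two-step lookup might look like the following Python sketch (again hypothetical, assuming the layout above: `h & 255` selects the header entry, and probing within the selected table starts at slot `(h >> 8) % nslots`):

```python
import struct

def cdb_hash(key: bytes) -> int:
    # DJB's cdb hash: h starts at 5381; h = ((h << 5) + h) ^ c per byte.
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

def cdb_get(data: bytes, key: bytes):
    h = cdb_hash(key)
    # Step 1: the fixed 2K header -- 256 (pos, nslots) pairs -- picks
    # one of the 256 second-level tables.
    pos, nslots = struct.unpack_from("<LL", data, (h & 255) * 8)
    if nslots == 0:
        return None
    # Step 2: probe that table; each slot is a (hash, record offset) pair.
    i = (h >> 8) % nslots
    for _ in range(nslots):
        slot_hash, rec_pos = struct.unpack_from("<LL", data, pos + i * 8)
        if rec_pos == 0:
            return None                  # empty slot: key is absent
        if slot_hash == h:
            klen, vlen = struct.unpack_from("<LL", data, rec_pos)
            if data[rec_pos + 8:rec_pos + 8 + klen] == key:
                return data[rec_pos + 8 + klen:rec_pos + 8 + klen + vlen]
        i = (i + 1) % nslots
    return None
```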

Further information can be found in the NOTES at https://github.com/gstrauss/mcdb/ which is a rewrite of DJB's venerable cdb. mcdb is faster than cdb and removes the 4GB cdb limitation, among other benefits.

gstrauss
  • This doesn't address why there are 256: the same streaming ability could be accomplished by putting a single pointer at the beginning of the file. – Gavin Wahl Aug 24 '17 at 17:09
  • The fixed size of the first-level hash table means that the first-level hash table, with its fixed offset at the beginning of the file, will always be page-aligned in memory, and is small enough to fit into a single page of memory (4K is typically the smallest page size on x86 systems). Having two levels of hash tables can be more memory efficient, reducing memory page evictions when searching smaller tables. 256 is also a power of 2, so, given a good hash algorithm, the hash bucket can be found with a faster bit shift instead of a more expensive modulus instruction. – gstrauss Sep 09 '17 at 23:46
  • Given the desire to fit the first-level hash table into a single memory page, 256 is the largest power of 2 for the number of cdb first-level hash table entries that will fit. – gstrauss Sep 09 '17 at 23:46
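The two arithmetic points in these comments are easy to check directly (illustrative Python, not code from cdb itself):

```python
# 256 first-level entries of 8 bytes each occupy 2048 bytes, which fits
# comfortably within a typical 4096-byte x86 memory page.
assert 256 * 8 == 2048
assert 256 * 8 <= 4096

# With a power-of-two table count, the first-level bucket index is a
# cheap bit mask rather than a modulus instruction:
for h in (0, 1, 255, 256, 123456789, 2**32 - 1):
    assert h & 255 == h % 256
```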