To learn C, I'm designing a simple object system, perhaps for a Forth-like language. One of the data structures I've designed is the hashtable, or hash_t.

typedef struct {
  array_t* keys;     // intelligent array object
  assoc_t* vals;     // array<pair>

  ssize_t* idxs;     // raw C array
  size_t   idxs_len; // and its length

} hash_t;

I've implemented it based on my understanding of this description of Python 3.6's dictionaries:

a hashtable consists of:
    a non-sparse array_t of keys
    a non-sparse associative array of pairs of values and their keys' hashes
    a sparse raw array mapping each hash to the index of its actual entry.

  { 1: 'a', 2: 'b', 3: 'c' }

  is represented in this structure as:

  hash->keys = array{ 1, 2, 3 }
  hash->vals = assoc{
    pair{ 'a' 5 }
    pair{ 'b' 7 }
    pair{ 'c' 9 }
  }
  hash->idxs = array{ -1 -1 -1 -1 -1  0 -1  1 -1  2  -1 }
                                      ^     ^     ^ 
                                      5     7     9

  where 5, 7, and 9 are the hashes of 1, 2, and 3.

-1 takes the place of None in the Python post to indicate a nonexistent value.
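
As a sanity check on my reading of that layout, here is a tiny self-contained model of the example above, using plain C arrays in place of my array_t/assoc_t types and hard-coding the example hashes 5, 7 and 9:

#include <stdio.h>

int main(void) {
  int  keys[]   = { 1, 2, 3 };          // hash->keys
  char vals[]   = { 'a', 'b', 'c' };    // values from hash->vals
  long hashes[] = { 5, 7, 9 };          // key hashes from hash->vals
  long idxs[11];                        // hash->idxs, length 11

  for (int i = 0; i < 11; i++) idxs[i] = -1;         // -1 = no entry
  for (int i = 0; i <  3; i++) idxs[hashes[i]] = i;  // hash -> entry index

  // looking up key 2: idxs[hash(2)] = idxs[7] = 1, so vals[1] = 'b'
  printf("2 -> %c\n", vals[idxs[7]]);
  return 0;
}

This works fine as long as the hashes stay small.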

The problem arises when my key, 1, (stringified) hashes to 0x340ca71c, or 873,244,444. Thus, the array of key indices (hash->idxs) needs to be sizeof (ssize_t) * (873,244,444 + 1), or 8 * 873,244,445 = 6,985,955,560 bytes, or ~7 GB, which is more RAM than my laptop has, and also more RAM than I want one hashtable to take up.
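
Checking that arithmetic directly:

#include <stdio.h>
#include <sys/types.h>   // ssize_t

int main(void) {
  size_t hashed = 0x340ca71c;                       // 873,244,444
  size_t bytes  = (hashed + 1) * sizeof (ssize_t);  // one slot per possible hash
  printf("%zu slots * %zu bytes = %zu bytes\n",
         hashed + 1, sizeof (ssize_t), bytes);      // ~6.99 GB with 8-byte ssize_t
  return 0;
}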

Surely, each dictionary I create in Python does not take up millions or even hundreds of thousands of bytes of RAM, yet it appears to be implemented this way, in C. What have I missed?

cat

1 Answer


Decide how many buckets you want the hash table to have based on the number of items it will hold, and then reduce each hash to that range.

If you want a hash table to hold around 100,000 items with around 10 items per bucket, you want around 10,000 buckets. So after you compute the hash, take it modulo 10,000 to decide which bucket to put the item in.

Generally, using a prime value for the bucket count tends to work best, so perhaps 9,973.
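
A minimal sketch of that reduction in C (NBUCKETS, bucket_of and find_slot are illustrative names, not part of the question's code; the linear probing shown is just one way to handle the collisions that a fixed bucket count introduces, and it fits the -1-sentinel idxs array from the question, though it means the bucket count must be at least the number of items rather than items/10):

#include <stddef.h>
#include <sys/types.h>   // ssize_t

#define NBUCKETS 9973    // fixed, prime bucket count

// Reduce an arbitrarily large hash to an index into a table of
// NBUCKETS slots, so idxs only ever needs NBUCKETS entries.
size_t bucket_of(size_t hashed) {
  return hashed % NBUCKETS;
}

// Different keys can now land in the same bucket. One simple
// strategy (linear probing) walks forward, wrapping around, until
// a free slot (-1) is found. Assumes the table is not full.
size_t find_slot(const ssize_t *idxs, size_t hashed) {
  size_t b = bucket_of(hashed);
  while (idxs[b] != -1)
    b = (b + 1) % NBUCKETS;
  return b;
}

With the reduction in place, idxs only needs NBUCKETS * sizeof (ssize_t) bytes, roughly 80 KB, no matter how large the individual hash values get.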

David Schwartz
  • 10 items per bucket, a load factor of 10, seems extremely high... It might be justifiable for some uses, but a load factor around 1 seems better for general-purpose use. – hyde Oct 26 '16 at 16:58
  • I'm assuming that he doesn't want to go to the trouble of writing code to vary the number of buckets as the number of elements changes. That forces a compromise between a high load factor when the table has lots of elements and wasted memory when it has very few. – David Schwartz Oct 26 '16 at 17:10
  • Isn't it more common to decide on the load factor, and then re-hash with a larger bucket count when it is exceeded? In any case, the buckets need to be dynamic, or some other strategy is needed for worst-case input (possibly intentionally crafted) which puts most (or all) items in one bucket. – hyde Oct 26 '16 at 17:20
  • I'm not really sure which approach is more common. But for someone trying to learn how to make a hash table, the logical next step is to support some fixed number of buckets with collisions. – David Schwartz Oct 26 '16 at 17:22
  • Prime values for bucket counts are better for open addressing, but not of much use with separate chaining, where I'd suggest a power-of-two number of buckets. – user464502 Oct 26 '16 at 20:34