To learn C, I'm designing a simple object system, perhaps for a Forth-like language. One of the data structures I've designed is the hash table, hash_t:
typedef struct {
array_t* keys; // intelligent array object
assoc_t* vals; // array<pair>
ssize_t* idxs; // raw C array
size_t idxs_len; // and its length
} hash_t;
I've implemented it based on my understanding of this description of Python 3.6's dictionaries:
a hashtable consists of:
a non-sparse array_t of keys;
a non-sparse associative array of (value, key-hash) pairs;
a sparse raw array mapping each key's hash to the index of its entry in the first two arrays.
{ 1: 'a', 2: 'b', 3: 'c' } is represented in this structure as:
hash->keys = array{ 1, 2, 3 }
hash->vals = assoc{
    pair{ 'a' 5 }
    pair{ 'b' 7 }
    pair{ 'c' 9 }
}
hash->idxs = array{ -1 -1 -1 -1 -1 0 -1 1 -1 2 -1 }
                                   ^    ^    ^
                                   5    7    9
where 5, 7, and 9 are the hashes of 1, 2, and 3.
-1 takes the place of None in the Python post to indicate a nonexistent value.
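For concreteness, this is roughly the step in my insertion code that fills in hash->idxs. record_entry and its signature are names made up just for this question; the point is that the slot index is the key's full hash value:

#include <stdlib.h>      /* realloc, abort */
#include <sys/types.h>   /* ssize_t */

/* Hypothetical helper, named only for this post: record where a key's
 * entry lives, using the key's full hash as the index into idxs. */
static void record_entry(hash_t* hash, size_t key_hash, ssize_t entry_idx) {
    if (key_hash >= hash->idxs_len) {
        /* grow idxs so that slot key_hash exists, filling new slots with -1 */
        size_t new_len = key_hash + 1;
        ssize_t* grown = realloc(hash->idxs, new_len * sizeof *grown);
        if (!grown)
            abort();
        for (size_t i = hash->idxs_len; i < new_len; i++)
            grown[i] = -1;
        hash->idxs = grown;
        hash->idxs_len = new_len;
    }
    hash->idxs[key_hash] = entry_idx;  /* e.g. idxs[5] = 0, idxs[7] = 1, idxs[9] = 2 */
}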
The problem arises when my key, 1, (stringified) hashes to 0x340ca71c, or 873,244,444. The array of key indices (hash->idxs) then needs to be sizeof (ssize_t) * (873,244,444 + 1), or 8 * 873,244,445 = 6,985,955,560 bytes, roughly 7 GB, which is more RAM than my laptop has, and also more RAM than I want one hash table to take up.
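In terms of the hypothetical record_entry above, inserting that one key amounts to:

record_entry(hash, 873244444, 0);
/* grows hash->idxs to (873,244,444 + 1) * sizeof (ssize_t)
 * = 873,244,445 * 8 = 6,985,955,560 bytes for a one-entry table */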
Surely, each dictionary I create in Python does not take up millions or even hundreds of thousands of bytes of RAM, yet it appears to be implemented this way, in C. What have I missed?