The full text of that section states, with the last paragraph being the one you asked about:
A hash table is a data structure that maps keys to values for highly efficient lookup. In a
very simple implementation of a hash table, the hash table has an underlying array and
a hash function. When you want to insert an object and its key, the hash function maps
the key to an integer, which indicates the index in the array. The object is then stored at
that index.
Typically, though, this won't quite work right. In the above implementation, the hash
value of all possible keys must be unique, or we might accidentally overwrite data. The
array would have to be extremely large—the size of all possible keys—to prevent such
"collisions."
Instead of making an extremely large array and storing objects at index hash (key), we
can make the array much smaller and store objects in a linked list at index hash (key) %
array_length.To get the object with a particular key, we must search the linked list for
this key.
Alternatively, we can implement the hash table with a binary search tree. We can then
guarantee an 0(log n) lookup time, since we can keep the tree balanced. Additionally,
we may use less space, since a large array no longer needs to be allocated in the very
beginning.
So they're talking about using a BST (binary search tree) to handle collisions. It wouldn't actually make sense to use a BST as the sole data structure since the whole point of a properly tuned hash is that look-up is on the order of O(1)
, much better than the O(log n)
from a BST. On top of that, using a BST to totally implement a hash table means it's not actually a hash table :-)
However, consider that, when you have collisions in a hash table, a frequent way to handle them is to have each bucket contain a linked list of its items. In the degenerate case (all items hashing to the same bucket), you end up with just a linked list and the O(1)
turns into O(n)
.
So, rather than a linked list at each bucket, you have a BST. Then you no longer have O(n)
search complexity in cases where a single bucket has many items (the previously mentioned collisions).
You use the hash function to find the bucket in O(1)
then search through the BST in O(log n)
if there are collisions. In the best case (one item per bucket), it's still O(1)
. The worst case then becomes O(log n)
rather than O(n)
.
The only thing that originally concerned me about that theory is that they also discuss the fact that a large allocation is no longer necessary. If it's a shared hash/BST combination, you still need to allocate the entire hash table so that seemed incongruous.
However, from the context ("... since a large array no longer needs to be allocated ..."), it appears that they mean they can make the hash table part of the dual data structure smaller as the collisions are more efficient to process. In other words, rather than a 1000-element hash table with linked lists for collisions, you can get away with a 100-element hash table because the collisions are not so damaging to the search time if you use a BST.