A Set
is an example of a more general kind of objects collectively known as HashedCollections
. These use some sort of HashTable
to actually store and retrieve their elements.
Given any element
, these tables compute an integer value for it named its hash
. There are several well-known techniques to define the mapping between elements and their hash
values. Some are intrinsic, in the sense that the hash
does not depend on the attributes of the element
, which might change, and hence the hash
remains the same along the life of the element
. Others are extrinsic, in the sense that they may depend on the attributes. However, in the latter case, it is supposed that the particular elements will not be modified while they are referenced from a HashedCollection
(otherwise the HashedCollection
must be rehashed
).
The procedure for storing an element
works as follows:
- The
hash
is computed for the element
.
- An
index
to the table is computed as the remainder of the hash
modulo the length
of the table.
- If the slot at the
index
so calculated is already taken, some policy is applied to resolve the collision.
Step 1 is supposed to be really fast (e.g., the hash
has not cryptographic
strength).
Step 2 assumes (in most of the cases) that the length of the table is a prime number (powers of 2
are also used)
Step 3 can be resolved in basically two different ways:
- The table is sequentially scanned
j
times until the slot at index + j
is free, or
- The element is added to the collection of elements colliding at the given
index
(bucket)
In addition, if there aren't enough empty slots (which increases the probability of collision), the table is enlarged and rehashed
(because the modulo
changed).
With sufficient free slots and a fairly random distribution of the indexing mechanism, the probability of finding the desired slot in O(1)
is very high. Of course, if too many elements collide, the average complexity is no longer O(1)
, but this is supposedly mitigated by the growing policy (+ rehash
).
The retrieval is similar. To check whether an element
belongs in the collection, its hash
and modulo
are computed and the element
is compared with the contents of the target slot. If the comparison fails, the search proceeds linearly in the bucket.
The removal of elements is somewhat more difficult when there is no bucket
and instead the indexes
are increased, but you get the idea.
If you really want to see all of this at work, go ahead and debug the basic operations of HashedCollections
in any Smalltalk dialect. Lots of fun guaranteed.