I am working in a library in which some elements, that do not matter here, must be identified with a name (i.e. values are associated to names). Names are strings for the user, whatever their internal representation is, and should behave transparently.
- Names are constant and initialized with strings literals. They are known at compile-time.
- Two names that were initialized with different strings must compare different, whatever their internal representation is.
- Names may be of arbitrary length. My library does not set any restriction.
- Virtually, there must not be a limit on possible names. The implementation's constraints must not affect the interface, so, again, no restrictions from my side.
Considering that frequent lookup will happen, I have thought about using an unordered map.
Unordered associative containers store their elements, whatever their type is, by numbers (typically of type std::size_t
), which are obtained with a hash function. This means that:
- For types whose number of possible values is less or equal to that of the hash value, no collisions should happen.
- For types whose number of possible values is greater than that of the hash value, collisions may happen, since some data is lost in the hashing process.
I have thought about two solutions.
Hashing by value
Using the data itself to compute the hash value. Considerations:
- Possibly computed at compile-time. Since the names are constructed from string literals, a
constexpr
hashing function could be invoked by the constructor (which would beconstexpr
itself) and the hash value stored in the class itself, for quick retrieval later (by the hasher object). - How often would collisions happen? Which would be the best algorithm?
Hashing by order
The Boost.Log library, as explained here, maintains a global (i.e. static) table that associates names with their hash value. A possible implementation would be the following:
- When a name is constructed (from a string literal), the table is looked up (performing exact comparisons).
- If it isn't found, it is registered at the end of the container.
- Its entry's offset in the table becomes its hash value.
Considerations:
- Very slow. For every name constructed, as much string comparisons as names registered would have to be performed. This is not much better than a traditional
std::map
, is it? - Thread unsafe. The table would have to be protected.
- Forcibly done at runtime.
Questions
- Is it right to use an unordered map under these conditions? Would it be better to use a
std::map
instead? - If 1 is 'yes', which approach is better, and why? The one used in Boost.Log seems really unefficient, why is it used instead of the other I explained, even if strings are not necessarily known at compile-time?
Note: I have not added the c++14
tag even though I have access to the experimental support offered by gcc and clang. Please, do not hesitate using features included in the upcoming specification.