Unordered map for unique keys and hashing

Question

I am working in a library in which some elements, that do not matter here, must be identified with a name (i.e. values are associated to names). Names are strings for the user, whatever their internal representation is, and should behave transparently.

Names are constant and initialized with strings literals. They are known at compile-time.
Two names that were initialized with different strings must compare different, whatever their internal representation is.
Names may be of arbitrary length. My library does not set any restriction.
Virtually, there must not be a limit on possible names. The implementation's constraints must not affect the interface, so, again, no restrictions from my side.

Considering that frequent lookup will happen, I have thought about using an unordered map.

Unordered associative containers store their elements, whatever their type is, by numbers (typically of type std::size_t), which are obtained with a hash function. This means that:

For types whose number of possible values is less or equal to that of the hash value, no collisions should happen.
For types whose number of possible values is greater than that of the hash value, collisions may happen, since some data is lost in the hashing process.

I have thought about two solutions.

Hashing by value

Using the data itself to compute the hash value. Considerations:

Possibly computed at compile-time. Since the names are constructed from string literals, a constexpr hashing function could be invoked by the constructor (which would be constexpr itself) and the hash value stored in the class itself, for quick retrieval later (by the hasher object).
How often would collisions happen? Which would be the best algorithm?

Hashing by order

The Boost.Log library, as explained here, maintains a global (i.e. static) table that associates names with their hash value. A possible implementation would be the following:

When a name is constructed (from a string literal), the table is looked up (performing exact comparisons).
- If it isn't found, it is registered at the end of the container.
Its entry's offset in the table becomes its hash value.

Considerations:

Very slow. For every name constructed, as much string comparisons as names registered would have to be performed. This is not much better than a traditional std::map, is it?
Thread unsafe. The table would have to be protected.
Forcibly done at runtime.

Questions

Is it right to use an unordered map under these conditions? Would it be better to use a std::map instead?
If 1 is 'yes', which approach is better, and why? The one used in Boost.Log seems really unefficient, why is it used instead of the other I explained, even if strings are not necessarily known at compile-time?

Note: I have not added the c++14 tag even though I have access to the experimental support offered by gcc and clang. Please, do not hesitate using features included in the upcoming specification.

I have just noticed that hash collision does only hurt performance, not functionality (uniqueness of the keys): more than one pair may coexist at the same bucket. Taking this into account, `std::unordered_map`, along with a `constexpr` hashing function, seems the solution. — djsp, Sep 05 '14 at 17:03

score 2 · Answer 1 · answered Sep 05 '14 at 13:11

2

Is it right to use an unordered map under these conditions? Would it be better to use a std::map instead?

If you don't need the entries to be ordered, it is typically more efficient to use an unordered_map than a map. Since both have a nearly identical interface, this is quite easy to measure, of course (which you should do).

If 1 is 'yes', which approach is better, and why? The one used in Boost.Log seems really unefficient, why is it used instead of the other I explained, even if strings are not necessarily known at compile-time?

You should read the Boost documentation a bit better. I didn't read anything about linear complexity lookups. The description of attribute_set suggests an associative container is used (I would expect std::unordered_map, but you can check the source code for yourself). The reason to use an identifier rather than a string is also clearly mentioned in the documentation:

"Working with identifiers is much more efficient than with strings. For example, copying does not involve dynamic memory allocation and comparison operators are very lightweight."

Whether this is beneficial in your case depends on the way you use these data structures. Since you indicate that string identifiers can be represented as string literals (but consider if you'll need to translate these strings), you'd only need to pass around a pointer to copy a string identifier. However, comparisons would still be slower than with boost::attribute_names.

answered Sep 05 '14 at 13:11

D Drmmr

1,223
8
15

In the section the link I provided points to, _Attribute names_, the following is stated: "The name is not stored as a string within the `attribute_name` object. Instead, a process-wide unique identifier is generated and associated with the particular name. This association is preserved until the process termination, so every time the attribute_name object is created for the same name it obtains the same identifier". From that, I deduce that there is a global table, and that, every time a name is created, it (the table) is looked up to see if the same string has already been registered. – djsp Sep 05 '14 at 16:32
@Kalrish So which part of that is "really inefficient"? – D Drmmr Sep 05 '14 at 23:21
@D_Drmmr Every time an `attribute_name` object is constructed from a string, that string is looked up in the process-wide table, i.e. compared against all the previously registered strings. Is not that inefficient? – djsp Sep 06 '14 at 12:01
@Kalrish Where did you read that a linear complexity lookup is used? Did you check the code to see for yourself how it works? – D Drmmr Sep 06 '14 at 12:59
@D Drmmr While I do not know for sure, so it seems. See http://svn.boost.org/svn/boost/trunk/libs/log/src/attribute_name.cpp, particularly the name repository singleton and the comparisons performed with `strlen`. – djsp Sep 06 '14 at 13:10
@Kalrish `//! The container that provides name-based lookup typedef intrusive::set<...` So it has logarithmic complexity lookup times. – D Drmmr Sep 06 '14 at 17:32

score 0 · Accepted Answer · answered Sep 05 '14 at 20:26

Is it right to use an unordered map under these conditions? Would it be better to use a std::map instead?

While collisions may happen for types whose number of possible values is greater than that of the hash value, when they do, the container will notice that there is already a guest in the bucket identified by the hash value, and directly compare the keys. Thus, different keys will never collide. Try with a hash function that always returns a fixed value, and see what happens when you insert keys -- it will become slow, which is why the hashing algorithm is important.

Using a std::unordered_map is therefore a good option, if, as you mention, frequent lookup will happen, and order is not required. You should still measure it against std::map to be sure, though, as D Drmmr advised.

If 1 is 'yes', which approach is better, and why? The one used in Boost.Log seems really unefficient, why is it used instead of the other I explained, even if strings are not necessarily known at compile-time?

If you were concerned about different keys colliding because their hash values were equal, then fear not; as explained above, that is not a problem. You should therefore opt for the first method, since it allows compile-time hashing, and does not suffer from all the problems of the second.

A possible implementation:

// You stated that names were constant and constructed from string literals.
// Borrowed from the example at http://en.cppreference.com/w/cpp/language/constexpr
class
    name final
{
    private:
        const char * const
            s; // string
        const std::size_t
            l; // length

    public:
        template<std::size_t N> constexpr
            name
            ( const char (& s)[N] )
            noexcept
            : s( s ) , l( N-1 )
            { }

        // Interface that enables hashing algorithms to operate on your class.
        // If hashing is to happen at compile-time, the methods must be
        // declared `constexpr`.
};

struct
    hasher final
{
    constexpr std::size_t
        operator()
        ( const name & n )
        const noexcept
        {
            return 0; // read below
        }
};

You would have to implement an interface for hashing algorithms to access the data underlying your name class. Furthermore, as stated in the example, the methods should be constexpr-declared; otherwise, they could not be called from your constexpr-enabled hashing function. As for hashing algorithms, there are many, each appropiated in some circumstances. This page elaborates on the topic and presents an implementation of X65599 that, however, does not use constexpr. You could try it first and check how it behaves in your situation.

Unordered map for unique keys and hashing

Hashing by value

Hashing by order

Questions

2 Answers2

Linked