Does C++ string hashing hash the string or the memory address?
This question is really about equality versus identity and depends on what you mean when you say "string".
Equality. If you mean the std::string
class, then hashing has nothing to do with memory addresses. The string's actual content is hashed. Two std::string
instances are equal and produce the same hash if the contents are equal to each other.
Identity. If you mean a pointer to some characters in memory, then the memory address is hashed, regardless of what data is saved there. Two "strings" are identical and produce the same hash if they point to the same memory location.
When you deal with strings, you almost always want equality comparisons and are encouraged to use std::string
, because two different string instances representing the same data should be considered equal even if the data lives at different memory addresses, and std::string
always provides those semantics for you, be it with hashing or with simple comparisons like myStr1 == myStr2
.[*]
Hashing char const*
or char*
is very dangerous, because you run into a lot of implemented-defined behaviour. String literals are the prime example for this. For example, consider the following program:
#include <iostream>
int main()
{
char const *a = "foo";
char const *b = "foo";
std::cout << reinterpret_cast<void const*>(a) << "\n";
std::cout << reinterpret_cast<void const*>(b) << "\n";
}
The C++ standard does not tell you if the addresses will be identical or not. Compilers usually allow you to control this behaviour. For instance, Visual C++ has the /GF
flag for it. If you turn it on, the addresses will be identical; otherwise, they won't.
This has quite drastic consequences for hashing. In the following program, it is implementation-defined whether 1 or 2 will be printed:
#include <iostream>
#include <unordered_map>
int main()
{
char const *a = "foo";
char const *b = "foo";
std::unordered_map<char const*, char*> myMap;
myMap[a] = "1";
myMap[b] = "2";
std::cout << myMap.size() << "\n"; // prints 1 or 2
}
Your code also has implemented-defined behaviour; not because of literals, but in a different way:
And in the following case whether a new key is added or not depends on
whether std::string
reallocates to another area of memory or not:
Yes. You should never take c_str()
pointers from two different std::string
instances and assume that the pointers will be identical only because the std::string
instances are equal.
Is this way slower if it has to hash the string itself?
No. I challenge you to come up with a realistic use case for which you can actually measure the difference. Only then is it "way" slower. Otherwise, it's plain old premature optimisation.
But there's more to it. Technically, hashing a single address should be faster than using the entire string contents (or large portions of it) to compute a hash value, because more data is involved. That's quite obvious. But I'm not sure you see the necessity of performing the "expensive" computation. There is no magic involved in this. If your program logic cares about the contents of the strings, then the individual characters must be taken into account. How, even in theory, should you be able to hash data you do not read?
Or, more generally, how to hash something you don't have?
[*] Coincidentally, failure to consider this distinction is the same source for a very common bug in Java, i.e. str1 == str2
having different semantics than str1.equals(str2)
.