Does C++ string hashing hash the string or the memory address?

Question

I've never studied hashing algorithms before, and I was surprised that when using std::unordered_map I found out that the hashing function (I think) actually hashes memory addresses, and not strings. Please correct me if I'm wrong, but I found this out by just changing a raw string and adding it to my unordered_map, and when the memory address (pointer) was the same it never added anything.

And in the following case whether a new key is added or not depends on whether std::string reallocates to another area of memory or not:

std::unordered_map<const char*, char*> myMap;

std::string myString = "Key1";

myMap[myString.c_str()] = "someVal";    // <--- Adds a new key, size is now 1
myString = "Key2";
myMap[myString.c_str()] = "someVal";    // <--- Doesn't add a new key "Key2" didn't need to be reallocated

However when using std::string directly in the template when I change the string it DOES add another key to my map, so that would indicate that the unordered_map template is specialised for std::string and actually hashes the string itself? Is this way slower if it has to hash the string itself?

The reason I bring this up is that the tutorials I've seen seem to convey the meaning that it's the actual string itself that gets hashed. Even here on Stack Overflow I've seen people say how (paraphrasing)"the entire string doesn't need to be checked, only as many characters as necessary" for performance reasons.

Well the impression that I got is obviously wrong for string literals and pointers to strings, but not for the std::string class?

A `char*` isn't a string. – juanchopanza Dec 17 '16 at 09:53 — juanchopanza, Dec 17 '16 at 09:53

Walter · Answer 1 · 2016-12-17T10:27:00.020

6

You're mistaken to think that const char* is a string. It actually is a pointer. Hence std::unordered_map<const char*, anything> uses pointers (of type const char*) as key, and the specialization of std::hash for pointers (which hashes the address) as hash key.

If you want to use strings as keys, you should use std::string, e.g. std::unordered_map<std::string, anything>.

edit I should also say that using pointers instead of strings is at least dangerous but often impossible. It will not do what you think. The problem is that the string (character sequence) and its address (pointer) are not necessarily paired up for the lifetime of the program (though this may be true for some const char* objects). Think of the following

std::unordered_map<const char*,int> map;
char str[11] = "bad";
map[str] = 2;           // hashes str = char*
auto x = map["bad"];    // hashes address of "bad"; x!=2

This exemplifies that using the address as key does not work as intended: you cannot obtain an element from the character sequence ("bad")

edited Dec 17 '16 at 10:27

answered Dec 17 '16 at 09:53

Walter

44,150
20
113
196

Yes, I see, it's a 'pointer' to string. I'd grown used to seeing char* as a string as you pass them wherever a string is wanted. So std::string is the only case where the actual string itself is hashed? I imagine both the hashing and the lookups are slower in this case when finding in the map? Come to think of it, if that's the case, isn't using simple pointers and string literals as the key superior? Oh, I have so many questions. – Zebrafish Dec 17 '16 at 10:06
`char*` is pointer to single character. Sometimes it happen to point to an array of characters that is *treated by the programmer* as a "string". There is no such semantic attached to `char*` - so all code had to assume "`char*` is a pointer to a single char" unless something else explicitly specified (which is not the case of maps). – Alexei Levenkov Dec 17 '16 at 22:18

WhiZTiM · Answer 2 · 2016-12-17T10:35:16.177

2

If you look at the standard basic specializations for std::hash. There is no specialization for const char * because that is simply a pointer to a character array. However, there is a specialization for any pointer type:

template< class T > struct hash<T*>;

which is what is used by std::unordered_map. It simply hashes the address.

Simply using const char* as key to your std::unordered_map with default hash and equality is messy, because the default hash function hashes the address, and the default equality function will compare the address. You should prefer std::string for your key, else you need to do something like:

std::unordered_map<const char*, char*, MyCustomHash, MyCustomEquality> myMap;

edited Dec 17 '16 at 10:35

answered Dec 17 '16 at 09:54

WhiZTiM

21,207
4
43
68

Sorry, it's my fault for not knowing more about this, but wouldn't comparing the address be a simple size_t comparison whereas the string comparison would be slower? Is it that whenever you give the map a string key it needs to compare multiple characters before it can jump to the right place? – Zebrafish Dec 17 '16 at 10:14
@TitoneMaurice Yes. But *Andy* may give me a string "Hello" which is stored somewhere, *Bob* may give me another string "Hello" which is stored elsewhere. If we are to compare addresses only, we will fail to know if they are the same. Unless we dig into the string and compare them one character after the other. Besides, using `std::string` has a performance gain because before going character by character comparison, we can conclude comparisons if the sizes are different. – WhiZTiM Dec 17 '16 at 10:19
1

Just to be pedantic, addresses are represented as [`std::uintptr_t`](http://stackoverflow.com/questions/1845482/what-is-uintptr-t-data-type) not `std::size_t` – WhiZTiM Dec 17 '16 at 10:31
@TitoneMaurice: Standard classes and functions are typically optimised in very clever ways, much more so than typical user code. – Christian Hackl Dec 17 '16 at 10:57

score 2 · Answer 3 · answered Dec 17 '16 at 10:55

Does C++ string hashing hash the string or the memory address?

This question is really about equality versus identity and depends on what you mean when you say "string".

Equality. If you mean the std::string class, then hashing has nothing to do with memory addresses. The string's actual content is hashed. Two std::string instances are equal and produce the same hash if the contents are equal to each other.
Identity. If you mean a pointer to some characters in memory, then the memory address is hashed, regardless of what data is saved there. Two "strings" are identical and produce the same hash if they point to the same memory location.

When you deal with strings, you almost always want equality comparisons and are encouraged to use std::string, because two different string instances representing the same data should be considered equal even if the data lives at different memory addresses, and std::string always provides those semantics for you, be it with hashing or with simple comparisons like myStr1 == myStr2.^[*]

Hashing char const* or char* is very dangerous, because you run into a lot of implemented-defined behaviour. String literals are the prime example for this. For example, consider the following program:

#include <iostream>

int main()
{
    char const *a = "foo";
    char const *b = "foo";

    std::cout << reinterpret_cast<void const*>(a) << "\n";
    std::cout << reinterpret_cast<void const*>(b) << "\n";
}

The C++ standard does not tell you if the addresses will be identical or not. Compilers usually allow you to control this behaviour. For instance, Visual C++ has the /GF flag for it. If you turn it on, the addresses will be identical; otherwise, they won't.

This has quite drastic consequences for hashing. In the following program, it is implementation-defined whether 1 or 2 will be printed:

#include <iostream>
#include <unordered_map>

int main()
{
    char const *a = "foo";
    char const *b = "foo";

    std::unordered_map<char const*, char*> myMap;
    myMap[a] = "1";
    myMap[b] = "2";

    std::cout << myMap.size() << "\n"; // prints 1 or 2
}

Your code also has implemented-defined behaviour; not because of literals, but in a different way:

And in the following case whether a new key is added or not depends on whether std::string reallocates to another area of memory or not:

Yes. You should never take c_str() pointers from two different std::string instances and assume that the pointers will be identical only because the std::string instances are equal.

Is this way slower if it has to hash the string itself?

No. I challenge you to come up with a realistic use case for which you can actually measure the difference. Only then is it "way" slower. Otherwise, it's plain old premature optimisation.

But there's more to it. Technically, hashing a single address should be faster than using the entire string contents (or large portions of it) to compute a hash value, because more data is involved. That's quite obvious. But I'm not sure you see the necessity of performing the "expensive" computation. There is no magic involved in this. If your program logic cares about the contents of the strings, then the individual characters must be taken into account. How, even in theory, should you be able to hash data you do not read?

Or, more generally, how to hash something you don't have?

^{[*] Coincidentally, failure to consider this distinction is the same source for a very common bug in Java, i.e. str1 == str2 having different semantics than str1.equals(str2).}

score 1 · Answer 4 · answered Dec 17 '16 at 09:56

1

The code is behaving correctly, since the key is a const char*. Try using std::string as the key to get the behavior you're looking for.

So: std::unordered_map<std::string, char*> myMap;

answered Dec 17 '16 at 09:56

geipel

395
1
3
9

Spock77 · Answer 5 · 2016-12-18T00:26:36.590

-2

Using pointers as keys could be a solution but only for CONSTANT strings - a pointer is the simplest and fastest hash. You can use different const variables to init unordered map, be sure that their lifetime is appropriate.

edited Dec 18 '16 at 00:26

answered Dec 17 '16 at 10:22

Spock77

3,256
2
30
39

Does C++ string hashing hash the string or the memory address?

5 Answers5