
I've been doing some performance analysis on the software I develop, and I've found that lookups in a global dictionary of URLs take about 10% of the application's "load" phase time. The dictionary is implemented as a C++ STL std::map, which has O(lg n) lookups. I'm going to move it to a hash_map, which has roughly constant-time lookups. The STL string class doesn't have a hash code property, and it certainly doesn't cache a hash code. That means that each lookup requires re-generating the hash code.

I'm skeptical that caching the hash code is worth the effort. It would mean changing many lines of code to use a new string class with a cached hash code property. Given that the current implementation does O(log n) full string comparisons on every lookup, I think reducing that to essentially one string traversal (by the hash function) per lookup is a big win.
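Roughly, this is the change I have in mind (illustrative types and names):

#include <map>
#include <string>
#include <tr1/unordered_map>  // or Boost's unordered_map, depending on toolchain

struct UrlInfo;  // whatever the dictionary maps to (illustrative)

// Current: each lookup costs O(log n) full string comparisons.
std::map<std::string, UrlInfo*> urlDictionary;

// Proposed: a hash table with roughly constant-time lookups; TR1's
// unordered_map stands in here for hash_map (vendor headers vary).
std::tr1::unordered_map<std::string, UrlInfo*> urlHashDictionary;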

Does anyone have experience with caching string hash codes? Has it ever proven worth the effort?

David Gladfelter
  • Hashing takes a very small amount of time. How do you intend to keep these string hashes cached? I mean, if you're keeping a string around which already has the hash, why not just keep the object associated with that hash instead? – GManNickG Feb 03 '10 at 19:58
  • Can't you wrap your strings in an auxiliary object which keeps the hash? – Skurmedel Feb 03 '10 at 20:01
  • Also, don't use `hash_map`; that's an old extension. Use `unordered_map` instead, either in TR1 or Boost. – GManNickG Feb 03 '10 at 20:02
  • Caching hashes can only be safe for immutable objects, which strings aren't. Other than that, it's a major complication, since you'd have to store a tuple that combines the hash with the hashed value. I wouldn't recommend it unless you're hashing something really big *and* you've benchmarked the code under realistic conditions and found the difference to be important. – Steven Sudit Feb 03 '10 at 20:05
  • @Skurmedel: If I were to cache the hash code it would be by deriving from the standard string class and adding the necessary functionality, but all clients that generate the strings would then have to use the new string class. That's a lot of code to change, which is why I'm reluctant to cache. @GMan, thanks, I'll take a look at unordered_map. – David Gladfelter Feb 03 '10 at 20:06
  • @David: Deriving from STL classes is a bad idea; they aren't made for that. – GManNickG Feb 03 '10 at 20:15
  • Quick question: We're talking about O(lg n), but what's the `n` in this case? – Steven Sudit Feb 03 '10 at 20:15
  • @David Gladfelter: Without recommending caching the hashes, I'd think that the right way would be to create a tuple that holds a regular (const) string and a hash value, such that it returns that cached hash value. The caller would look it up by hash and then dereference the string. – Steven Sudit Feb 03 '10 at 20:17
  • Which was my suggestion from the start ;)... but I had a pair in mind. – Skurmedel Feb 03 '10 at 20:19
  • @Skurkmedel: Yes, and your answer has priority. Looks like that solution would involve a `std::pair`, with a custom `Hash` evaluator that takes the inner pair and returns its key. – Steven Sudit Feb 03 '10 at 20:34
  • @Steven: For the purposes of a hash, the key had better not change even if the object is mutable -- otherwise you'd never find it again. – Joel Feb 03 '10 at 21:01
  • @Joel: To clarify, the key has to be immutable, so that a deterministic hash of it never changes. – Steven Sudit Feb 03 '10 at 21:53
  • @Steven Sudit: No problem. You seem far more versed in these matters than me anyhow :) (interesting trivia, your typo of my name turned it into another Swedish word :)) – Skurmedel Feb 03 '10 at 22:10

5 Answers

3

One word of warning.

While a hash map can have constant-time lookups, it can also end up with O(N) lookups. While that's not a common case, it does happen.

So while with a map you always have to pay the O(log N) lookup time, you are also guaranteed that it will not be worse.
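For example, here's a contrived worst case (purely illustrative): a hasher that sends every key to the same bucket, so the table degenerates into one long chain and each lookup becomes a linear scan.

#include <cstddef>
#include <string>
#include <tr1/unordered_map>

// Contrived pathological hasher: every key collides.
struct terrible_hash {
    std::size_t operator()(const std::string&) const { return 0; }
};

// All entries land in a single bucket, so find() walks a list: O(N), not O(1).
std::tr1::unordered_map<std::string, int, terrible_hash> degenerate;

Real hash functions aren't this bad, but a skewed key distribution can push a table a long way in this direction.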

R Samuel Klatchko
  • On implementations written by monkeys smashing a keyboard. – GManNickG Feb 03 '10 at 20:01
  • It depends on the hashing function as well? A good hash function should leave few collisions. If I recall correctly the load factor could play a part as well, depending on the kind of hashmap. – Skurmedel Feb 03 '10 at 20:04
  • @GMan: I guess a lot of monkeys have found gainful employment as programmers and Shakespearean ghostwriters, then, because I've seen some pretty bad hashes. – Steven Sudit Feb 03 '10 at 20:07
  • @Skurmedel: Yes, a hash that does `return 0;` or some other crappy thing, or a hash map with a size of 1. Neither of those really exists in real code. Hash maps are almost always ~O(1). @Steven: Yuck; why are people making their own hashes? That's as foolish as thinking it's easy to make a good random number generator. :) – GManNickG Feb 03 '10 at 20:13
  • @GMan: malicious input designed to collide for your hash? ;-) – Steve Jessop Feb 03 '10 at 20:25
  • Unless you have a perfect hash function designed for your fixed set of inputs, it can happen. I got bit by this once about 10 years ago, and it's made me cautious about hash tables ever since. – R Samuel Klatchko Feb 03 '10 at 20:35
  • @GMan: Not so much making up their own as randomly picking a known one that happens to be terrible. There are plenty of hashes out there in real code that have generally poor distribution, or at least huge Achilles' heels when subjected to a certain range of inputs. So, yes, I generally agree that we should trust the hash algorithms in reliable libraries, since there are also plenty of algorithms that yield good results for almost any input. – Steven Sudit Feb 03 '10 at 20:41
  • @R Samuel Klatchko: if you're really concerned with the worst-case performance of a hash table, you can limit it to O(log N) instead of O(N) (use trees instead of lists for hash collisions). – Jerry Coffin Feb 03 '10 at 20:47
  • @Jerry: Yes, interesting, particularly if the tree structure helps avoid a lot of copying when it's time to expand the table. Although I'm not sure whether it can. – Steven Sudit Feb 03 '10 at 21:55
  • @Jerry - interesting idea. Do you know of any implementations of unordered_map that use that? – R Samuel Klatchko Feb 03 '10 at 22:08
  • @Steven Sudit: For the most part, when you use it you just don't expand the table -- degradation only starts to become noticeable if you overfill the table by a *huge* margin (>100000:1). @R Samuel: I've done an implementation (basically just a std::vector), but I don't know of anybody else having done it. – Jerry Coffin Feb 04 '10 at 04:32
  • @R Samuel: you can't implement `unordered_map` itself that way, because the key type of an `unordered_map` isn't necessarily orderable (no `operator<` or `std::less` specialization), so you can't necessarily have a map of them. You can do what Jerry did, which is define your own associative map which requires that the keys are hashable *and* ordered (and which takes the corresponding optional extra functor parameters). – Steve Jessop Feb 04 '10 at 23:27
3

I don't have experience with caching hash codes, but I've done some work recently converting std::map to std::tr1::unordered_map. Two thoughts come to mind. First, profile that relatively simple change by itself, because it sometimes makes things worse, depending on what your code is doing; it might give you enough speedup on its own before you try optimizing further. Second, what does your profiler say about the other 90% of your initialization time? Even if you optimized the global dictionary down to zero time, you would at most improve performance by 10%.
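Something like this quick harness (a sketch; it assumes you can replay the load-phase keys from a vector) will tell you whether the swap actually pays off:

#include <cstddef>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

// Replay the same key sequence against any map-like container and time it.
template <typename Map>
double time_lookups(const Map& m, const std::vector<std::string>& keys)
{
    std::clock_t start = std::clock();
    std::size_t hits = 0;
    for (std::size_t i = 0; i < keys.size(); ++i)
        hits += (m.find(keys[i]) != m.end()) ? 1 : 0;
    double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;
    std::cout << hits << " hits\n";  // keep the loop from being optimized away
    return seconds;
}

Run it over a std::map and a std::tr1::unordered_map holding the same data and compare.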

Michael Kristofik
  • Hi Kristo, thanks for the tips. To answer your question, I've already killed about 40% of the previous load time and I'm running out of low-hanging fruit. That 10% is relatively juicy and in reach. – David Gladfelter Feb 03 '10 at 20:14
3

When you compare the hash map to the map, also try a Trie, or a related data structure (whatever you can get off the shelf):

Trie implementation

Unfortunately you may then spend a lot of time worrying about cache-friendliness. In that respect a Trie is similar to the tree you already have, and a hash map will probably be better-behaved than a naively-allocated tree.
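For illustration, the bare bones of a trie look something like this (a sketch of the general idea, not the implementation linked above); lookups cost O(key length) regardless of how many keys are stored, but the fat nodes are exactly why cache behaviour needs watching:

#include <cstring>
#include <string>

// One node per character; children indexed directly by byte value.
struct TrieNode {
    TrieNode* children[256];
    bool is_key;  // true if some stored key ends at this node
    TrieNode() : is_key(false) { std::memset(children, 0, sizeof children); }
};

bool trie_contains(const TrieNode* root, const std::string& key)
{
    const TrieNode* node = root;
    for (std::string::size_type i = 0; i < key.size(); ++i) {
        node = node->children[static_cast<unsigned char>(key[i])];
        if (!node)
            return false;  // no stored key has this prefix
    }
    return node->is_key;
}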

Also, I'm a little confused by the question. If you're looking up the same string object multiple times, such that caching its hash value is worthwhile, shouldn't you just be caching the result of the lookup? The whole point of a hash table is that different objects which are value-equal hash to the same value. If you aren't computing the same hash several times from distinct strings containing the same characters, then your hash table probably isn't doing its job.

If you mean caching the values of the keys already in the hash table, that's up to the hash table.

Steve Jessop
2

You'll of course need to profile to check your results. Change to a hash map, and then see where most of your time is spent. Unless you're hashing keys left and right, I doubt most of your time will be spent there. Hashing is intended to be a fast operation, otherwise a hash map would have no advantages over an ordered container.

The compiler itself will know if a string hasn't been changed, and can probably cache the result for you (within the same scope). That said, you don't want to inherit from std::string; STL classes weren't made for that.

Rather, make a std::pair and pass that around:

typedef std::pair<const std::string, const size_t> string_hash_pair;

You'd then need to overload the hash_value function for your type (going by Boost here, not TR1; I don't know how similar they are), in the same namespace as the pair is defined:

size_t hash_value(const string_hash_pair& pPair)
{
    return pPair.second; // don't actually hash
}

And that's it. Note that in the pair, both string and size_t are immutable. This is because if the string changes, your hash is wrong. So we make it const, and we may as well make the hash const too.

You'll want a helper function:

string_hash_pair make_string_hash(const std::string& pStr)
{
    // compute the hash exactly once, when the pair is created
    return std::make_pair(pStr, boost::hash_value(pStr));
}

Now if you're going to be using a string for look-ups, just make a pair out of it and you get constant-time hashing.
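Usage would look something like this (I'm passing the hasher explicitly here, which sidesteps the question of exactly where the hash_value overload needs to live):

#include <cstddef>
#include <boost/unordered_set.hpp>

// Hasher that just hands back the cached value; no string traversal.
struct cached_hash {
    std::size_t operator()(const string_hash_pair& p) const { return p.second; }
};

void example()
{
    boost::unordered_set<string_hash_pair, cached_hash> urls;

    // The hash is computed once, inside make_string_hash; insertion and
    // look-up both reuse it. Equality falls back to std::pair's operator==.
    urls.insert(make_string_hash("http://example.com/"));
    bool found = urls.count(make_string_hash("http://example.com/")) > 0;
    (void)found;
}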

That said, I really doubt this much work is necessary. Hashing functions really are trivial, usually. Also, don't make your own. Use a pre-existing tried-and-tested hash; it's quite easy to make a crappy hash.

GManNickG
  • If I understand correctly, the hash of the key is calculated exactly once during lookup, no matter how big the table is, so caching it shouldn't help much. The hash is calculated during insertion, but each item is only inserted once. – Steven Sudit Feb 03 '10 at 20:37
  • @Steven: It's calculated once per look-up. But if you look up with the same key multiple times, you may calculate the hash multiple times. I think he wants to avoid that. – GManNickG Feb 03 '10 at 20:43
  • That's true. I can't tell from the OP whether the same key will be looked up more than once, but if so then that would justify it more. Of course, there might also be ways to reorganize the code so that values are not looked up any more than absolutely necessary. – Steven Sudit Feb 03 '10 at 21:54
1

I did some comparisons of a set and an unordered set with 4k-64k strings in my dictionary.

I found that std::set and unordered_set had about the same runtime in my situation, because the hash_value calculation took about 80% of the run time for the unordered set.

The hashing cost dwarfed the lookup savings (I used boost::hash_value for std::string, FWIW).

YMMV; for general cases I'd say profile, and don't be fooled by theoretical scalings that don't account for CPU architecture, etc. A hash map can run slower due to the hashing cost, and it will consume more memory.

My use case is that I store information for a long time, and it receives regular updates that don't change the information_id hash but may change other content.

Every update is then passed to my lookup function to decide if I need to notify externally for this update.

The list of information_ids to notify is in this lookup and can change independently of the information.

By caching the hash for the information_id, it is likely to be reused tens of times during the lifetime of the information.

My two-line change to cache the hash improved unordered_set's runtime by more than 8x.
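In spirit, the change looks something like this (an illustrative sketch, not my actual code):

#include <cstddef>
#include <string>
#include <boost/functional/hash.hpp>
#include <boost/unordered_set.hpp>

// Illustrative element: the id plus its hash, computed once at construction.
struct InfoKey {
    std::string id;
    std::size_t id_hash;
    explicit InfoKey(const std::string& s)
        : id(s), id_hash(boost::hash_value(s)) {}
};

struct InfoKeyHash {
    std::size_t operator()(const InfoKey& k) const { return k.id_hash; }
};

struct InfoKeyEqual {
    bool operator()(const InfoKey& a, const InfoKey& b) const
    { return a.id == b.id; }
};

typedef boost::unordered_set<InfoKey, InfoKeyHash, InfoKeyEqual> InfoSet;

Every subsequent lookup of the same InfoKey reuses id_hash instead of re-walking the string.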

Test setup: benchmarked on MSVC 2012 Update 4; 1M entries looked up 10 times each against a 4k and a 64k dictionary. All but 10 checks are misses in the 4k case; 500 hits for 64k (more aardvarks :)).

MSVC 2012 Update 4 (4k / 64k dictionary):

  • set: 1373 ms / 1938 ms
  • multiset: 1376 ms / 1913 ms
  • unordered_set, initial 64k buckets / 0.5 load factor: 168 ms / 362 ms
  • unordered_set, 4k buckets / 1.0 load factor: 331 ms / 452 ms
  • unordered_set, 64k / 0.5, before hash caching: 1519 ms / 1881 ms

FWIW, the same tests run against MinGW 4.9.1 -O3:

  • set: 2003 ms / 2490 ms
  • multiset: 1978 ms / 2306 ms
  • unordered_set, initial 64k buckets / 0.5 load factor: 140 ms / 605 ms
  • unordered_set, 4k buckets / 1.0 load factor: 318 ms / 683 ms
  • unordered_set, 64k / 0.5, before hash caching: 1619 ms / 2455 ms

Paul