Fastest way to make a hashkey of multiple strings

Question

The history why is long, but the problem is simple. Having 3 strings I need to cache the matching value. To have a fast cache I use the following code:

public int keygen(string a, string b, string c)
    {
        var x = a + "@@" + b + "@@" + c;
        var hash = x.GetHashCode();
        return hash;
    }

(Note that string a,b,c does not contain the code "@@") The cache it self is just a Dictionary<int, object>

I know there is a risk that the hash key might be non unique, but except this:

Does anyone know a faster way to make an int key? (in C#) This operation takes ~15% of total CPU time and this is a long running app.

I have tried a couple of implementations but failed to find any faster.

Have you tried `String.Concat(`, `String.Format(` and `StringBuilder` yet? — Scott Chamberlain, Oct 08 '13 at 14:25
Are you trying to produce a unique integer key? Hashing doesn't necessarily do this, even though it is most likely unique. — souldzin, Oct 08 '13 at 14:27
I'm afraid I wouldn't know of any way of boosting GetHashCode's performance, but I think it's possible you're going for some type of premature optimization here. Logic would dictate comparing two strings to be a time-consuming operation (hence you'd use hashes) but I think C# automatically does something like what you're attempting - so comparing the entire text of your question, to itself, should take the same time as comparing "hi" to "hi". I'd like confirmation of this from another poster or two though... — Katana314, Oct 08 '13 at 14:28
What sorts of lengths can we expect to see for the three strings? — Lukazoid, Oct 08 '13 at 14:46
Hashing *speed* has an inherent trade-off with hashing *quality*. The hash `hash = (int)a[0];` is very fast, but also of very low quality. — RBarryYoung, Oct 08 '13 at 15:36

score 4 · Answer 1 · answered Oct 08 '13 at 14:29

4

You should use a Dictionary<Tuple<string,string,string>, object>. Then you don't have to worry about non-uniqueness, since the Dictionary will take care of it for you.

answered Oct 08 '13 at 14:29

Random832

37,415
3
44
63

This is the only solution that is not vulnerable to collisions. Compared to using hashkey of string as key in a `Dictionary` – Magnus Oct 08 '13 at 14:46
@TimSchmelter There can still be hash collisions using this implementation but they are handled (with equals check) internally in the dictionary. – Magnus Oct 08 '13 at 14:52
@Magnus: I've deleted that comment because you're right. It uses the same algorithm as in J.Skeets famous `GetHashCode` implementation. So collisions are possible (although unlikely). – Tim Schmelter Oct 08 '13 at 14:56
@TimSchmelter Did a test with a collection of 10.000 random anonymous objects and got collations on `new { a = 1228545931, b = 694246649, c = 1459763042 }.GetHashCode()` and `new { a = 429549083, b = 16955719, c = 1653772628 }.GetHashCode()` for example. So not totally unlikely. – Magnus Oct 08 '13 at 15:18
@Magnus: I haven't said that it impossible but not very likely which is of course subjective. I have also checked it with a random collection but i needed 100000 random hashcodes to get a single collision. By the way, then i also got a collision with [`Tuple.GetHashCode`](http://msdn.microsoft.com/en-us/library/dd414894.aspx). – Tim Schmelter Oct 08 '13 at 15:40
Hi, thanks for the suggestions. I have tried to do diffrent version of the Dictionary but they all failed regards to performance. – PartySvensken Oct 09 '13 at 07:36

score 3 · Answer 2 · answered Oct 08 '13 at 14:25

3

A faster approach would be to compute the hash of each string separately, then combine them using a hash function. This will eliminate the string concatenation which could be taking time.

e.g.

public int KeyGen(string a, string b, string c)
{
    var aHash = a.GetHashCode();
    var bHash = b.GetHashCode();
    var cHash = c.GetHashCode();
    var hash = 36469;
    unchecked
    {
        hash = hash * 17 + aHash;
        hash = hash * 17 + bHash;
        hash = hash * 17 + cHash;
    }
    return hash;
}

answered Oct 08 '13 at 14:25

Lukazoid

19,016
3
62
85

Why is this faster than computing the hash code of one string? – D Stanley Oct 08 '13 at 14:29
2

The cost is in concatenating the 3 strings together to form one string. This eliminates that concatenation – Lukazoid Oct 08 '13 at 14:30
1

I'm asking for _proof_ or _documentation_ that concatenating 3 strings and computing one hash code is faster that computing 3 hash codes and mathematically combining them. I don't know which way is faster, but I'd be shocked if the difference was significant. – D Stanley Oct 08 '13 at 14:31
@DStanley You are right, the performance difference is going to be largely dependent upon the sizes of the three strings. I will try and find some time to profile this and the original method. – Lukazoid Oct 08 '13 at 14:44

score 3 · Accepted Answer · edited May 23 '17 at 11:50

3

Instead of concatenating strings (which creates new strings) you could use XOR or even better simple maths (credits to J.Skeet):

public int keygen(string a, string b, string c)
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        hash = hash * 23 + a == null ? 0 : a.GetHashCode();
        hash = hash * 23 + b == null ? 0 : b.GetHashCode();
        hash = hash * 23 + c == null ? 0 : c.GetHashCode();
        return hash;
    }
}

In general it's not necessary to produce unique hashs. But you should minimize collisions.

Another(not as efficient) way is to use an anonymous type which has a builtin support for GetHashCode:

public int keygen(string a, string b, string c)
{
    return new { a, b, c }.GetHashCode();
}

Note that the name, type and order matters for the calculation of the hashcode of an anonymous type.

edited May 23 '17 at 11:50

Community

1
1

answered Oct 08 '13 at 14:29

Tim Schmelter

450,073
74
686
939

What is special about the `GetHashCode` implementation of an anonymous type that it is able to avoid collisions? – Lukazoid Oct 08 '13 at 14:52
@Lukazoid It doesn't avoid collisions entirely, it's just a type that has a sensibly written `GetHashCode` method that is based on the hashes of the underlying property values. It's likely to be more evenly distributed than a more naive implementation that you would write, and doesn't have as large of risks of making a mistake in the math that causes problems. – Servy Oct 08 '13 at 14:54
@Lukazoid: I've deleted that part since i guess that it also can produce collisions (f.e. if it overflows) since it uses the same algorithm as above. – Tim Schmelter Oct 08 '13 at 14:54
2

@TimSchmelter It doesn't use *exactly* the same algorithm. It uses larger prime numbers and as such will be more likely to be evenly distributed over the range of an int. Of course there are obviously more possible values than there are integers, so avoiding collisions entirely is clearly impossible, they can only be minimized. – Servy Oct 08 '13 at 14:59
I think your return value on the second example should read `return new[] { a, b, c }.GetHashCode();`. Without the braces (`[]`), I get "Invalid anonymous type member declarator. Anonymous type members must be declared with a member assignment, simple name or member access." – Drew Chapin Jun 19 '17 at 20:03
@DrewChapin what.Net framework version are you using? You ate declaring am array. That doesn't work. Maybe you need to use...a=a,b=b,c=c... – Tim Schmelter Jun 20 '17 at 07:08
@TimSchmelter, .NET Framework 4, Visual Studio 2010 SP1 – Drew Chapin Jun 22 '17 at 17:43

score 1 · Answer 4 · answered Oct 08 '13 at 14:25

1

I know there is a risk that the hash key might be non unique

Hash key's don't have to be unique - they just work better if collisions are minimized.

That said, 15% of your time spent computing a string's hash code seems VERY high. Even switching to string.Concat() (which the compiler may do for you anyways) or StringBuilder shouldn't make that much difference. I'd suggest triple-checking your measurements.

answered Oct 08 '13 at 14:25

D Stanley

149,601
11
178
240

Well I do understand your comment, but sometimes u just have to port code. I am forced to make multiple lookups from code into files, and the cache simply cache values from the files. – PartySvensken Oct 09 '13 at 07:37

score 1 · Answer 5 · answered Oct 08 '13 at 14:30

1

I’d guess most of the time of this function is spent on building the concatenated string, only to call GetHashCode on it. I would try something like

public int keygen(string a, string b, string c)
{
    return a.GetHashCode() ^ b.GetHashCode() ^ c.GetHashCode();
}

Or possibly use something more complicated than a simple XOR. However, be aware that GetHashCode is not a cryptographic hash function! It is a hash function used for hash tables, not for cryptography, and you should definitely not use it for anything security-related like keys (as your keygen name hints).

answered Oct 08 '13 at 14:30

Mormegil

7,955
4
42
77

He mentioned he is using it as a key for a dictionary (which has its own problems), not as a cryptographic key. – Random832 Oct 08 '13 at 14:30
Hi I did try your code with a small modify. Since the order is of importance menaing (a,b,c) != (c,b,a) I changed the code with adding a random fixed number to each XOR term. – PartySvensken Oct 09 '13 at 07:38
public int keygen(string a, string b, string c) { var x = (0xFC12341 + a.GetHashCode()) ^ (0xFC1252 + b.GetHashCode()) ^ (0xEA1454 + c.GetHashCode()); return x; } – PartySvensken Oct 09 '13 at 07:39
This version did pass a long test with the same result as the old one I had. It is around 3-4 times faster. And VS does not highlight the string concat anymore :) So big THANKS! – PartySvensken Oct 09 '13 at 07:40
There was an error in this. After some longer tests the new keygen failed. Testing with a="mort_basis_tbl" b = "makeham_a" c = "3" it gives the same key as a="mort_basis_tbl" b = "makeham_b" c = "0". So I am back with the orginal formula. – PartySvensken Oct 10 '13 at 06:09

Fastest way to make a hashkey of multiple strings

5 Answers5