0

The history why is long, but the problem is simple. Having 3 strings I need to cache the matching value. To have a fast cache I use the following code:

public int keygen(string a, string b, string c)
    {
        var x = a + "@@" + b + "@@" + c;
        var hash = x.GetHashCode();
        return hash;
    }

(Note that string a,b,c does not contain the code "@@") The cache it self is just a Dictionary<int, object>

I know there is a risk that the hash key might be non unique, but except this:

Does anyone know a faster way to make an int key? (in C#) This operation takes ~15% of total CPU time and this is a long running app.

I have tried a couple of implementations but failed to find any faster.

D Stanley
  • 149,601
  • 11
  • 178
  • 240
PartySvensken
  • 51
  • 1
  • 1
  • 4
  • Have you tried `String.Concat(`, `String.Format(` and `StringBuilder` yet? – Scott Chamberlain Oct 08 '13 at 14:25
  • 1
    Are you trying to produce a unique integer key? Hashing doesn't necessarily do this, even though it is most likely unique. – souldzin Oct 08 '13 at 14:27
  • I'm afraid I wouldn't know of any way of boosting GetHashCode's performance, but I think it's possible you're going for some type of premature optimization here. Logic would dictate comparing two strings to be a time-consuming operation (hence you'd use hashes) but I think C# automatically does something like what you're attempting - so comparing the entire text of your question, to itself, should take the same time as comparing "hi" to "hi". I'd like confirmation of this from another poster or two though... – Katana314 Oct 08 '13 at 14:28
  • What sorts of lengths can we expect to see for the three strings? – Lukazoid Oct 08 '13 at 14:46
  • Hashing *speed* has an inherent trade-off with hashing *quality*. The hash `hash = (int)a[0];` is very fast, but also of very low quality. – RBarryYoung Oct 08 '13 at 15:36
  • Scott, o yeah tried them all. – PartySvensken Oct 09 '13 at 08:00

5 Answers5

4

You should use a Dictionary<Tuple<string,string,string>, object>. Then you don't have to worry about non-uniqueness, since the Dictionary will take care of it for you.

Random832
  • 37,415
  • 3
  • 44
  • 63
  • This is the only solution that is not vulnerable to collisions. Compared to using hashkey of string as key in a `Dictionary` – Magnus Oct 08 '13 at 14:46
  • @TimSchmelter There can still be hash collisions using this implementation but they are handled (with equals check) internally in the dictionary. – Magnus Oct 08 '13 at 14:52
  • @Magnus: I've deleted that comment because you're right. It uses the same algorithm as in J.Skeets famous `GetHashCode` implementation. So collisions are possible (although unlikely). – Tim Schmelter Oct 08 '13 at 14:56
  • @TimSchmelter Did a test with a collection of 10.000 random anonymous objects and got collations on `new { a = 1228545931, b = 694246649, c = 1459763042 }.GetHashCode()` and `new { a = 429549083, b = 16955719, c = 1653772628 }.GetHashCode()` for example. So not totally unlikely. – Magnus Oct 08 '13 at 15:18
  • @Magnus: I haven't said that it impossible but not very likely which is of course subjective. I have also checked it with a random collection but i needed 100000 random hashcodes to get a single collision. By the way, then i also got a collision with [`Tuple.GetHashCode`](http://msdn.microsoft.com/en-us/library/dd414894.aspx). – Tim Schmelter Oct 08 '13 at 15:40
  • Hi, thanks for the suggestions. I have tried to do diffrent version of the Dictionary but they all failed regards to performance. – PartySvensken Oct 09 '13 at 07:36
3

A faster approach would be to compute the hash of each string separately, then combine them using a hash function. This will eliminate the string concatenation which could be taking time.

e.g.

public int KeyGen(string a, string b, string c)
{
    var aHash = a.GetHashCode();
    var bHash = b.GetHashCode();
    var cHash = c.GetHashCode();
    var hash = 36469;
    unchecked
    {
        hash = hash * 17 + aHash;
        hash = hash * 17 + bHash;
        hash = hash * 17 + cHash;
    }
    return hash;
}
Lukazoid
  • 19,016
  • 3
  • 62
  • 85
  • Why is this faster than computing the hash code of one string? – D Stanley Oct 08 '13 at 14:29
  • 2
    The cost is in concatenating the 3 strings together to form one string. This eliminates that concatenation – Lukazoid Oct 08 '13 at 14:30
  • 1
    I'm asking for _proof_ or _documentation_ that concatenating 3 strings and computing one hash code is faster that computing 3 hash codes and mathematically combining them. I don't know which way is faster, but I'd be shocked if the difference was significant. – D Stanley Oct 08 '13 at 14:31
  • @DStanley You are right, the performance difference is going to be largely dependent upon the sizes of the three strings. I will try and find some time to profile this and the original method. – Lukazoid Oct 08 '13 at 14:44
3

Instead of concatenating strings (which creates new strings) you could use XOR or even better simple maths (credits to J.Skeet):

public int keygen(string a, string b, string c)
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        hash = hash * 23 + a == null ? 0 : a.GetHashCode();
        hash = hash * 23 + b == null ? 0 : b.GetHashCode();
        hash = hash * 23 + c == null ? 0 : c.GetHashCode();
        return hash;
    }
}

In general it's not necessary to produce unique hashs. But you should minimize collisions.

Another(not as efficient) way is to use an anonymous type which has a builtin support for GetHashCode:

public int keygen(string a, string b, string c)
{
    return new { a, b, c }.GetHashCode();
}

Note that the name, type and order matters for the calculation of the hashcode of an anonymous type.

Community
  • 1
  • 1
Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
  • What is special about the `GetHashCode` implementation of an anonymous type that it is able to avoid collisions? – Lukazoid Oct 08 '13 at 14:52
  • @Lukazoid It doesn't avoid collisions entirely, it's just a type that has a sensibly written `GetHashCode` method that is based on the hashes of the underlying property values. It's likely to be more evenly distributed than a more naive implementation that you would write, and doesn't have as large of risks of making a mistake in the math that causes problems. – Servy Oct 08 '13 at 14:54
  • @Lukazoid: I've deleted that part since i guess that it also can produce collisions (f.e. if it overflows) since it uses the same algorithm as above. – Tim Schmelter Oct 08 '13 at 14:54
  • 2
    @TimSchmelter It doesn't use *exactly* the same algorithm. It uses larger prime numbers and as such will be more likely to be evenly distributed over the range of an int. Of course there are obviously more possible values than there are integers, so avoiding collisions entirely is clearly impossible, they can only be minimized. – Servy Oct 08 '13 at 14:59
  • I think your return value on the second example should read `return new[] { a, b, c }.GetHashCode();`. Without the braces (`[]`), I get "Invalid anonymous type member declarator. Anonymous type members must be declared with a member assignment, simple name or member access." – Drew Chapin Jun 19 '17 at 20:03
  • @DrewChapin what.Net framework version are you using? You ate declaring am array. That doesn't work. Maybe you need to use...a=a,b=b,c=c... – Tim Schmelter Jun 20 '17 at 07:08
  • @TimSchmelter, .NET Framework 4, Visual Studio 2010 SP1 – Drew Chapin Jun 22 '17 at 17:43
1

I know there is a risk that the hash key might be non unique

Hash key's don't have to be unique - they just work better if collisions are minimized.

That said, 15% of your time spent computing a string's hash code seems VERY high. Even switching to string.Concat() (which the compiler may do for you anyways) or StringBuilder shouldn't make that much difference. I'd suggest triple-checking your measurements.

D Stanley
  • 149,601
  • 11
  • 178
  • 240
  • Well I do understand your comment, but sometimes u just have to port code. I am forced to make multiple lookups from code into files, and the cache simply cache values from the files. – PartySvensken Oct 09 '13 at 07:37
1

I’d guess most of the time of this function is spent on building the concatenated string, only to call GetHashCode on it. I would try something like

public int keygen(string a, string b, string c)
{
    return a.GetHashCode() ^ b.GetHashCode() ^ c.GetHashCode();
}

Or possibly use something more complicated than a simple XOR. However, be aware that GetHashCode is not a cryptographic hash function! It is a hash function used for hash tables, not for cryptography, and you should definitely not use it for anything security-related like keys (as your keygen name hints).

Mormegil
  • 7,955
  • 4
  • 42
  • 77
  • He mentioned he is using it as a key for a dictionary (which has its own problems), not as a cryptographic key. – Random832 Oct 08 '13 at 14:30
  • Hi I did try your code with a small modify. Since the order is of importance menaing (a,b,c) != (c,b,a) I changed the code with adding a random fixed number to each XOR term. – PartySvensken Oct 09 '13 at 07:38
  • public int keygen(string a, string b, string c) { var x = (0xFC12341 + a.GetHashCode()) ^ (0xFC1252 + b.GetHashCode()) ^ (0xEA1454 + c.GetHashCode()); return x; } – PartySvensken Oct 09 '13 at 07:39
  • This version did pass a long test with the same result as the old one I had. It is around 3-4 times faster. And VS does not highlight the string concat anymore :) So big THANKS! – PartySvensken Oct 09 '13 at 07:40
  • There was an error in this. After some longer tests the new keygen failed. Testing with a="mort_basis_tbl" b = "makeham_a" c = "3" it gives the same key as a="mort_basis_tbl" b = "makeham_b" c = "0". So I am back with the orginal formula. – PartySvensken Oct 10 '13 at 06:09