37

I want to hash a string of length up-to 30. What will be the best idea to do that if time is my concern. The function will be called over 100 million times. currently I am using the following code,

static UInt64 CalculateHash(string read, bool lowTolerance)
{
    UInt64 hashedValue = 0;
    int i = 0;
    while (i < read.Length)
    {
        hashedValue += read.ElementAt(i) * (UInt64)Math.Pow(31, i);
        if (lowTolerance) i += 2;
        else i++;
    }
    return hashedValue;
}
Ajay
  • 18,086
  • 12
  • 59
  • 105
P basak
  • 4,874
  • 11
  • 40
  • 63
  • 7
    Is there a reason that the `Object.GetHashCode()` method won't work for you? It seems like you're pretty much reimplementing the same concept. – Ken Wayne VanderLinde Mar 03 '12 at 11:20
  • 3
    Anything that doesn't use *floating point math* will be faster. – David Schwartz Mar 03 '12 at 11:21
  • 1
    GetHashCode is not persistable, so if he needs to store the Hash Code into a database, it's not useful. Then again, neither is this. What is your usage? Do you only need to hash the string at runtime, or what do you need to do with the Hash? Adler-32 might be an option if you need to store it and don't run into too many collisions. – Michael Stum Mar 03 '12 at 11:26
  • I just need the data at run time. But i will be calling this function over 100 million times. – P basak Mar 03 '12 at 11:31
  • 1
    I still don't understand what's wrong with the standard string hash-code functinn. If you need to persist it, calculating 100,000,000 hashes is going to take a fraction of the time it takes to actually store them to a database. – zmbq Mar 03 '12 at 11:44
  • @zmbq the default hash function is not allowing to hash strings over 15. – P basak Mar 03 '12 at 12:23
  • @Pbasak "the default hash function is not allowing to hash strings over 15." wat? Do you have any proof for that far fetched claim? – CodesInChaos Mar 03 '12 at 12:35
  • i mean when i use strings with length over 14 the hash function is returning negative values. – P basak Mar 03 '12 at 12:58
  • 2
    @Pbasak Then cast it to `uint` or mask it with `0x7FFFFF`. – CodesInChaos Mar 03 '12 at 13:36
  • 25
    **Run a profiler**. That will tell you what the slow part is. Then **fix the slow part**. – Eric Lippert Mar 03 '12 at 15:24
  • 1
    What's the problem with having negative numbers? It's really hard to understand what you actually need to do. – Michael Stum Mar 05 '12 at 17:27
  • 1
    Great analysis on the subject: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed – tozevv Sep 13 '13 at 14:42
  • I don't think this is a good function. Starting from `i == 13` the result of `(UInt64)Math.Pow(31, i)` will be always 0 – the_joric Mar 03 '12 at 11:24

3 Answers3

47
static UInt64 CalculateHash(string read)
{
    UInt64 hashedValue = 3074457345618258791ul;
    for(int i=0; i<read.Length; i++)
    {
        hashedValue += read[i];
        hashedValue *= 3074457345618258799ul;
    }
    return hashedValue;
}

This is a Knuth hash. You can also use Jenkins.

CodesInChaos
  • 106,488
  • 23
  • 218
  • 262
David Schwartz
  • 179,497
  • 17
  • 214
  • 278
  • 2
    According to my own test, this function does not achieve avalanche. YMMV. – Fantius May 28 '12 at 09:58
  • @Fantius: Can you try using `11400714819306691477ul` instead, please. (For both values.) – David Schwartz May 28 '12 at 10:13
  • 3
    It's worse. But I should quantify my original statement. Toggling a single bit on the input results in about 49.40% of the output bits toggling (using your original constant), which is MUCH better than Bernstein-based functions. That's probably good enough for most uses. But, for instance, SuperFastHash (http://landman-code.blogspot.com/2009/02/c-superfasthash-and-murmurhash2.html) is giving me 50.02%. And Murmur2 on the same page is giving me 50.04%. – Fantius May 28 '12 at 11:31
  • 5
    It's not intended for applications where you care about that. It's just intended to be used to distribute strings in a hash table. – David Schwartz May 28 '12 at 11:34
  • 1
    Could you please provide citation for this algorithm? I searched TAOCP Vol 3 but can't find these constants you have. – Shital Shah Jun 10 '15 at 11:22
  • @ShitalShah I don't think Knuth gave the constants for a 64-bit version. – David Schwartz Jun 10 '15 at 11:32
  • Is it from TAOCP Vol 3? Any reference to anything else where this algorithm came from? – Shital Shah Jun 10 '15 at 11:35
  • 1
    @ShitalShah I'm pretty sure it's from TAOCP, but I'm not sure which volume. – David Schwartz Jun 10 '15 at 19:06
8

First of all, consider using GetHashCode().

A simple improvement on your existing implementation:

static UInt64 CalculateHash(string read, bool lowTolerance)
{
    UInt64 hashedValue = 0;
    int i = 0;
    ulong multiplier = 1;
    while (i < read.Length)
    {
        hashedValue += read[i] * multiplier;
        multiplier *= 37;
        if (lowTolerance) i += 2;
        else i++;
    }
    return hashedValue;
}

It avoids the expensive floating point calculation, and the overhead of ElementAt.

Btw (UInt64)Math.Pow(31, i) doesn't work well for longer strings. Floating point rounding will lead to a multiplier of 0 for characters beyond 15 or so.

CodesInChaos
  • 106,488
  • 23
  • 218
  • 262
  • 1
    The multiplier must start at a prime value greater than 256 or this breaks horribly if the first byte is small. – David Schwartz Mar 03 '12 at 11:30
  • @DavidSchwartz A larger prime is certainly better, but breaking horribly is a bit of an overstatement. – CodesInChaos Mar 03 '12 at 11:33
  • 1
    If a 64-bit hash function has numerous 2-byte inputs that collide, IMO it breaks horribly. (But given how awful the function the OP started with, maybe my standards are too high.) – David Schwartz Mar 03 '12 at 11:35
  • Even with a prime >256 but <65536 there will be two char collisions. C# works on UTF-16 codepoints, not single byte chars. – CodesInChaos Mar 03 '12 at 11:38
  • 1
    Just a warning for everyone who's on .NET Core: in .NET Core `GetHashCode` is randmized between app restarts! Which means you get a different hash for the same string every time the app is restarted/recycled – Alex from Jitbit Dec 10 '21 at 09:45
2

To speed up your implementation, the (UInt64)Math.Pow(31, i) call should be replaced by a lookup: pre-calculate a table of the first 30 powers of 31, and use it at runtime. Since the limit on length is 30, you need only 31 element:

private static unsigned long[] Pow31 = new unsigned long[31];

static HashCalc() {
    Pow31[0] = 1;
    for (int i = 1 ; i != Pow31.Length ; i++) {
        Pow31[i] = 31*Pow31[i-1];
    }
}

// In your hash function...
hashedValue += read.ElementAt(i) * Pow31[i];
Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523