
Can anyone tell me what the mathematical logic is behind this simple hash function?

    #define HASHSIZE 101
    unsigned hash_function(char *s) {
        unsigned hashval;
        for (hashval = 0; *s != '\0'; s++)
            // THIS NEXT LINE HOW DOES IT WORK WHY 31
            hashval = *s + 31 * hashval;
        return hashval % HASHSIZE;
    }

Here, I am not asking about pointers or programming in general. I am just asking how the statement below works.

    hashval = *s + 31 * hashval

1 Answer


Say you have a value x with bits...

            b7 b6 b5 b4 b3 b2 b1 b0

When you multiply by 31, you're effectively adding together the original value and copies of it shifted left by one bit (which has the same effect as multiplying by two, much as appending a trailing zero in decimal multiplies by ten), by two bits (i.e. 4x), by three (8x), and by four (16x):

            b7 b6 b5 b4 b3 b2 b1 b0 +     // 1x
         b7 b6 b5 b4 b3 b2 b1 b0 0  +     // + 2x = 3x
      b7 b6 b5 b4 b3 b2 b1 b0 0  0  +     // + 4x = 7x
   b7 b6 b5 b4 b3 b2 b1 b0 0  0  0  +     // + 8x = 15x
b7 b6 b5 b4 b3 b2 b1 b0 0  0  0  0  +     // + 16x = 31x

Many of the individual output bits are affected by the values of many input bits, both directly and via "carries" from less-significant columns; e.g. if b1 = 1 and b0 = 1, the second-least-significant output bit (the "10"s column) will be 0, carrying 1 over to the "100"s column to its left.

This property of having output bits affected by many input bits helps "mix" the output hash value, improving quality.
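To see that concretely, here's a minimal sketch (separate from the test code later in this answer) checking that 31 * x really is the sum of those five shifted copies, and showing how the question's hashval = *s + 31 * hashval recurrence unrolls into per-character powers of 31:

    #include <initializer_list>
    #include <iostream>

    int main()
    {
        // 31*x is x shifted left by 4, 3, 2, 1 and 0 bits, all added together
        for (unsigned x : { 1u, 42u, 0xDEADBEEFu })
        {
            unsigned shifted = (x << 4) + (x << 3) + (x << 2) + (x << 1) + x;
            std::cout << 31 * x << " == " << shifted << '\n';   // equal, modulo 2^32 wrap-around
        }

        // the question's loop unrolls to a polynomial in 31:
        //   (('a'*31 + 'b')*31 + 'c')*31 + 'd'  ==  'a'*31^3 + 'b'*31^2 + 'c'*31 + 'd'
        unsigned rolling = 0;
        for (char c : { 'a', 'b', 'c', 'd' })
            rolling = c + 31 * rolling;
        unsigned unrolled = 'a' * 29791u + 'b' * 961u + 'c' * 31u + 'd';   // 29791 = 31^3, 961 = 31^2
        std::cout << rolling << " == " << unrolled << '\n';
    }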

Still, while it might be better than multiplying by, say, 17 (16+1) or 12 (8+4), since those only add in a couple of bit-shifted copies of the original value instead of five, it's a very weak hash compared to hash functions that do more than a single running multiplication. I'll illustrate that with some statistical analysis...

As a sampling of hashing quality, I hashed all combinations of four printable ASCII characters and looked at how many times the same hash values resulted. I picked four characters because that's the most that's doable in a reasonable timeframe (tens of seconds). The code for that is available here (not sure whether it will time out there - I've only run it locally) and at the bottom of this post.

It may help if I talk you through a line or two of the output below to explain the per-line format:

  • there were only 62 inputs (0.000076% of inputs) that hashed to a unique value
  • there were 62 other times that 2 inputs hashed to the same value (i.e. 124 inputs accounted for by this 2-way collision group, which is 0.000152%)
    • this and lower-collision groups cumulatively account for 0.000228% of inputs

Full output:

81450625 4-character printable ASCII combinations
#collisions 1 with #times 62            0.000076% 0.000076%
#collisions 2 with #times 62            0.000152% 0.000228%
#collisions 3 with #times 1686          0.006210% 0.006438%
#collisions 4 with #times 170           0.000835% 0.007273%
#collisions 5 with #times 62            0.000381% 0.007654%
#collisions 6 with #times 1686          0.012420% 0.020074%
#collisions 7 with #times 62            0.000533% 0.020606%
#collisions 8 with #times 170           0.001670% 0.022276%
#collisions 9 with #times 45534         0.503134% 0.525410%
#collisions 10 with #times 3252         0.039926% 0.565336%
#collisions 11 with #times 3310         0.044702% 0.610038%
#collisions 12 with #times 4590         0.067624% 0.677662%
#collisions 13 with #times 340          0.005427% 0.683089%
#collisions 14 with #times 456          0.007838% 0.690927%
#collisions 15 with #times 1566         0.028840% 0.719766%
#collisions 16 with #times 224          0.004400% 0.724166%
#collisions 17 with #times 124          0.002588% 0.726754%
#collisions 18 with #times 45422                1.003793% 1.730548%
#collisions 19 with #times 116          0.002706% 1.733254%
#collisions 20 with #times 3414         0.083830% 1.817084%
#collisions 21 with #times 1632         0.042077% 1.859161%
#collisions 22 with #times 3256         0.087945% 1.947106%
#collisions 23 with #times 58           0.001638% 1.948744%
#collisions 24 with #times 4702         0.138548% 2.087292%
#collisions 25 with #times 66           0.002026% 2.089317%
#collisions 26 with #times 286          0.009129% 2.098447%
#collisions 27 with #times 1969365              65.282317% 67.380763%
#collisions 28 with #times 498          0.017120% 67.397883%
#collisions 29 with #times 58           0.002065% 67.399948%
#collisions 30 with #times 284614               10.482940% 77.882888%
#collisions 31 with #times 5402         0.205599% 78.088487%
#collisions 32 with #times 108          0.004243% 78.092730%
#collisions 33 with #times 289884               11.744750% 89.837480%
#collisions 34 with #times 5344         0.223075% 90.060555%
#collisions 35 with #times 5344         0.229636% 90.290191%
#collisions 36 with #times 146792               6.487994% 96.778186%
#collisions 38 with #times 5344         0.249319% 97.027505%
#collisions 39 with #times 20364                0.975064% 98.002569%
#collisions 40 with #times 9940         0.488148% 98.490718%
#collisions 42 with #times 14532                0.749342% 99.240060%
#collisions 43 with #times 368          0.019428% 99.259488%
#collisions 44 with #times 10304                0.556627% 99.816114%
#collisions 45 with #times 368          0.020331% 99.836446%
#collisions 46 with #times 368          0.020783% 99.857229%
#collisions 47 with #times 736          0.042470% 99.899699%
#collisions 48 with #times 368          0.021687% 99.921386%
#collisions 49 with #times 368          0.022139% 99.943524%
#collisions 50 with #times 920          0.056476% 100.000000%

Overall observations:

  • 65.3% of inputs end up in 27-way collisions; 11.7% in 33-way; 10.5% in 30-way collisions; 6.5% in 36-way...
  • less than 2.1% of inputs avoided a 27-way or worse collision

This is despite the inputs amounting to only 1.9% of the hash space (81450625 inputs out of the 2^32 possible 32-bit hash values, i.e. 81450625 / 4294967296 ≈ 0.019). It's abysmal.

To show just how abysmal it is, let's compare it with putting the 4 printable ASCII characters into std::strings and hashing them with GCC's std::hash<std::string>, which from memory I believe uses a MURMUR32 hash:

81450625 4-character printable ASCII combinations
#collisions 1 with #times 79921222              98.122294% 98.122294%
#collisions 2 with #times 757434                1.859860% 99.982155%
#collisions 3 with #times 4809          0.017713% 99.999867%
#collisions 4 with #times 27            0.000133% 100.000000%

So - back to the question of why + 31 * previous - you'd have to compare it with other equally simple hashing functions to see whether it's better than average for the CPU effort spent generating the hash, which is vaguely possible despite it being so bad in an absolute sense. But given the extra cost of a massively better hash is pretty small, I'd recommend using one and forgetting "*31" completely.
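As an illustration of that recommendation (a sketch that isn't from the original answer, assuming C++11 and the question's HASHSIZE of 101), a drop-in replacement for the question's hash_function could simply delegate the mixing to std::hash:

    #include <functional>
    #include <string>

    #define HASHSIZE 101

    // sketch: let the standard library's hash do the mixing,
    // then reduce the result to the table size as in the question
    unsigned hash_function(const char *s) {
        return std::hash<std::string>()(s) % HASHSIZE;
    }

(Constructing a std::string per call has a cost; the point is only that much better mixing is cheap to get.)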

Code for the collision statistics above:

#include <map>
#include <iostream>
#include <iomanip>
#include <string>      // needed if you switch to the std::hash<std::string> variant
#include <functional>  // ditto (std::hash)

int main()
{
    std::map<unsigned, unsigned> histogram;   // hash value -> number of inputs producing it

    for (int i = ' '; i <= '~'; ++i)
        for (int j = ' '; j <= '~'; ++j)
            for (int k = ' '; k <= '~'; ++k)
                for (int l = ' '; l <= '~'; ++l)
                {
                    // same value the question's loop produces for the 4-char input i,j,k,l
                    unsigned hv = ((i * 31 + j) * 31 + k) * 31 + l;
                    /*
                    // use "31*" hash above OR std::hash<std::string> below...
                    char c[] = { (char)i, (char)j, (char)k, (char)l, '\0' };
                    unsigned hv = std::hash<std::string>()(c); */
                    ++histogram[hv];
                }
    std::map<unsigned, unsigned> histohisto;  // collision-group size -> number of such groups
    for (auto& hv_freq : histogram)
        ++histohisto[hv_freq.second];
    unsigned n = '~' - ' ' + 1; n *= n; n *= n;   // 95^4 printable-ASCII combinations
    std::cout << n << " 4-character printable ASCII combinations\n";
    double cumulative_percentage = 0;
    for (auto& freq_n : histohisto)
    {
        double percent = (double)freq_n.first * freq_n.second / n * 100;
        cumulative_percentage += percent;
        std::cout << "#collisions " << freq_n.first << " with #times " << freq_n.second
                  << "\t\t" << std::fixed << percent << "% " << cumulative_percentage << "%\n";
    }
}
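To reproduce the numbers above, compile and run it with something like the following (the filename is arbitrary; any C++11 compiler should work, GCC assumed here), swapping the commented block in if you want the std::hash<std::string> figures instead:

    g++ -std=c++11 -O2 hash_histogram.cpp -o hash_histogram && ./hash_histogram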
  • i will need some time to get my head around this stuff, i thought it is a simple question and has a simple answer but i guess i was wrong, thanks for taking the time for answering this. – Mohammed Talat Jan 09 '16 at 11:21
  • 2
  • @mohammed sure - understanding and evaluating hash functions is a complicated matter. Cheers. – Tony Delroy Jan 09 '16 at 12:45