9

I am using a hash set in which I store array of integers(32 bits). This means I need an algorithm to hash an array of integers. I am looking for a 32 bits integer(C# int) hash.

I have tried and edited two existing algorithms, which you can see the four versions of at the bottom, including their benchmark.

My questions are as follows:

1. Do you think the bottom algorithm is good for this purpose?

2. Is there a better algorithm available for this purpose?

Program information

  • Typically an array has 16 entries, and the integers are smaller than 10, although both must support larger values. I could say the largest values that have any chance on occurring are 200 entries and integers of value 20.
  • I am using the HashSet in a breath-first search algorithm, to compare if two nodes are the same. http://en.wikipedia.org/wiki/Breadth-first_search.
  • For this specific program I am unable to use unsafe code.

Benchmarks and code

below are my benchmarks and code, from worst to best performance in my program.

  • Coordinates2D is a struct containing an int x and an int y.
  • The total entries in the HashSet at run end is 356525
  • I could not retrieve the number of collisions exactly. The number given is the number of times an object was actually compared and not equal(same hash, different object). This happens multiple times between the same objects though. This value varies a bit per execution, as the program is multithreaded.
  • MurMurHash3 seed is const uint seed = 144

MurMurHash3 using bytes retrieved from the coordinates directly

Code is equal to https://gist.github.com/automatonic/3725443 The array of bytes is retrieved using the following code:

int size = Marshal.SizeOf(typeof(Coordinates2D));
int length = carCoords.Length;
Byte[] bytes = new Byte[size * length];
for (int i = 0; i < length; ++i)
{
    GCHandle pinStructure = GCHandle.Alloc(carCoords[i], GCHandleType.Pinned);
    Marshal.Copy(pinStructure.AddrOfPinnedObject(), bytes, i*size, size);
    pinStructure.Free();
}

// Hash the byte array
return MurMurHash3.Hash(new System.IO.MemoryStream(bytes));

This is incredibly inefficient, because of the copying.

  • Performance: 40880ms
  • Collisions: < 84

MurMurHash3 using bytes retrieved from the integers in the objects

public static int Hash2(RushHourPathLengthNode.Coordinates2D[] coords)
{
    const uint c1 = 0xcc9e2d51;
    const uint c2 = 0x1b873593;

    uint h1 = seed;
    uint k1 = 0;
    uint streamLength = (uint)coords.Length * 2;

    for (int i = 0, l = coords.Length; i < l; ++i)
    {
        // Do it for X
        byte[] chunk = BitConverter.GetBytes(coords[i].x);

        /* Get four bytes from the input into an uint */
        k1 = (uint)
           (chunk[0]
          | chunk[1] << 8
          | chunk[2] << 16
          | chunk[3] << 24);

        /* bitmagic hash */
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;


        // Do it for y
        chunk = BitConverter.GetBytes(coords[i].y);

        /* Get four bytes from the input into an uint */
        k1 = (uint)
           (chunk[0]
          | chunk[1] << 8
          | chunk[2] << 16
          | chunk[3] << 24);

        /* bitmagic hash */
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;
    }

    // finalization, magic chants to wrap it all up
    h1 ^= streamLength;
    h1 = fmix(h1);

    unchecked //ignore overflow
    {
        return (int)h1;
    }
}

This is alot more efficient now the copying is gone.

  • Performance: 16640ms
  • Collisions: < 92

MurMurHash3 using integers

public static int Hash(RushHourPathLengthNode.Coordinates2D[] coords)
{
    const uint c1 = 0xcc9e2d51;
    const uint c2 = 0x1b873593;

    uint h1 = seed;
    uint k1 = 0;
    uint streamLength = (uint)coords.Length * 2;

    for (int i = 0, l = coords.Length; i < l; ++i)
    {
        k1 = (uint)coords[i].x;

        //bitmagic hash
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;

        k1 = (uint)coords[i].y;

        //bitmagic hash
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;
    }

    // finalization, magic chants to wrap it all up
    h1 ^= streamLength;
    h1 = fmix(h1);

    unchecked //ignore overflow
    {
        return (int)h1;
    }
}
  • Performance: 13027ms
  • Collisions: < 95

Hash using integer addition multiplication

int hash = 17;
for (int i = 0, l = carCoords.Length; i < l; ++i)
{
    hash = hash * 31 + carCoords[i].x;
    hash = hash * 31 + carCoords[i].y;
}
return hash;
  • Performance: 4564ms
  • Collisions: < 44

As you see, this one is far more efficient. It works well with any prime numbers. As I understand, there is no scientific proof of this to work, which I am not too fond of.

According to Michal B. a faster version would be using bitshifting. However, testing shows that this is not a successful hash. The problem takes significantly longer to run(It did not finish within 5 minutes). The bitshifting might be good, but it seems like the 31(prime number) is crucial.

int hash = 17;
for (int i = 0, l = carCoords.Length; i < l; ++i)
{
    hash = hash << 5 - carCoords[i].x;
    hash = hash << 5 - carCoords[i].y;
}
return hash;
Aart Stuurman
  • 3,188
  • 4
  • 26
  • 44
  • 1
    I suppose you don't want to use `unsafe` to handle the `int[]` as if it was a `byte[]`? – rene Nov 08 '13 at 08:39
  • For this specific program I am unable to use unsafe code. Thanks for the idea though. – Aart Stuurman Nov 08 '13 at 08:40
  • 2
    I think that'll be fine for a hash code. It wouldn't really be good enough for CRC checking, but for hash code you just need a reasonable distribution, which this should give. Also see this thread: http://stackoverflow.com/questions/3404715/c-sharp-hashcode-for-array-of-ints – Matthew Watson Nov 08 '13 at 08:42
  • Thanks you. I read the thread earlier. Doc Brown's answer is similar to what I provided here. I cannot say anything about how good the algorithm provided by BlueMonkMN is. Maybe you can? – Aart Stuurman Nov 08 '13 at 08:48
  • My answer to your first question is: this correction is horrible; don't use it. As an answer for your second question I want to ask you something: what are the exact constraints which have to be verified by your hashing algorithm? A hash table can include any kind of code (and you might create your own one outputing integers, by relying on a simple random generator, for example). Or if you are so interested in the algorithm from the link you provide, why not modifying it such that it deals with/delivers integers? Or are you expecting us to do that for you? – varocarbas Nov 08 '13 at 09:03
  • On you answer: why do you consider it bad? To continue. I do not expect you to alter MurMurHash to my needs. I could do that myself, but I put it aside for a moment, because I am curious if there exists an algorithm for this purpose already, since I guess it is pretty common. Could you explain a bit more what you mean by constraints? – Aart Stuurman Nov 08 '13 at 09:21
  • I have modified the MurMurHash3 algorithm provided in the link to use integers. It computes much slower than the code I provided in this post, but it has less collisions. Because of the computation speed my program using MurMur is 3 times slower though. – Aart Stuurman Nov 08 '13 at 09:36
  • Why do I consider that performing a conversion between two completely different types (not even linearly related) by doing an addition of a single number is a bad idea? The question is how you can consider it good at all? (I was looking for a C# post, but just found a Java one, for a proper conversion: http://stackoverflow.com/questions/1026761/how-to-convert-a-byte-array-to-its-numeric-value-java). What you did is more or less the same than the following: I know that 2+2 = 4 and 2*2 = 4 thus I conclude that x+2 = x*2 -> a completely wrong assumption with no reliability at all. – varocarbas Nov 08 '13 at 09:38
  • Still you haven't answered the really important question: why you need to rely on this algorithm at all? Also not sure how you have modified it and what was the final result (the converted-to-ints algorithm delivers an overall better performance or not?). Also, in order to discuss about anything, you should post the code we are discussing about. PS: if you don't write @ + nick and you are not in a question/answer created by this person, he might not get your messages. – varocarbas Nov 08 '13 at 09:41
  • @varocarbas I did not know the latter, thanks. I have to go now, but I will take my time answering everything in in 3 hours. Thanks – Aart Stuurman Nov 08 '13 at 09:45
  • You are welcome. I look forward to your answer. – varocarbas Nov 08 '13 at 09:49
  • Well, I started to write an answer, with the why and how of hashing, but I guess I'll wait for the big update as well now :) – Noctis Nov 08 '13 at 10:29
  • OK. I see it. All the codes output the same values? If this is the case, then the method to be chosen is clear. Logically, I cannot know what each code really does without doing various step-by-step runs. If all what the original source does can be emulated with your last code, there are two conclusions: 1. the original code was unnecessarily complex (to do something which still is not clear to me!! What is the point of this EXACT hashing technique?!); 2. you should chose your method. BUT what you cannot say is that you are converting bytes to integers by doing that... – varocarbas Nov 08 '13 at 13:45
  • ... and asking for a confirmation to check whether this is a good solution. This is a horrible solution to convert bytes to integers. But might be a perfect solution for this specific case (whose exact point, I insist, still don't understand). For example: 2.0 + 2.0 + 2.0; 3.0 + 3.0 + 3.0; etc. can be replaced with i*3 (where i is an integer) and this is surely a good solution for this problem (the question is: why this wasn't done in a first moment?), but you CANNOT say that doing i*3 converts doubles into integers. Hope that my idea is clearer now. – varocarbas Nov 08 '13 at 13:48
  • It is now clear to me where you think I convert bytes to integers. Where do you think I do this? – Aart Stuurman Nov 08 '13 at 13:50
  • PS: an actually if the results much exactly, my initial question (what is the exact point of this hashing technique?) is completely re-inforced: you are just adding the bits times 31! -> perfect hashing same thing that times a random number or adding abc at the end or whatever other thing. What was the point to consider this algorithm at all continues being the most relevant question here, IMO. – varocarbas Nov 08 '13 at 13:52
  • I might have misunderstood your point when you were talking about intending to skip the byte to integer conversion. But then I was pretty clear with the link I provided (explaining a conversion from byte to integer) which you also misunderstood, so I guess that we are even :) – varocarbas Nov 08 '13 at 13:53
  • The important point here is that I couldn't see the point of intending to improve something you shouldn't have relied at all in a first moment. You were meant to generate any code (apparently, under no constraint; I meant for example: "the hash code has to be a number between 1 to 50000; has to be even; its digits has to, etc.") and instead of creating a code to do that; you spent time finding something already done; more time intending to understand it and adapt it to your needs; more time intending to improve it further; and finally concluded that it was x + y*31. Well... you have got +7 :) – varocarbas Nov 08 '13 at 13:57
  • Instead of multiplying by 31, you can bit shift by 5 and subtract the number, which would increase the performance: `hash = hash << 5 - hash` – Michal B. Nov 08 '13 at 13:58
  • Alright. I know the last algorithm is not very logical(although it works in this case). This is why I hope someone comes up with a algorithm specifically designed for an integer array, instead of my algorithms, of which one is designed for a random byte array, and the other one, although it works here, is not a good hash for an integer array. – Aart Stuurman Nov 08 '13 at 13:58
  • @MichalB. Am I right to say that gives the exact same answer, but with better performance? I do not have good knowledge about bitshifting. I will look into it. – Aart Stuurman Nov 08 '13 at 14:00
  • @AartStuurman: yes, I didn't look closely at the question, but since you included performance measure, I gave you an idea how you can improve it. Btw. I still think that this algorithm is a good hash for an integer array – Michal B. Nov 08 '13 at 14:22
  • @MichalB. I did know some sort of upgrade using bitshifting existed, but I forgot to actually use it. Thanks for pointing it out – Aart Stuurman Nov 08 '13 at 14:25
  • can you post somewhere a textfile with the value used in the benchmark and an easy way to preload everything into the program, to test our self? – Fredou Nov 08 '13 at 14:44
  • and what is the maximum value of X and Y ? – Fredou Nov 08 '13 at 14:45
  • @aart stuurman if you are interested I have a murmurhash 3 library that may perform better. http://github.com/darrenkopp/murmurhash-net. Its also on NuGet murmurhash – Darren Kopp Nov 08 '13 at 14:58
  • Thank you. I will have a look, but I think murmurhash is too complex for my problem. – Aart Stuurman Nov 08 '13 at 15:09

2 Answers2

4

In the end I went with the last algorithm.

int hash = 17;
for (int i = 0, l = carCoords.Length; i < l; ++i)
{
    hash = hash * 19 + carCoords[i].x;
    hash = hash * 19 + carCoords[i].y;
}
return hash;

This is very fast to compute, and for the (small) numbers I am using the hash is awesome.

If you are going to use this, make sure the numbers you use are prime numbers. Because of this you cannot use bitshifting to optimize it.

Aart Stuurman
  • 3,188
  • 4
  • 26
  • 44
3

Have you considered using a space-filling curve to generate the hash? This will minimize (or eliminate) collisions for the chosen resolution (maxX, maxY)

Here are two SO questions and their answers that use this method.

  1. Mapping N-dimensional value to a point on Hilbert curve
  2. Calculate the Hilbert value of a point for use in a Hilbert R-Tree?

Hope this helps!

Community
  • 1
  • 1
Ani
  • 10,826
  • 3
  • 27
  • 46
  • This looks very promising. I will have a look into it. – Aart Stuurman Nov 08 '13 at 14:07
  • I have tried this, together with Z-order curve. These are both great, but not for this problem. Mapping the positions onto an int works, but it still leaves me with an array of integers. I can repeat the process or do something similar again, but in the end MurMurHash3 was faster. – Aart Stuurman Nov 08 '13 at 15:12