
I have a custom object that we'll call "MyObject". It has three major properties, X, Y, and Z, that determine whether it's unique or not. I have a HashSet containing 400,000 "MyObject"s. My original solution for generating a unique hash code was simple and fast.

return Convert.ToInt32(X * 76 + Y * 100 + Z * 23);

However, the integer generated from this was not unique enough. With the current hash code, these two points match, even though the Y values are slightly different:

X: 392598.200000000190 Y: 4935367.900000000400

X: 392598.200000000190 Y: 4935367.900580000100

What I've tried:

double value = (X * 101 + Y * 89 + Z * 56);
return value.GetHashCode();
  • Extremely accurate: with 1 to 10,000 records, it only takes a few seconds to compute the differences. However, with 400,000 records it bogs down hard. I let it run for 17 hours, and it still hadn't returned my results.
  • Converting to a string, then getting the hash code of the string. Precise, but uselessly slow.
  • Increasing the multipliers of X, Y, and Z. The number generated becomes too large. I tried the method used here: http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx

    return ((int)value ^ (int)(value >> 32));
    

However, it doesn't allow for any more integers. I also worry that even if I increase the size, it may become uselessly slow like my other solutions.

I can't do additional checks when hashes match, since 390,000 of the 400,000 records will likely match.

What is the best solution? Or is there a way to make my two already precise operations significantly faster? I was thinking about removing all zeros from the values after the decimal point until a non-zero digit is reached, and then using my original logic, i.e. 45.0002030 would become 45.2030.
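A minimal sketch of that tolerance idea, assuming rounding to a fixed number of decimal places (five here, an arbitrary choice) rather than stripping zeros; the 17/23 combining constants are a common convention, not from the code above:

```csharp
using System;

class MyObject
{
    public double X, Y, Z;

    // Sketch only: quantize each coordinate before hashing so values that
    // differ only past the 5th decimal place produce the same hash code.
    public override int GetHashCode()
    {
        unchecked // let the multiplications wrap instead of throwing
        {
            int hash = 17;
            hash = hash * 23 + Math.Round(X, 5).GetHashCode();
            hash = hash * 23 + Math.Round(Y, 5).GetHashCode();
            hash = hash * 23 + Math.Round(Z, 5).GetHashCode();
            return hash;
        }
    }
}
```

If you go this route, `Equals` would need to apply the same rounding, or the HashSet will treat nearly-equal points inconsistently.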

Evan Parsons
  • Not really an answer, but might help as a minor optimization: if `X`, `Y`, and `Z` are properties perhaps build the hash by referencing the backing fields instead. Also, are you testing/benchmarking under "release" mode? – Chris Sinclair Jun 11 '13 at 13:01
  • Hi Chris, not sure what you mean by release mode? – Evan Parsons Jun 11 '13 at 13:18
  • "release" configuration as opposed to "debug". In "debug", compiler optimizations are not applied, and especially with the debugger attached a whole bunch of extra changes and lack of optimizations are applied. http://stackoverflow.com/questions/367884/debug-release-difference – Chris Sinclair Jun 11 '13 at 13:30
  • I'm running it on a straight up local IIS – Evan Parsons Jun 11 '13 at 13:43
  • Right-click on your solution in the solution explorer, select "Properties". Then on the left click "Configuration Properties -> Configuration". At the top-left, check that the "Configuration: " drop-down-list is set to "Release". http://stackoverflow.com/questions/12597252/why-the-web-site-project-does-not-have-release-mode-in-config – Chris Sinclair Jun 11 '13 at 14:11

1 Answer


You can easily calculate a reasonable hash code from several objects like so:

public override int GetHashCode()
{
    unchecked // allow the multiplications to wrap on overflow instead of throwing
    {
        int hash = 17;

        hash = hash * 23 + X.GetHashCode();
        hash = hash * 23 + Y.GetHashCode();
        hash = hash * 23 + Z.GetHashCode();

        return hash;
    }
}

You can fold in as many hash codes as you like this way, as you add new fields to your class that must contribute to the hash code.

This is generally a fast operation.

Also note that if you have immutable types, you can speed things up by either calculating the hash code in the immutable type's constructor or by lazily calculating it on demand (and then caching the result).
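For example (a sketch assuming X, Y, and Z are fixed at construction; the `_hash` field and constructor shape are illustrative):

```csharp
class MyObject
{
    public readonly double X, Y, Z;
    private readonly int _hash; // computed once, since the fields never change

    public MyObject(double x, double y, double z)
    {
        X = x; Y = y; Z = z;
        unchecked // wrap on overflow instead of throwing
        {
            int hash = 17;
            hash = hash * 23 + X.GetHashCode();
            hash = hash * 23 + Y.GetHashCode();
            hash = hash * 23 + Z.GetHashCode();
            _hash = hash;
        }
    }

    public override int GetHashCode()
    {
        return _hash; // no recomputation on every lookup
    }
}
```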

[EDIT]

Where you saw your code slowing down, are you sure that wasn't because you were getting a lot of hash code collisions, rather than the hash code calculation itself being too slow?

For example, if you just returned 0 for every hash code, it would be very fast but adding to the hash collection would be extremely slow after a while.

I would expect the time taken to calculate the hash codes like this would be dwarfed by the amount of time taken actually adding the items to the collection.
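To illustrate the collision point (a sketch, not from the question: `BadHash` is a hypothetical type whose hash code is always 0):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Every instance hashes to the same bucket, so HashSet.Add must walk an
// ever-growing chain: inserting n items costs O(n^2) comparisons overall.
class BadHash
{
    public int Value;

    public override int GetHashCode() { return 0; } // every item collides

    public override bool Equals(object obj)
    {
        BadHash other = obj as BadHash;
        return other != null && other.Value == Value;
    }
}

class Demo
{
    static void Main()
    {
        var set = new HashSet<BadHash>();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 20000; i++)
            set.Add(new BadHash { Value = i }); // each Add scans all prior items
        Console.WriteLine("20,000 inserts with constant hash: " + sw.Elapsed);
    }
}
```

The hash calculation itself is trivially cheap here; all the time goes into the collision chain, which is why a poorly distributed hash only hurts at large record counts.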

[Second Edit]

The implementation of double.GetHashCode() (obtained via Reflector) is:

public override unsafe int GetHashCode()
{
    double num = this;
    if (num == 0.0)
    {
        return 0;
    }
    long num2 = *((long*) &num);
    return (((int) num2) ^ ((int) (num2 >> 32)));
}

which looks pretty quick to me.

Matthew Watson
  • Anytime I convert a double to its hash code, it slows right down. The X, Y, and Z values are generated from reading a binary file. What I find weird is that there is no speed penalty at all when I read 1-10,000 records. As soon as it hits 400,000, my app becomes useless. Are you saying that my speed penalty might be from generating the HashSet of 400,000 objects and not the comparing part? – Evan Parsons Jun 11 '13 at 13:15
  • I looked up immutable types, would implementing this make a drastic difference? (Difference between 24 hours and let's say 3 minutes) Or is it just a micro-improvement? – Evan Parsons Jun 11 '13 at 13:18
  • @EvanParsons In the case that calculating the hash codes is by far the slowest thing, then yes it would make a noticeable difference. BUT I think your slowness is coming from somewhere else (possibly as a result of too many hash code collisions). – Matthew Watson Jun 11 '13 at 13:26
  • my code doesn't do anything unless there isn't a match, `if (modPointZList.Contains(pointz)) { foundIt = true; } if (!foundIt) { notFoundRecords.Add(pointz); }` I will try experimenting... open to any ideas on generating a hash code that doesn't involve getting the hash code of a double. – Evan Parsons Jun 11 '13 at 13:46
  • @EvanParsons Double's GetHashCode() is pretty quick though. I'm fairly sure your slowdown is elsewhere. For completeness, I'm posting the implementation of `double.GetHashCode()` – Matthew Watson Jun 11 '13 at 14:01
  • I converted X,Y,Z, and M to readonly, (The less important properties are still normal variables) and it's still slow. They are generated when the object is created, so I believe you might be correct. – Evan Parsons Jun 11 '13 at 14:05
  • `modPointZList` is the `HashSet` I presume? – Chris Sinclair Jun 11 '13 at 14:13
  • Hi Matt, you may be onto something. I checked the count of notfoundrecords and it appears to have a count of 166,937. This is incorrect to my knowledge. – Evan Parsons Jun 11 '13 at 14:15
  • The issue I'm running into now, is that even with identical X,Y and Z coordinates, I'm receiving a different hashcode. This doesn't even seem possible. I outputted the records in question, and the X,Y and Z are all the same... There's several records like this out of 400,000. I could do a full comparison check and check individual parameters with a loop, but I'd rather generate the correct hashcode. – Evan Parsons Jun 11 '13 at 16:11
  • @EvanParsons It not only doesn't seem possible - it's *not* possible! You can see the code for `double.GetHashCode()` and it's simply not possible for identical values to produce different hash codes. But these are doubles so are you sure they aren't really 0.000000000000001 different? – Matthew Watson Jun 11 '13 at 16:27
  • Old Values: HashCode: -1303002189 X: 737923.057 Y: 5104426.825 Z: 0 M: 0 New Values: HashCode: -1301660595 X: 737923.057 Y: 5104426.825 Z: 0 M: 0 – Evan Parsons Jun 11 '13 at 17:06
  • @EvanParsons Well, you've only shown those numbers to 3 decimal places. There's probably a load of other decimal places after those that you can't see. – Matthew Watson Jun 11 '13 at 17:15
  • I rounded the decimal places off to five spots and that seems to fix it. I'd imagine outputting a double such as `result += "dblVal" + dblVal;` ignores the 000000000004 in 31.04000000000004 and just keeps 31.04? – Evan Parsons Jun 11 '13 at 17:24
  • Everything looks like it's working as it should, I appreciate the help! – Evan Parsons Jun 11 '13 at 17:44
  • 1
    @EvanParsons Aye, when you display a double as a string, by default it doesn't show all the decimal places. Glad it seems to be working now. :) – Matthew Watson Jun 11 '13 at 17:51