172

I wanted to store some pixel locations without allowing duplicates, so the first thing that comes to mind is HashSet&lt;Point&gt; or a similar collection. However, this seems to be very slow compared to something like HashSet&lt;string&gt;.

For example, this code:

HashSet<Point> points = new HashSet<Point>();
using (Bitmap img = new Bitmap(1000, 1000))
{
    for (int x = 0; x < img.Width; x++)
    {
        for (int y = 0; y < img.Height; y++)
        {
            points.Add(new Point(x, y));
        }
    }
}

takes about 22.5 seconds.

While the following code (which is not a good choice for obvious reasons) takes only 1.6 seconds:

HashSet<string> points = new HashSet<string>();
using (Bitmap img = new Bitmap(1000, 1000))
{
    for (int x = 0; x < img.Width; x++)
    {
        for (int y = 0; y < img.Height; y++)
        {
            points.Add(x + "," + y);
        }
    }
}

So, my questions are:

  • Is there a reason for that? I checked this answer, but 22.5 sec is way more than the numbers shown in that answer.
  • Is there a better way to store points without duplicates?
  • Similar question (by me): [Why are HashSets of structs with nullable values incredibly slow?](https://stackoverflow.com/q/39391107/7586) – Kobi Sep 12 '17 at 12:45
  • What are these "obvious reasons" for not using concatenated strings? What is the better way of doing it if I don't want to implement my own IEqualityComparer? – Ivan Yurchenko Aug 19 '19 at 19:21

2 Answers

303

There are two perf problems induced by the Point struct, something you can see when you add Console.WriteLine(GC.CollectionCount(0)); to the test code. You'll see that the Point test requires ~3720 collections but the string test only needs ~18. That is not for free. When you see a value type induce that many collections you need to conclude "uh-oh, too much boxing".
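A minimal sketch of that measurement (the Bitmap is dropped so the sample needs only System.Drawing's Point; the exact collection count depends on the runtime, and on modern .NET, where Point implements IEquatable&lt;Point&gt;, the boxing largely disappears):

```csharp
using System;
using System.Collections.Generic;
using System.Drawing; // Point

class GcCollections
{
    static void Main()
    {
        int before = GC.CollectionCount(0);
        var points = new HashSet<Point>();
        for (int x = 0; x < 1000; x++)
            for (int y = 0; y < 1000; y++)
                points.Add(new Point(x, y));
        Console.WriteLine(points.Count); // 1000000
        // On .NET Framework the default comparer boxes both Points on every
        // Equals call, so gen-0 collections pile up during the loop.
        Console.WriteLine(GC.CollectionCount(0) - before);
    }
}
```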

At issue is that HashSet&lt;T&gt; needs an IEqualityComparer&lt;T&gt; to get its job done. Since you did not provide one, it falls back to the one returned by EqualityComparer&lt;T&gt;.Default. That comparer does a good job for string, which implements IEquatable&lt;string&gt;. But not for Point: it is a type that dates from .NET 1.0 and never got the generics love. All its default comparer can do is use the Object methods, which box the struct on every call.

The other issue is that Point.GetHashCode() does not do a stellar job in this test, too many collisions, so it hammers Object.Equals() pretty heavily. String has an excellent GetHashCode implementation.
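The collision problem is easy to check directly. On .NET Framework, Point.GetHashCode() is effectively x ^ y (see the other answer), and a quick count shows how few distinct values that yields over the question's grid:

```csharp
using System;
using System.Collections.Generic;

class CollisionCount
{
    static void Main()
    {
        // Count distinct hash values over a 1000x1000 grid using the
        // x ^ y scheme that Point.GetHashCode() uses on .NET Framework.
        var hashes = new HashSet<int>();
        for (int x = 0; x < 1000; x++)
            for (int y = 0; y < 1000; y++)
                hashes.Add(x ^ y);
        // Both coordinates are below 1024, so x ^ y is also below 1024:
        // 1,000,000 points share at most 1024 distinct hash codes.
        Console.WriteLine(hashes.Count); // 1024
    }
}
```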

You can solve both problems by providing the HashSet with a good comparer. Like this one:

class PointComparer : IEqualityComparer<Point> {
    public bool Equals(Point x, Point y) {
        return x.X == y.X && x.Y == y.Y;
    }

    public int GetHashCode(Point obj) {
        // Perfect hash for practical bitmaps, their width/height is never >= 65536
        return (obj.Y << 16) ^ obj.X;
    }
}

And use it:

HashSet<Point> list = new HashSet<Point>(new PointComparer());

And it is now about 150 times faster, easily beating the string test.

Hans Passant
  • 26
+1 for providing the GetHashCode method implementation. Just out of curiosity, how did you come up with the particular `obj.X << 16 | obj.Y` implementation? – Akash KC Sep 11 '17 at 02:54
  • 35
    It was inspired by the way the mouse passes its position in windows. It is a perfect hash for any bitmap you'd ever want to display. – Hans Passant Sep 11 '17 at 06:53
  • 2
Good to know that. Any documentation or guidelines for writing hash codes like yours? Actually, I'd still like to know whether the above hash code comes from your experience or from a guideline you follow. – Akash KC Sep 11 '17 at 18:15
  • 7
@AkashKC I'm not very experienced with C#, but as far as I know integers are generally 32 bits. In this case you want the hash of 2 numbers, and by left-shifting one 16 bits you make sure the "lower" 16 bits of each number don't "affect" the other with `|`. For 3 numbers it could make sense to use 22 and 11 as shifts. For 4 numbers it would be 24, 16, 8. There will still be collisions, but only if the numbers get large. It also crucially depends on the `HashSet` implementation: if it uses open-addressing with "bit truncation" (I don't think it does!) the left-shift approach might be bad. – MSeifert Sep 12 '17 at 07:07
  • @AkashKC: The official documentation is at https://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx , but that's primarily a reference. I recommend reading Eric Lippert's excellent guide at https://blogs.msdn.microsoft.com/ericlippert/2011/02/28/guidelines-and-rules-for-gethashcode/ . – Brian Sep 12 '17 at 13:03
@MSeifert: Thanks for the comment. Now it makes sense to me and I understand the logic behind it. With a similar implementation, if we need a hash of 6 numbers, how can it be done with this approach? – Akash KC Sep 12 '17 at 18:58
  • 3
@HansPassant: I wonder if using XOR rather than OR in GetHashCode might be slightly better - in the event that point coordinates might exceed 16 bits (perhaps not on common displays, but in the near future). // XOR is usually better in hash functions than OR, since it loses less information, is reversible, etc. // e.g. If negative coordinates are allowed, consider what happens to the X contribution if Y is negative. – Krazy Glew Sep 13 '17 at 01:18
  • my understanding is that the best hash code shift values (except perhaps in this case) should be prime numbers? – Jim Sep 13 '17 at 06:00
  • 1
if `Point` is a `struct`, then comparing two objects would involve reflection, which can also be one of the reasons for it to be slow – Ehsan Sajjad Sep 13 '17 at 08:13
  • I don't disagree with Krazy. It doesn't matter for bitmaps but people might be using it for other purposes. Changed. – Hans Passant Sep 13 '17 at 08:19
  • @EhsanSajjad: Only when Equals is not overridden. Which it is. – Joey Sep 13 '17 at 10:05
  • 1
    @HansPassant: If we want to consider "purposes other than bitmaps" for `PointComparer.GetHashCode`, should we not consider as a defect the fact that the higher 16 bits of `Y` are lost? E.g. write `(obj.Y << 16) ^ obj.X ^ (obj.Y >> 16)` – Jean Hominal Sep 13 '17 at 11:09
  • 1
    Sigh. It is no more or less of a mistake as the original Point.GetHashCode() implementation. Blindly copying somebody's custom hash function is a mistake, but is bound to happen anyway because hashing is a black art to many programmers, all I can do is limit the damage when it happens. If you think you found a better algorithm then just test it to make sure it is. But do keep in mind that you can never beat a perfect hash. Like this one, perfect because it works with bitmaps. – Hans Passant Sep 13 '17 at 11:34
  • 1
    @JeanHominal: The nice thing about going from OR to XOR is that it improves "purposes other than bitmaps" without hurting the performance for bitmaps. Further, prefering XOR over OR is generally good practice when implementing `GetHashCode`). On the other hand, your proposal has an extra operation. This negligibly reduces readability/maintainability and probably hurts performance. Given that a substantial portion of SO users will be using `Point` for bitmaps, Hans' choice to optimize for bitmaps first is a reasonable decision...especially since the OP never contradicted this decision. – Brian Sep 13 '17 at 13:09
  • @HansPassant - minor nit, that probably is not significant in OP's case: for small bitmaps, using obj.X without any multiplier means poor distribution over the 32-bit hash space. Maybe that doesn't matter, given .NET's prime-length hash tables? I'm still learning; I thought it worth questioning what you did. I'd use something like `return ((long)x * 805306457 + (long)y * 189783887).GetHashCode();`. [CAVEAT: I haven't tested to confirm best two primes] On the other hand, your formula is the simplest, fastest hash code I've ever seen, so maybe perfect in this specific use. :) – ToolmakerSteve Mar 01 '18 at 01:57
Excellent. Very helpful. While the MSDN HashSet documentation states that performance depends on the quality of the hash algorithm (see the Remarks note in the link), it doesn't really say how to determine its quality or that you're using one of poor quality. https://msdn.microsoft.com/en-us/library/xfhwa508(v=vs.110).aspx – ripvlan Feb 27 '19 at 17:52
86

The main reason for the performance drop is all the boxing going on (as already explained in Hans Passant's answer).

Apart from that, the hash code algorithm worsens the problem, because it causes more calls to Equals(object obj) thus increasing the amount of boxing conversions.

Also note that the hash code of Point is computed by x ^ y. This produces very little dispersion in your data range, and therefore the buckets of the HashSet are overpopulated — something that doesn't happen with string, where the dispersion of the hashes is much larger.

You can solve that problem by implementing your own Point struct (trivial) and using a better hash algorithm for your expected data range, e.g. by shifting the coordinates:

(x << 16) ^ y

For some good advice when it comes to hash codes, read Eric Lippert's blog post on the subject.
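A minimal sketch of that approach (the struct name `Pixel` is made up for illustration; implementing IEquatable&lt;T&gt; lets HashSet compare values without the boxing described above):

```csharp
using System;
using System.Collections.Generic;

// An immutable point struct with the shifted hash from the answer.
readonly struct Pixel : IEquatable<Pixel>
{
    public readonly int X;
    public readonly int Y;
    public Pixel(int x, int y) { X = x; Y = y; }

    // IEquatable<Pixel> overload: called by HashSet<Pixel>, no boxing.
    public bool Equals(Pixel other) => X == other.X && Y == other.Y;
    public override bool Equals(object obj) => obj is Pixel p && Equals(p);
    // Good dispersion for coordinates below 65536.
    public override int GetHashCode() => (X << 16) ^ Y;
}

class Demo
{
    static void Main()
    {
        var pixels = new HashSet<Pixel>();
        for (int x = 0; x < 1000; x++)
            for (int y = 0; y < 1000; y++)
                pixels.Add(new Pixel(x, y));
        Console.WriteLine(pixels.Count); // 1000000 — no duplicates stored
    }
}
```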

InBetween
  • 1
    Sounds feasible but how have you come to this conclusion? – Martin Smith Sep 10 '17 at 16:28
  • 2
    @MartinSmith because there is no other good reason, if a hash set is slow for one type and not another under the same circumstances then it has to be due to the hash code implementation of the slow type not producing enough dispersion. – InBetween Sep 10 '17 at 16:32
  • 4
    Looking at the reference source of Point the `GetHashCode` performs: `unchecked(x ^ y)` while for `string` it looks much more complicated.. – Gilad Green Sep 10 '17 at 16:33
  • 2
Hmm.. well, to check if your assumption is correct, I just tried using `HashSet<long>` instead, and used `list.Add(unchecked(x ^ y));` to add values to the HashSet. This was actually even faster than `HashSet<string>` *(345 ms)*. Is this somehow different from what you described? – 41686d6564 stands w. Palestine Sep 10 '17 at 16:38
  • 2
    @AhmedAbdelhameed - Don't know but I have looked [here](https://referencesource.microsoft.com/#System.Drawing/commonui/System/Drawing/Point.cs,a041be61667d4c9a) – Gilad Green Sep 10 '17 at 16:45
  • @GiladGreen Me too, and the `GetHashCode` method does return `unchecked(x ^ y)`, but according to the test I mentioned in my last comment, this doesn't seem to be what slows things down. Or am I missing something? – 41686d6564 stands w. Palestine Sep 10 '17 at 16:50
  • 4
    @AhmedAbdelhameed that's probably because you are adding way less members to your hash set than you realize (again due to the horrible dispersion of the hash code algorithm). What's the count of `list` when you've finished populating it? – InBetween Sep 10 '17 at 16:55
  • 4
@AhmedAbdelhameed Your test is wrong. You are adding the same longs over and over, so actually there are just a few elements you are inserting. When inserting a `Point`, the `HashSet` will internally call `GetHashCode` and, for each of those points with the same hash code, will call `Equals` to determine whether it already exists – Ofir Winegarten Sep 10 '17 at 16:56
  • @OfirWinegarten, oh was I so stupid! Yes, you're absolutely right. I used @InBetween's suggestion, and created my own `point` struct, overrode `GetHashCode` with the suggested hash algorithm, and it worked like a charm. Thank you all for your help! – 41686d6564 stands w. Palestine Sep 10 '17 at 17:07
  • @InBetween, Yes, I was wrong about that. Sorry for the inconvenience, your answer really helped. Thank you :) – 41686d6564 stands w. Palestine Sep 10 '17 at 17:13
  • 49
    There's no need to implement `Point` when you can create a class that implements `IEqualityComparer` and keep compatibility with other things that work with `Point` while getting the benefit of not having the poor `GetHashCode` and the need to box in `Equals()`. – Jon Hanna Sep 10 '17 at 17:51
  • @JonHanna correct, but `Point` is simply so hideous that I tend to avoid it whenever possible. By the context I'm assuming Op only wants an immutable struct that represents coordinates (mutable types and hash sets don't mix well), so he's better off putting together his own. – InBetween Sep 10 '17 at 17:59
  • 2
    @InBetween I wouldn't bother to use `Point` unless I needed it for point stuff, in which case I need `Point`. On the plus-side, corefx has a better hash-code and an implementation of `IEquatable` which will hopefully move to netfx some day. – Jon Hanna Sep 10 '17 at 21:26
  • 1
@BensaysNotoPoliticsonSO First, you are mixing up *buckets* with *collisions*. Second, there are *way* more collisions than two per point: the hash code range is `[0, 1023]`, that's `1024` distinct hashes. The number of points is `1,000,000`, so on average you have `976` collisions per point. Third, due to the very low dispersion of the hash code algorithm, it's altogether possible that all `1,000,000` points are stored in the same bucket. – InBetween Sep 12 '17 at 12:17
  • @InBetween - You should remove the meta comment you've added to your answer. You don't need to apologize for not having the best answer. Your answer is still good. – Kobi Sep 12 '17 at 12:47