13

Again this sample is a very simplified version of my actual problem involving a custom comparer for linq grouping. What have I done wrong?

The code below produces the result below (1.2, 0), (4.1, 0), (4.1, 0), (1.1, 0),

however I was expecting the following since 1.1 and 1.2 are < 1.0 apart. (1.2, 0), (1.1, 0), (4.1, 0), (4.1, 0),

class Program
{
    static void Main(string[] args)
    {
        IEnumerable<Point> points = new List<Point> { 
            new Point(1.1, 0.0)
            , new Point(4.1, 0.0) 
            , new Point(1.2, 0.0)
            , new Point(4.1, 0.0)
        };

        foreach (var group in points.GroupBy(p => p, new PointComparer()))
        {
            foreach (var num in group)
                Console.Write(num.ToString() + ", ");

            Console.WriteLine();
        }

        Console.ReadLine();
    }
}

class PointComparer : IEqualityComparer<Point>
{
    public bool Equals(Point a, Point b)
    {
        return Math.Abs(a.X - b.X) < 1.0;
    }

    public int GetHashCode(Point point)
    {
        return point.X.GetHashCode()
            ^ point.Y.GetHashCode();
    }
}

class Point
{
    public double X;
    public double Y;

    public Point(double p1, double p2)
    {
        X = p1;
        Y = p2;
    }

    public override string ToString()
    {
        return "(" + X + ", " + Y + ")";
    }
}
DustyB
  • 145
  • 1
  • 1
  • 8
  • 1
    I don't think you can use group by as a solution to cluster points. One reason is that GetHashcode must return the same hash for items that are equal. – Maciej Grzyb Jun 09 '16 at 18:45
  • I know it's not been done here, but one thing to beware of is: don't use `PointComparer.Default` instead of `new PointComparer()` - in spite of what the documentation says, it doesn't create a new `PointComparer` and instead creates an `ObjectEqualityComparer\`1`. – Matt Arnold Apr 27 '21 at 12:53

2 Answers2

19

The grouping algorithm (and I think all LINQ methods) using an equality comparer always first compares hash codes and only executes Equals if two hash codes are equal. You can see that if you add tracing statements in the equality comparer:

class PointComparer : IEqualityComparer<Point>
{
    public bool Equals(Point a, Point b)
    {
        Console.WriteLine("Equals: point {0} - point {1}", a, b);
        return Math.Abs(a.X - b.X) < 1.0;
    }

    public int GetHashCode(Point point)
    {
        Console.WriteLine("HashCode: {0}", point);
        return point.X.GetHashCode()
            ^ point.Y.GetHashCode();
    }
}

Which results in:

HashCode: (1.1, 0)
HashCode: (4.1, 0)
HashCode: (1.2, 0)
HashCode: (4.1, 0)
Equals: point (4.1, 0) - point (4.1, 0)
(1.1, 0), 
(4.1, 0), (4.1, 0), 
(1.2, 0), 

Only for the two points with equal hash codes Equals was executed.

Now you could trick the comparison by always returning 0 as hash code. If you do that, the output will be:

HashCode: (1.1, 0)
HashCode: (4.1, 0)
Equals: point (1.1, 0) - point (4.1, 0)
HashCode: (1.2, 0)
Equals: point (4.1, 0) - point (1.2, 0)
Equals: point (1.1, 0) - point (1.2, 0)
HashCode: (4.1, 0)
Equals: point (4.1, 0) - point (4.1, 0)
(1.1, 0), (1.2, 0), 
(4.1, 0), (4.1, 0), 

Now for each pair Equals was executed, and you've got your grouping.

But...

What is "equal"? If you add another point (2.1, 0.0), which points do you want in one group? Using the symbol for the fuzzy equality, we have -

1.1 ≈ 1.2
1.2 ≈ 2.1

but

1.1 !≈ 2.1

This means that 1.1 and 2.1 will never be in one group (their Equals never passes) and that it depends on the order of the points whether 1.1 or 2.1 are grouped with 1.2.

So you're on a slippery slope here. Clustering points by proximity is far from trivial. You're entering the realm of cluster analysis.

Gert Arnold
  • 105,341
  • 31
  • 202
  • 291
  • This will take some thought. I tried your suggestion of always returning hash code 0 and with my actual data (not in the sample) it works well enough. I'll have to analyze where it might fail. – DustyB Jun 09 '16 at 20:22
  • 1
    One way to get some kind of regularity (predictable results) is to always group points ordered the same way before grouping (for instance X first, then Y). – Gert Arnold Jun 09 '16 at 20:25
5

Don't forget the effects of GetHashCode. There is an expectation that GetHashCode will always return the same value for any two objects for each Equals would return true. If you fail to meet that expectation, you'll get unexpected results.

Specifically, GroupBy probably uses something like a hash table to allow it to group items together without having to compare every item with every other item. If GetHashCode returns a value that doesn't end up putting two objects into the same bucket of the hash table, it'll assume that they're not equal, and will never try to call Equals on them.

You will find, as you try to figure out a correct implementation for GetHashCode that there's a fundamental problem with how you're trying to group your objects. What would you expect if you had points with x-values of 1.0, 1.6, and 2.2? 1.0 and 2.2 are too far from one another to fall into the same group, but 1.6 is close enough to both other points that it should be in the same group with them. So your Equals method breaks the Transitive property of equality:

whenever A = B and B = C, then also A = C

If you're trying to do cluster grouping, you're going to need to use a more different data structure and algorithm. If you're just trying to normalize the points' locations somewhat, you might be able to just say points.GroupBy(p => (int)p.X) and avoid the equality comparer entirely.

StriplingWarrior
  • 151,543
  • 27
  • 246
  • 315
  • Thata is exactly what I'm seeing, my GetHasCode has a significant impact on this problem. Any change I've made to it results in change in my output. What should my GetHashCode method look like? – DustyB Jun 09 '16 at 18:50
  • 1
    @DustyB: See my updated answer. Your current definition for what makes two items "equal" is untenable. You have to come up with a more concrete idea of what you're trying to group items by, or use a different data structure and algorithm that's based on clustering rather than equality. – StriplingWarrior Jun 09 '16 at 18:55
  • My actual problem uses 3D points and I want to cluster them such that points within 10 units and on the same plane group together. Is it feasible to do that with a custom comparer? – DustyB Jun 09 '16 at 19:00
  • 1
    @DustyB: Not reliably. If point A and B are within 10 units of each other, and B and C are within 10 units of each other, but A is not within 10 units of C, then should they all be in a group together or not? No matter how you answer that question, any grouping algorithm you choose based purely on *equality* is going to fail to produce your desired output under some circumstances. – StriplingWarrior Jun 10 '16 at 19:24