8

This question is similar to the one here.

We all know what PointF is, don't we? This is the data structure:

public struct PointF
{
  public float X;
  public float Y;
}

How to implement IEqualityComparer<PointF> with tolerance? Let's say my Equals code is like this

public const float Epsilon = 0.01; //say
public bool Equals(PointF pt1, PointF pt2)
{
   return Math.Abs(pt1.X-pt2.X)<Epsilon && Math.Abs(pt1.Y-pt2.Y)<Epsilon;
}

Question: How to implement the correct GetHashCode so that for a dictionary of PointF, I will access the element correctly?

I crack my head a few days but still can't find a satisfactory solution.

Community
  • 1
  • 1
Graviton
  • 81,782
  • 146
  • 424
  • 602
  • 1
    You want something like: for every point P,Q where P =~ Q then Hash(P) == Hash(Q). Right? (=~ means equals with tolerance). I guess this is really hard to get right. You might even try at MathOverflow to get find a function that will have that property – Vinko Vrsalovic Jan 13 '10 at 08:59
  • @Vinko, yes, that's what I want. – Graviton Jan 13 '10 at 09:02

3 Answers3

14

Instead of defining the tolerance by the distance, you could place the points in a grid.
If two points are in the same cell, they're considered equal and have the same hash code.

public bool Equals(PointF pt1, PointF pt2)
{
   return GetCell(pt1.X) == GetCell(pt2.X)
       && GetCell(pt1.Y) == GetCell(pt2.Y);
}

public int GetHashCode(PointF pt)
{
   return GetCell(pt.X) ^ GetCell(pt.Y);
}

private static int GetCell(float f)
{
    return (int)(f / 10); // cell size is 10 pixels
}

Thesis: There is no implementation of Equals and GetHashCode that meets your requirements.

Proof: Consider the following three points, A, B, and C:

Illustration

As per your requirements,

Equals(A, B) == true              // (i)
Equals(B, C) == true              // (ii)
Equals(A, C) == false             // (iii)
GetHashCode(A) == GetHashCode(B)  // (iv)
GetHashCode(B) == GetHashCode(C)  // (v)
GetHashCode(A) != GetHashCode(C)  // (vi)

But from (iv) and (v) follows

GetHashCode(A) == GetHashCode(C)

and thereby

Equals(A, C) == true

which contradicts (iii) and (vi).

Since Equals and GetHashCode cannot return different values for the same arguments, there is no implementation that meets your requirements. q.e.d.

Glorfindel
  • 21,988
  • 13
  • 81
  • 109
dtb
  • 213,145
  • 36
  • 401
  • 431
  • 1
    This is a *no-no*, let's assume that two points are in different grids but they are indefinitely close to each other, they are already not the same by your definition, whereas by any other definition they are the same. – Graviton Jan 13 '10 at 09:01
  • @Ngu That's a no-no in your particular use case, there are plenty of valid use cases for grids. – Vinko Vrsalovic Jan 13 '10 at 09:11
  • +1 Probably the only easily implemented solution that would give predictable results. When comparing a couple of points this and original (author) way can give different output (different cells, points close to two neighbor cell border while tolerance distance is acceptable) but in large quantity of points dtb solution would be deterministic. – Audrius Jan 13 '10 at 09:12
  • If you have multiple sets of points, then give them a container that tracks the origin for the grid, that way you can adjust for the different coordinate spaces (and indeed tolerances) so that this approach actually works. The only other approach you have is to deep scan the whole world each time to find those within the tolerance. One optimisation from that would then to bisect groups of points into squares or polygonal regions based on locality. – Andras Zoltan Jan 13 '10 at 09:28
  • loving the graphics there dtb :) – Andras Zoltan Jan 13 '10 at 09:32
  • 1
    Your proposition (vi) is incorrect... there's no requirement for unequal objects to have different hash-codes. A valid implementation of `GetHashCode` would be `return 0`! – Ben Lings Jun 21 '10 at 16:04
  • @Ben Lings: Good catch. However, the point of implementing GetHashCode is to be able to use the data type as key in a dictionary in order to get (almost) constant lookup complexity. If GetHashCode returns the same hash code for all values (and there is no other valid implementation), lookup complexity would become linear, which defeats the purpose of using a dictionary. You could equally use a list of key-value-pairs then instead and wouldn't need a GetHashCode implementation at all. So, while you're technically right, your insight does not open up any new way to solve the problem. :-) – dtb Jun 21 '10 at 18:01
  • 1
    Even though one could resolve (vi) by having `GetHashCode` return a constant, a proper implementation of `Equals` is supposed to behave as an equivalence relation, meaning that if `A` equals `B`, and `B` equals `C`, then `A` must equal `C`, contrary to (iii) above. On the other hand, the grid approach is a good one if, the grid spacing is such that any point within one's tolerance can be narrowed down to a few grid locations (preferably two). To check if a collection contains any points near a specified point, look for points that map to any of its nearby grid locations, and... – supercat Oct 08 '12 at 23:10
  • ...then check all such points for distance to the specified point. For example, if one has a tolerance of +/- 0.5, and the grid spot for a number is the most positive integer which is not larger than that number, then any point within 1.0 units of `X` must be located in the grid spot associated with `X-0.5` or the one associated with `X+0.5`. If only a few points are in either of those grid spots, the speed benefits of using a hashed collection may outweigh the need to perform two separate lookups. – supercat Oct 08 '12 at 23:13
1

I don't think it's possible because you could have an infinite sequence of values that are equal (within tolerance) to the previous and next value in the sequence but not any other value and GetHashCode would need to return an identical value for all of them.

Pent Ploompuu
  • 5,364
  • 1
  • 27
  • 47
  • I had never thought of that. Still would have liked a solution to the tolerance problem since I doubt there would be a possibility for an infinite sequence in a sparse data set with low tolerance. – Mateen Ulhaq Jan 21 '16 at 23:17
0

Well, the answer based on grids is good, but sometimes you need to group the close points anyway, even if they are not in the same grid cell. My approach is to implement this with a grouping: two points are in the same group if either they are close or there is a sequence of close points connecting them. This semantics cannot be done with a proper IEqualityComparer, because it needs to know all the items in advance before producing the groups. So I've done a simple LINQ-style operator GroupByCluster, which basically achieves this.

The code is here: http://ideone.com/8l0LH. It compiles on my VS 2010, but fails to compile on Mono because HashSet<> cannot be implicitly converted to IEnumerable<> (why?).

The approach is generic and thus not very efficient: it's quadratic on input size. For the concrete types it can be made more efficient: for example, for T = double we can just sort the input array and have O(n log n) performance. The analogous though more complicated trick is applicable for 2D points as well.


Note aside: your initial proposition is impossible to implement with IEqualityComparer, since your "approximate equality" is not transitive (but the equality in IEqualityComparer has to be so).

Vlad
  • 35,022
  • 6
  • 77
  • 199