Using C# HashSet to solve problems where equal is not equal

Question

I'm basing this on performance characteristics I've recently found out about Dictionary, so I'm using Dictionary<type, bool> where the bool is ignored but supposedly I could use HashSet instead.

For example:

Dictionary<bounds, bool> overlap;

class bounds
{
    public float top_left_x, top_left_y, width, height;

    public bool equal(bounds other)
    {
        return upper_left_x + width > other.upper_left_x &&
        upper_left_x < other.upper_left_x + other.width &&
        upper_left_y + height > other.upper_left_y &&
        upper_left_y < other.upper_left_y + other.height;
    }

    public ... GetHashCode()
    {
        ...;
    }
}

Here I'm not using equal to check for equality but instead for overlapping, which is bound to be annoying elsewhere but there is a reason why I'm doing it.

I'm presuming that if a value can be looked up from a key in O(1) time then so can a key from itself.

So I could presumably put thousands of bounds into overlap and do this:

overlap.ContainsKey(new bounds(...));

To find out in O(1) time if a given bound overlaps any others from the collection.

I'd also like to know what happens if I change the (x, y) position of a bound, presumably it's like removing and then adding it into the set again performance wise, very expensive?

What do I put into the GetHashCode function?

goal

If this works then I'm after using this sort of mechanism to find out what other bounds a given bound overlaps.

Very few bounds move in this system and no new ones are added after the collection has been populated. Newly added bounds need to be able to overlap old ones.

conclusion

See the feedback below for more details.

In summary it's not possible to achieve the O(1) performance because, unlike the default equals, a check for overlapping is not transitive.

An interval tree however is a good solution.

This won't work if you try to add bounds which overlap with any of the already inserted bounds. It would, though, if you create a derived class which does the overlapping check (assuming that you create a good hashcode) — Nuffin, Jan 06 '12 at 23:27
Thanks. They do need to be able to. Can you give an answer with an example of this in it? — alan2here, Jan 06 '12 at 23:33
Related question: http://stackoverflow.com/questions/6774841/implementing-a-geographic-coordinate-class-equality-comparison — CodesInChaos, Jan 07 '12 at 10:59

Eric Lippert · Answer 1 · 2012-01-07T00:03:32.960

11

The equality relation is completely the wrong relation to use here because equality is required to be an equivalence relation. That is, it must be reflexive -- A == A for any A. It must be symmetric -- A == B implies that B == A. And it must be transitive -- if A == B and B == C then A == C.

You are proposing a violation of the transitive property; "overlaps" is not a transitive relation, therefore "overlaps" is not an equivalence relation, and therefore you cannot define equality as overlapping.

Rather than trying to do this dangerous thing, solve the real problem. Your goal is apparently to take a set of intervals, and then quickly determine whether a given interval overlaps any of those intervals. The data structure you want is called an interval tree; it is specifically optimized to solve exactly that problem, so use it. Under no circumstances should you attempt to use a hash set as an interval tree. Use the right tool for the job:

http://wikipedia.org/wiki/Interval_tree

edited Jan 07 '12 at 00:03

answered Jan 06 '12 at 23:58

Eric Lippert

647,829
179
1,238
2,067

I don't like creating something that advertises differently than it behaves, for example called equals but actually does something different. I'm only doing so to exploit the O(1) performance aspect, there are a lot of bounds to check in a short time. The problem is that perhaps it's impossible anyway. The end goal is to know what 2D intervals overlap a given 2d interval. Thanks for the interval tree idea. I've not heard of those before. Is there one built into C#? – alan2here Jan 07 '12 at 00:08
4

@alan2here: The O(1) behaviour is *predicated* on equality being an equivalence relation. The whole point of the hash set algorithm is that it takes advantage of the fact that equality is always an equivalence relation to get good performance. If you violate that requirement then either the results will be wrong or slow. – Eric Lippert Jan 07 '12 at 00:13
3

@alan2here: 2-d rectangle overlapping is a straightforward extension of 1-d interval overlapping. You can solve the 2-d problem (or the n-d problem, for that matter) with nested interval trees; see the wikipedia page for a sketch. – Eric Lippert Jan 07 '12 at 00:16
@alan2here: incidentally, if you are interested in solving a related problem -- how to get closest match to a set of points in a high-dimensional vector space -- see my articles that sketch out how to so so: http://blogs.msdn.com/b/ericlippert/archive/tags/high+dimensional+spaces/. If you are interested in ways that violating transitivity or other properties of equality and ordering relations can cause errors, see my series of articles that starts here: http://blogs.msdn.com/b/ericlippert/archive/2011/01/20/spot-the-defect-bad-comparisons-part-one.aspx – Eric Lippert Jan 07 '12 at 00:22
I'm so late to the party, but if anyone's curious about closest match in high dimensional spaces.... think about high dimensional cubes :) – mhand Aug 01 '21 at 23:50

score 8 · Accepted Answer · answered Jan 06 '12 at 23:34

8

Here I'm not using equal to check for equality but instead for overlapping, which is bound to be annoying elsewhere but there is a reason why I'm doing it.

I'm assuming this means you will have a scenario where A.Equals(B) is true, B.Equals(C) is true, but A.Equals(C) is false. In other words, your Equals is not transitive.

That is breaking the rules of Equals(), and as a result Dictionary will not work for you. The rule of Equals/GetHashCode is (from http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx):

If two objects compare as equal, the GetHashCode method for each object must return the same value.

If your Equals is not transitive, then you can't possibly write a valid GetHashCode.

answered Jan 06 '12 at 23:34

Joe Strommen

1,236
10
18

2

You can write a _valid_ one (everything will work totally fine if you just return 0)... but a _good_ one is totally impossible. – Nuffin Jan 06 '12 at 23:37
@Tobias o.O, won't that result in the Dictionary/HashSet just beeing a glorified List who's probably even less optimized at doing what it's trying to do? – Alxandr Jan 06 '12 at 23:41
:¬( Is there some way to achieve the same behaviour and performance characteristics using a similar approach but thats not quite the same as I described? – alan2here Jan 06 '12 at 23:42
@Alxandr, yes, it will result in the hash set performing as a linked list. Conclusion: the OP *can* do what he wants, but he *shouldn't*. – phoog Jan 06 '12 at 23:47
Are you saying that if I were to implement the above as I have described then it would only perform as well as a linked list? Doesn't the default equals take a constant time to complete, the same as my custom equals function? But somehow it works at O(1) with the faster but still slightly costly default equals. Is the difference all in the transitiveness. – alan2here Jan 06 '12 at 23:49
@alan2here: Both your and the default equals evaluate at O(1), but the problem is that the constant hascode will cause all items to be inserted at the same place, thus negating the advantage of hascode calculation (which causes the normal almost O(1) runtime complexity) – Nuffin Jan 06 '12 at 23:51
My custom hash code calculation is more constant than the default one? – alan2here Jan 07 '12 at 00:00
1

@alan2here: Follow the logic. (1) if A equals B then A.GHC()==B.GHC() must be true by the rules of GetHashCode. You can have a situation where A equals B, B equals C but A does not equal C. By the rule just stated, A.GHC()==B.GHC() and B.GHC() == C.GHC(), and therefore A.GHC()==C.GHC() even though A does not equal C. Follow this through to its logical conclusion: **every call to GHC must return the same number**. Since every call returns the same number, and the hash table's performance depends on the result being well-distributed, the performance will be bad. – Eric Lippert Jan 07 '12 at 00:09
Thanks Eric Lippert, Alxandr, phoog and Tobias. I appreciate the explanation. – alan2here Jan 07 '12 at 00:16

Nuffin · Answer 3 · 2012-01-06T23:56:37.833

If you use the derived class approach I mentioned above, you need the following:

public class Bounds
{
    public Point position;
    public Point size; // I know the width and height don't really compose
                       // a point, but this is just for demonstration

    public override int GetHashCode(){...}
}

public class OverlappingBounds : Bounds
{
    public override bool Equals(object other)
    {
        // your implementation here
    }
}

// Usage:
if (_bounds.ContainsKey(new OverlappingBounds(...))){...}

but since the GetHashCode() method needs to return always the same value, the runtime complexity will most likely be at O(n) instead of O(1).

lol ok, not that useful to me then. Thanks for providing an answer. I hope somone will find it helpful. — alan2here, Jan 07 '12 at 00:11

Elian Ebbing · Answer 4 · 2012-01-07T00:29:39.983

1

You can't use a Dictionary or HashSet to check if bounds overlap. To be able to use a dictionary (or hashset), you need an Equals() and a GetHashCode() method that meet the following properties:

The Equals() method is an equivalence relation
a.Equals(b) must imply a.GetHashCode() == b.GetHashCode()

You can't meet either of these requirements, so you have to use another datastructure: An Interval tree.

edited Jan 07 '12 at 00:29

answered Jan 07 '12 at 00:03

Elian Ebbing

18,779
5
48
56

score 0 · Answer 5 · answered Jan 06 '12 at 23:38

0

You can not guarantee O(1) performance on dictionary where you customize hashcode calculation. If I put inside GetHashCode() method some WebService request which should control for me the equality of 2 provided items, it's clear that the time can never be O(1) as expected. Ok, this is kind of "edge case", but just to give an idea.

By doing in a way you think to do (assuming that this is even possible), imo, you negate the benefits provided by Dictionary<K,V>, so the constant key recovery time also on big collections.

It need to be measured on reasonable amount of objects you have, but I would first try to use List<T> like an object holder, and make something like this:

var bounds = new List<Bound> {.... initialization... }
Bound providedBound = //something. Some data filled in it. 
var overlappedany = bounds.Any<Bound>(b=>return b.Equals(providedBound));

answered Jan 06 '12 at 23:38

Tigran

61,654
8
86
123

Will the example you have given not have to check every bound in the list, breaking the O(1) time down to at most O(n)? Also I don't know what my hash code calculation will be but the equality calculation is a constant time the same as the default equality check. – alan2here Jan 06 '12 at 23:45
O(1) means "constant time"; it doesn't mean "fast." So O(1) is generally better than O(N), but not always. Your example is one such case, but it's not an example of the dictionary failing to provide O(1) performance. A better example would be a GetHashCode() implementation that always returns the same value. – phoog Jan 06 '12 at 23:46
@alan2here: I'm talking about that when you override Equality/HashCodeCalc you breaking, basically, O(1) recovery rule from `Dictionary`, in this case, it worth to look on collection like a placeholder. That is. It's not about "good or bad", it's about "better and worse". – Tigran Jan 06 '12 at 23:49

Using C# HashSet to solve problems where equal is not equal

5 Answers5

Linked