6

When working with HashSets in C#, I recently came across an annoying problem: HashSets don't guarantee unicity of the elements; they are not Sets. What they do guarantee is that when Add(T item) is called the item is not added if for any item in the set item.equals(that) is true. This holds no longer if you manipulate items already in the set. A small program that demonstrates (copypasta from my Linqpad):

void Main()
{
    HashSet<Tester> testset = new HashSet<Tester>();
    testset.Add(new Tester(1));
    testset.Add(new Tester(2));
    foreach(Tester tester in testset){
      tester.Dump();
    }
    foreach(Tester tester in testset){
      tester.myint = 3;
    }
    foreach(Tester tester in testset){
      tester.Dump();
    }
    HashSet<Tester> secondhashset = new HashSet<Tester>(testset);
    foreach(Tester tester in secondhashset){
      tester.Dump();
    }
}

class Tester{
  public int myint;

  public Tester(int i){
    this.myint = i;
  }

  public override bool Equals(object o){
    if (o== null) return false;
    Tester that = o as Tester;
    if (that == null) return false;
    return (this.myint == that.myint);
  }

  public override int GetHashCode(){
    return this.myint;
  }

  public override string ToString(){
    return this.myint.ToString();
  }
}

It will happily manipulate the items in the collection to be equal, only filtering them out when a new HashSet is built. What is advicible when I want to work with sets where I need to know the entries are unique? Roll my own, where Add(T item) adds a copy off the item, and the enumerator enumerates over copies of the contained items? This presents the challenge that every contained element should be deep-copyable, at least in its items that influence it's equality.

Another solution would be to roll your own, and only accepts elements that implement INotifyPropertyChanged, and taking action on the event to re-check for equality, but this seems severely limiting, not to mention a whole lot of work and performance loss under the hood.

Yet another possible solution I thought of is making sure that all fields are readonly or const in the constructor. All solutions seem to have very large drawbacks. Do I have any other options?

Joe
  • 46,419
  • 33
  • 155
  • 245
Martijn
  • 11,964
  • 12
  • 50
  • 96
  • I'm not sure I understand the question... you want to know which kind of collection guarantees that no two items inside it are equal ? – Alex Jul 10 '12 at 10:11
  • Tim, it does, you might have to scroll down – Martijn Jul 10 '12 at 10:16
  • 4
    @Martijn: I've overlooked the `GetHashCode`. Anyway, he should read this blog from Eric Lippert: http://blogs.msdn.com/b/ericlippert/archive/2011/02/28/guidelines-and-rules-for-gethashcode.aspx It explains the rules and guidelines of it. For example: _"Guideline: the integer returned by GetHashCode should never change"_ and _"Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable"_ – Tim Schmelter Jul 10 '12 at 10:20
  • @Tim, thanks! that's pretty much where things are going wrong. This also means that once instantiated and added to a collection, an object can never change identity. The real problem isn't with the hash code, it's with the equals. I could change this trivialy to `public int GetHashCode() { return 0 }` and the hashcode would be unchanging, but the problem remains the same. – Martijn Jul 10 '12 at 11:01
  • Your 'roll your own' solution involving `INotifyPropertyChanged` points to the real problem here: what should the collection do when one of its elements is mutated to become 'identical' to another? Throw one out? Which one? You need to define semantics before you can look for a solution. – AakashM Jul 10 '12 at 12:58
  • @AakashM The most sensible thing in that scenario would IMO be to throw some exception and let the calling code handle it. – Martijn Jul 10 '12 at 14:44
  • @Martijn that's what I mean about needing to define semantics. Consider some unrelated and unknowing piece of code that happens to get handed a reference to an object and happens to want to change it - this code can sometimes receive an exception when **setting a property**, depending on whether the object is in a `StrictSet`?? *Very* surprising behaviour, wouldn't you agree? – AakashM Jul 10 '12 at 15:12
  • @AakashM quite, but not as surprising as ending up with duplicates in a datastructure I expect to exclude duplicates – Martijn Jul 10 '12 at 18:24

3 Answers3

6

You're really talking about object identity. If you're going to hash items they need to have some kind of identity so they can be compared.

  • If that changes, it is not a valid identity method. You currently have public int myint. It really should be readonly, and only set in the constructor.
  • If two objects are conceptually different (i.e. you want to treat them as different in your specific design) then their hash code should be different.
  • If you have two objects with the same content (i.e. two value objects that have the same field values) then they should have the same hash codes and should be equal.
  • If your data model says that you can have two objects with the same content but they can't be equal, you should use a surrogate id, not hash the contents.
  • Perhaps your objects should be immutable value types so the object can't change
  • If they are mutable types, you should assign a surrogate ID (i.e. one that is introduced externally, like an increasing counter id or using the object's hashcode) that never changes for the given object

This is a problem with your Tester objects, not the set. You need to think hard about how you define identity. It's not an easy problem.

Joe
  • 46,419
  • 33
  • 155
  • 245
  • So you would take some sort of option 3: only use HashSets that have immutable identities? How would you guarantee that? – Martijn Jul 10 '12 at 10:18
  • 1
    That's a whole other question. Read this: http://stackoverflow.com/questions/750947/net-unique-object-identifier – Joe Jul 10 '12 at 10:24
  • 1
    "If you have more than one object that has a given identity, it is not a valid identity method" -- true (I think), but I misread this at first: it would help if you could rephrase this in a way that doesn't suggest to the careless reader that two different objects must return different hash codes from `GetHashCode` -- that is of course perfectly valid. –  Jul 10 '12 at 10:33
  • "If two objects are conceptually different (i.e. you want to treat them as different in your specific design) then their hash code should be different."? No, that's how I hoped I had misread at first, but that's wrong. Two different objects can have the same hash code and not compare equal. And a HashSet has no problems with that. For a simple example, there are only 2**32 possible hash codes, do you expect all hash codes for all valid values of an Int64 to be different? –  Jul 10 '12 at 10:57
  • The conclusion here is that you can't have *mutable* objects that define meaninginfull equality based on the values of their properties. Unfortunately, there are no constructs in the language that help you guarantee that. – Martijn Jul 10 '12 at 11:08
  • 1
    The conclusion is that if you values change then you can't use them to define identity. There are strong guidelines and a hell of a lot of framework but they can't stop you shooting yourself in the foot. Some people have valid use cases for shooting themselves in the foot. – Joe Jul 10 '12 at 11:11
  • @Joe, to indicate why this is much easier sounding than it is in practice, how would you implement equality on a mutable IntegerContainer while following the guidelines? (as far as I can see, this is impossible) – Martijn Jul 10 '12 at 11:18
  • Ah now Integers are different. An Integer is a value type. its identity is itself. The `Integer` class (a boxed primitive integer) is immutable for that reason. If you have objects that can change integer values, then two objects which have the same value would then become 'equal'. What would you expect to happen? In this case I would advise a surrogate id. – Joe Jul 10 '12 at 11:30
  • Quickly thinking this through, this problem holds true for *all* reference types that do not have an equals based on the reference identity (i.e. a.equals(b) could return something else than ReferenceEquals(a, b) ). Consider the same problem I have above arises when instead of Tester I'd used `List` (for any Foo) and thus `HashSet> testset;`. after adding lists that contain different Foo instances, I call `foreach(List list in testset){ list.Clear(); }`. The result will be the same as with my Tester class. (I'll edit the anwer to include this example). – Martijn Jul 10 '12 at 15:13
  • It seems the .NET team has set the example that *all objects that implement value equality* (like I did with the Tester class) *should be immutable*. (like I neglected with the Tester class) Contrast with Java: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/List.html#equals(java.lang.Object) where Lists are equal iff their elements are. – Martijn Jul 10 '12 at 15:41
  • The philosophy for .NET and Java are broadly similar here. Equality of ICollections / Collections is equal to the equality of their contents in both cases IIRC (bit rusty). – Joe Jul 10 '12 at 15:48
  • I just checked, in .NET equality of collections (e.g. a list) does not depend on the equality of the elements, but on the reference. Lists with equal elements are not equal in .NET (but are in Java). – Martijn Jul 10 '12 at 18:01
  • Well remember the original question concerned the equality of mutable data structures. A Collection is one such example, just one level up. Thanks for updating the thread, but we're getting a little bit off the original topic. – Joe Jul 10 '12 at 19:14
0

When I need a 1-dimensional collection of guaranteed unique items I usually go with Dictionary<TKey, Tvalue>: you cannot add elements with the same Key, plus I usually need to attach some properties to the items and the Value comes in handy (my go-to value type is Tuple<> for many values...).

OF course, it's not the most performant nor the least memory-hungry solution, but I don't usually have performance/memory concerns.

Alex
  • 23,004
  • 4
  • 39
  • 73
0

You should implement your own IEqualityComparer and pass it to the constructor of the HashSet to ensure you get the desired equality comparer.

And as Joe said, if you want the collection to remain unique even beyond .Add(T item) you need to use ValueObjects that are created by the constructor and have no publicly visible set attributes. i.e.

M Afifi
  • 4,645
  • 2
  • 28
  • 48