8

I created to following code to verify the uniqueness of a series of "tuples":

struct MyTuple
{
   public MyTuple(string a, string b, string c)
   {
      ValA = a; ValB = b; ValC = c;
   }
   private string ValA;
   private string ValB;
   private string ValC;
}

...

HashSet<MyTuple> tupleList = new HashSet<MyTuple>();

If I'm correct, I will not end up with two tuples with the same values in my HashSet thanks to the fact that I'm using a struct. I could not have the same behavior with a class without implementing IEquatable or something like that (I didn't dig too much how to do that).

I want to know if there is some gotcha about what I do. Performance wise, I wouldn't expect the use of a struct to be a problem considering that the string inside are reference types.

Edit: I want my HashSet to never contains two tuples having string with the same values. In other words, I want the string to behave like values types.

Simon T.
  • 898
  • 1
  • 9
  • 16

3 Answers3

7

The gotcha is that it will not work. If two strings are "a", they can still be different references. That case would break your implementation.

Implement Equals() and GetHashCode() properly (e.g. using the ones from the supplied strings, and take care with NULL references in your struct), and possibly IEquatable<MyTuple> to make it even nicer.

Edit: The default implementation is explicitly not suitable to be used in hash tables and sets. This is clearly stated in the ValueType.GetHashCode() implementation (added emphasis):

The GetHashCode method applies to types derived from ValueType. One or more fields of the derived type is used to calculate the return value. If you call the derived type's GetHashCode method, the return value is not likely to be suitable for use as a key in a hash table. Additionally, if the value of one or more of those fields changes, the return value might become unsuitable for use as a key in a hash table. In either case, consider writing your own implementation of the GetHashCode method that more closely represents the concept of a hash code for the type.

You should always implement Equals() and GetHashCode() as "pair", and this is even more obvious since the ValueType.Equals() is terribly inefficient and unreliable (using reflection, unknown method of equality comparison). Also, there is the performance problem when not overriding those two (structs will get boxed when calling the default implementations).

Lucero
  • 59,176
  • 9
  • 122
  • 152
  • Actually, comparison of string compare values even if they are reference types. I tried it and it seems to work. – Simon T. Jan 27 '10 at 21:09
  • 1
    Uh. Note that all constant `"a"` strings in your app will have the same reference due to CLR behaviour. But try this: `string a1="a"; string a2=new string('a', 1);` put those into two instances and check if those are equal and return the same hash code with your `MyTuple` implementation. – Lucero Jan 27 '10 at 21:13
  • I just tried what you said. Each objects return a different HashCode but they are considered as identical by the HashSet since I only end up with one occurence in it!!! Those that mean HashSet doesn't use HashCode? – Simon T. Jan 27 '10 at 21:23
  • 1
    You could just use `string.Intern`. – leppie Jan 27 '10 at 21:26
  • No, it means that they seem to end up in the same bucket while the set is small and `Equals()` return true, so only one is inserted. Add some more (other) ithems to the hashset and try adding the "other" one again, it may return different results. Any way, the code is broken, don't use it. – Lucero Jan 27 '10 at 21:27
  • @leppie yeah, at the cost of filling up the intern table, which can turn out as memory leak. Bad idea. – Lucero Jan 27 '10 at 21:28
  • My previous test was broken. I removed the comment. I still believe that the code "works". Maybe I shouldn't use it but I'm not convinced yet that the results will be inconsistent. The reason for that is that two identical string with not being the same object still return the same value from `GetHashCode`: " If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code." http://msdn.microsoft.com/en-us/library/system.string.gethashcode.aspx – Simon T. Jan 27 '10 at 22:19
  • If I use a class that implement `IEqualityComparer`, implementing `Equals` is quite easy (I call `string.Equals` multiple times) but I'm not too sure about a proper implementation of `GetHashCode`. All I want is a simple tuple concept... The problem with `ValueType.GetHashCode()` seems to concern mutable value types. From the documentation, it just seems that `ValueType` implementation of `IEqualityComparer` just sucks and we shouldn't rely on it. That's a shame. – Simon T. Jan 27 '10 at 22:19
  • Yeah, both `Equals` and `GetHashCode` from `ValueType` suck. Use the `^` (XOR) operator to queue the string `GetHashCode` calls: `return ValA.GetHashCode()^ValB.GetHashCode()^ValC.GetHashCode();` (note: add null checks!) – Lucero Jan 27 '10 at 22:22
  • I'm pondering with the idea of implementing some kind of generic tuple awaiting .NET 4.0. Before doing that, maybe I could simply use a `HashTable>` or a `HashTable` but can I trust the `List` and array implementations of `Equals` and GetHashCode`? – Simon T. Jan 28 '10 at 03:46
  • @Lucero: how many strings would you need to get an OOM error? :) It's the old age speed vs space debate. – leppie Jan 28 '10 at 04:32
  • @Leppie: it just doesn't matter how many. Alone the fact that strings - for instance generated from a random identifier such as a GUID - are kept in memory is a memory leak which will lead to performance degradation and eventually break every long-running application. There's no reason to do something like this instead of properly implementing two simple methods, period. – Lucero Jan 28 '10 at 08:37
  • @Simon T: The list's and array's implementations of `Equal` and `GetHashCode` do not rely on the contents, which is a good thing. The hash code *must not* change between subsequent calls, otherwise all hash-based algorithms and data structures will fail! Therefore, the hash code must be based on an immutable part of the object. – Lucero Jan 28 '10 at 08:40
  • @Lucero: Why would you not just use a Guid as a struct member then? It all depends on the scenario and what you want, speed or space... – leppie Jan 28 '10 at 09:09
  • @Leppie: The tuple class may contain any string, therefore it is not a good idea to put those somewhere where they are unreachable for the GC to collect. Also, this is not at all a question of space and speed; the default ValueType Equals is very slow (uses reflection) and therefore overriding it is faster anyways. – Lucero Jan 28 '10 at 10:31
  • @Lucero: I'm not sure I get the memory issue. Is that related to how `ValueType` keep reference of reference on the stack? Does that mean it's unsafe to put any kind of reference type in a `ValueType`? Is there any good book to learn about those obscure gotchas? – Simon T. Jan 28 '10 at 14:15
  • @Simon: Don't worry about memory. Leppie suggested to intern the string, which is kind of a "singleton table" for strings, so that each string with the same contents also shares the same reference. This then allows comparing references to compare strings. However, strings in this table will not be collected by the GC, thus staying in memory even if the tuple instance is long gone. Over time, those instances could sum up if tuples with changing strings keep being used. This is a classical memory leak in a managed environment. – Lucero Jan 28 '10 at 14:44
3

Your approach should work, but you should make the string values read-only, as Lucero said.

You could also take a look at the new .NET 4.0 Tuple types. Although they are implemented as classes (because of supporting up to quite many parameters), they implement the new interface IStructuralEquatable which is intended exactly for your purpose.

herzmeister
  • 11,101
  • 2
  • 41
  • 51
  • I will have a look at .NET 4.0 tuple but I'm on .NET 3.5 now. – Simon T. Jan 27 '10 at 21:25
  • @herzmeister, good hint with `IStructuralEquatable` - but the approach from Simon does not work. You may want to revise your answer. – Lucero Jan 27 '10 at 21:50
  • 'aight, it's been a while since i created my own structs. I implemented `.Equals(...)`, `IEquatable.Equals(...)` and `.GetHashCode()` back then but some time in the meantime I think I read something that I got the impression I was over-engineering. But reading that MSDN part about value types again that you dug up confirms that it _was_ a good idea. ;-) – herzmeister Jan 27 '10 at 22:29
0

It depends on what you mean by "same values". If you want the strings themselves to be unequal to one another—rather than just the references—then you're going to have to write your own Equals(MyTuple) method.

Jeffrey L Whitledge
  • 58,241
  • 9
  • 71
  • 99
  • I don't want to end up with two tuples with strings having the same values in my HashSet. Aren't string compared by their values instead of reference? Why I need to implement ´Equals´ then? – Simon T. Jan 27 '10 at 21:13
  • Because that equals-magic will only happen if you write something like `if (a1==a2)` and not if they are embedded in structs, which compare the memory contents AFAIR. See also http://stackoverflow.com/questions/1502451/c-what-needs-to-be-overriden-in-a-struct-to-ensure-equality-operates-properly – Lucero Jan 27 '10 at 21:17
  • Because the default equality comparer for structs compares the bytes not the members, and it doesn't know what those references are referring to. It doesn't even know they're strings. The reason your tests are working right now is probably because you're using string literals which are interned in the executable file. Try generating identical strings in the `System.String(char[])` constructor and see what happens. – Jeffrey L Whitledge Jan 27 '10 at 21:20
  • Ah, ValueType.Equals is a mean beast. According to the MSDN docs it "uses reflection to compare the corresponding fields of obj and this instance. Override the Equals method for a particular type to improve the performance of the method and more closely represent the concept of equality for the type." (see http://msdn.microsoft.com/en-us/library/2dts52z7(VS.85).aspx ) but what exacly is compared is not mentioned (`ReferenceEquals()` or `Equals()` or whatever), so do not rely on this. Last but not least because of performance! – Lucero Jan 27 '10 at 21:24
  • Oh, I stand corrected. That information isn't in the C# language definition, which is what I was checking. – Jeffrey L Whitledge Jan 27 '10 at 21:28
  • @Jeffrey, so was I, but I then did some more research. Anyways, `GetHashCode()` seems to work in a different way than `Equals()` which is making things even worse when not overriding both members, see the ValueType.GetHashCode docs: "One or more fields of the derived type is used to calculate the return value. If you call the derived type's GetHashCode method, the return value is not likely to be suitable for use as a key in a hash table. ... In either case, consider writing your own implementation of the GetHashCode method that more closely represents the concept of a hash code for the type." – Lucero Jan 27 '10 at 21:29
  • @Lucero - Since you did the homework, write up the correct answer and I'll vote you up. I guess I learn something new every day! – Jeffrey L Whitledge Jan 27 '10 at 21:34
  • @Jeffrey: I edited my answer to reflect those details. Thanks! – Lucero Jan 27 '10 at 21:39