Need help understanding unexpected behavior using LINQ Join with HashSet

Question

I encountered some odd behavior using C# HastSet with LINQ's Join method that I don't understand. I've simplified what I am doing to help focus on the behavior I am seeing.

I have the following:

 private HashSet<MyClass> _mySet; // module level

 IEnumerable<ISearchKey> searchKeys; // parameter.
 // Partial key searches are allowed.

 private IEqualityComparer<ICoreKey> _coreKeyComparer; // Module level.
 // Compares instances of MyClass and ISearchKey to determine 
 // if they match.

Given that

There is a 1-to-many relationship between searchKeys and _mySet.
MyClass implements interface IPartialKey and ICoreKey.
ISearchKey inherits from IPartialKey and ICoreKey.
MyClass and ISearchKey instance both override the GetHashCode method.
MyClass's hash code value is based on its full key value, which includes its ICoreKey and IPartialKey values plus other fields.
The full keys used by MyClass are not unique. Two different MyClass instances can have the same hash code.
ISearchKey's hash code value is based only on its ICoreKey and IPartialKey values. i.e. The ISearchKey hash code might differ from the hash code for a matching MyClass instance. (Side note: in the case where I first encountered the problem, the ISearchKey's IPartialKey values match the MyClass full key, so the GetHashCode methods would return the same value for both ISearchKey and MyClass. I included the additional complexity to better illustrate the underlying logic on what I am doing.)
The _coreKeyComparer.GetHashCode method returns the same value for matching instances of ISearchKey and MyClass using only their ICoreKey values.
The _coreKeyComparer.Equals method cast the parameters to MyClass and ISearchKey respectively and returns true if their IPartialKey values match. (Side note: _coreKeyComparer has been HEAVILY tested and works correctly.)

I expected a join between the two collections should result in something like:

{searchKey_a, myClass_a1},
{searchKey_a, myClass_a2},
{searchKey_a, myClass_a3},
{searchKey_b, myClass_b1},
{searchKey_b, myClass_b2},
{searchKey_c, myClass_c1},
{searchKey_c, myClass_c2},
{searchKey_c, myClass_c3},
{searchKey_c, myClass_c4},
etc....

i.e The same ISearchKey instance would occur multiple times, once for each matching MyClass instance it is joined to.

But when I do a join from searchKeys to _mySet:

        var matchedPairs = searchKeys
          .Join(
            _mySet,
            searchKey => searchKey,
            myClass => myClass,
            (searchKey, myClass) => new {searchKey, myClass},
            _coreKeyComparer)
            .ToList();

I only get one MyClass instance per searchKeyClass instance. i.e. The matchedPairs collection looks like:

    {searchKey_a, myClass_a1},
    {searchKey_b, myClass_b1},
    {searchKey_c, myClass_c1},
etc....

However if I reverse the join, go from _mySet to searchKeys:

   var matchedPairs = _mySet
          .Join(
            searchKeys,
            myClass => myClass,
            searchKey => searchKey,
            (myClass, searchKey) => new {searchKey, myClass},
            _coreKeyComparer)
            .ToList();

I get the correct matchedPairs collection. All the matching records from _mySet are returned along with the searchKey which they matched against.

I checked the documentation and examined multiple examples and don't see any reason why the searchKeys-to-_mySet Join gives the wrong answer, while the _mySet-to-searchKeys gives the correct/different answer.

(Side note: I also tried GroupJoin from searchKeys to _myset and go similiar results. i.e. Each searchKeyClass instance found at most one result from _mySet.)

Either I don't understand how the Join method is supposed to work, or Join works differently with HashSet than it does with List or other type of collection.

If the former, I need clarification so I don't make mistakes using Join in the future.

If the latter, then is this differing behavior a .Net bug, or is this the correct behavior with HashSet?

Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.

Just to be clear, I've already fixed my code so it return the correct results, I just want to understand why I got incorrect results initially.

I'm not familiar with different behavior using `Join` with any different type of enumerable, `List`, `HashSet`, `T[]`, etc should all behave the same. — Matthew, Jan 03 '19 at 16:27
I believe your comparer violates the requirements for a comparer: [In more formal terms, GetHashCode must always return the same value for two objects that compare equal](https://stackoverflow.com/a/22453152/2557128). — NetMage, Jan 03 '19 at 17:06
@NetMage: The comparer DOES return the same hash code value for matching values. That is item 8 in my Givens list. — RB Davidson, Jan 03 '19 at 17:13
Without seeing more of your code, and ideally a *small, simple, complete example that I can actually run on my machine*, it is hard to know what you are doing wrong. Your statement "Either I don't understand Join or Join is broken" is likely incorrect; probably there is a bug in your comparison logic. This attitude that the bug is not in your crazy code but is in well-tested, simple framework code has a name, in fact it is called "select isn't broken". — Eric Lippert, Jan 03 '19 at 18:26
@RBDavidson That isn't what I was stating: will `GetHashCode` always return the same value for values that are `Equal`, given they are based on different fields? — NetMage, Jan 03 '19 at 18:57
@EricLippert I don't have an attitude on this, nor did I claim there wasn't a problem in my code. I asked for help in understanding what the problem is. I never said Join is broken, but it is working differently with HashSet than I've seen with other collection types. I *asked* if the difference is defined behavior or an unintended bug. .Net is tested, but does have bugs, which is why MS has a bug reporting site. I'm not convinced this is a bug, but I don't understand the behavior. Inner joins should return the same set regardless of direction, though the order of results may be differ. — RB Davidson, Jan 03 '19 at 19:09
So far you've stated "my comparison code is correct", "either I don't understand how join works, or Join works differently with hash sets, or .NET has a bug". Well, you understand how joins work, Join does not work differently with hash sets, and .NET does not have a bug, so what you have here is a collection of statements which contradict each other, and one of them has to be wrong. My money is on "my comparison code is correct" being the false statement. — Eric Lippert, Jan 03 '19 at 19:15
@NetMage Individual instances of MyClass *may* return a different hash code than a matching ISearchKey (in the case where I first encountered this problem they both returned the same hash code) but in the join I specify an IEqualityComparer whose GetHashCode() **WILL** return the same value for matching MyClass and ISearchKey instances, even if their native hash codes do not match. In simple terms, ICoreKey is a set of values in MyClass and ISearchKey which MUST match before there is a chance for equality. If the core values match then the Equals does a partial key match. — RB Davidson, Jan 03 '19 at 19:23
@EricLippert The IEqualityComparer is correct. I have HEAVILY tested it. That is what I thought was the problem to begin with and it was only after a lot of testing that I examined the join itself. If the IEqualityComparer is wrong, wouldn't i get the wrong result whether I used searchKeys.Join( _mySet...) or _mySet.Join(searchKeys...)? I don't. searchKeys.Join( _mySet...) returns 1 MyClass per searchKey, _mySet.Join(searchKeys...) returns ALL the MyClass that match the searchKeys. I've done a LOT of joins and normally A.join(B) == B.Join(A), but not with hash sets. — RB Davidson, Jan 03 '19 at 19:30
Well then, my advice continues to be: **create the minimal code that reproduces the defect**. You should be able to get this down to a few dozen lines of extremely clear code that we all can run on our machines. When you do so, either you'll find your mistake, or you will have a program that someone else can find the mistake, or, you'll have a clear repro for your bug report to Microsoft. — Eric Lippert, Jan 03 '19 at 19:35
Also, use science. You have a hypothesis: joins are wrong with hash sets. OK, so start trying stuff. **Make a copy of the hash set into a list and run the test again**. Did you get a different result? That's evidence for your hypothesis. But if you got the same result then your hypothesis is false and should be abandoned. — Eric Lippert, Jan 03 '19 at 19:36
@EricLippert You keep claiming I said joins are broken. I *NEVER* made that claim, and I ask you to stop saying I did. I *observed* different behavior when joining a List to a HashSet than I've seen when joining other collection types (ex: List to List). The join isn't symmetrical, the order determines whether you get 9 records back or 215 (actual results). That is odd for an *inner* join. Maybe that's defined behavior for HashSet with collisions on the hash codes. If so, I'm cool with that, but there's nothing in the documentation about that. Which is why I asked my questions. — RB Davidson, Jan 03 '19 at 19:59
It's not the expected behaviour. That would be seriously broken. **It is not observed to happen in correct code.** That sort of thing is observed to happen in code where the comparison logic violates the rules of equality. Again: make a small reproducer that clearly demonstrates the behaviour, and you'll find your bug. — Eric Lippert, Jan 03 '19 at 20:07
Another way you might try to find your bug is to attack the equality problem directly. For example: make your comparison mechanism accumulate a list of every input it receives. When that list gets to be, say, 1000 items, then go into a verification mode. For every item in the list, verify that A = A is true. For every pair of items in the list, verify that A = B is the same as B = A. For all of those where it is true, check whether A = C matches B = C for every C. Worst case, that's only a billion checks. Then reset and accumulate another 1000. — Eric Lippert, Jan 03 '19 at 23:14

Eric Lippert · Accepted Answer · 2019-01-03T18:56:29.643

Your bug is almost certainly somewhere in the vast amount of code you did not show in the question. My advice is that you simplify your program down to the simplest possible program that produces the bug. In doing so, either you will find your bug, or you will produce a program that is so simple that you can post all of it in your question and then we can analyze it.

Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.

Since I do not know what the unexpected behaviour is, I cannot say why it happens. I can however say precisely what Join does, and perhaps that will help.

Join takes the following:

An "outer" collection -- the receiver of the Join.
An "inner" collection -- the first argument of the extension method
Two key extractors, that extract a key from the outer and inner collections
A projection, that takes a member of the inner and outer collections whose keys match, and produces the result for that match
A comparison operation that compares two keys for equality.

Here's how Join works. (This is logically what happens; the actual implementation details are somewhat optimized.)

First, we iterate over the "inner" collection, exactly once.

For each element of the inner collection, we extract its key, and we form a multi-dictionary that maps from the key to the set of all elements in the inner collection where the key selector produced that key. Keys are compared for equality using the supplied comparison.

Thus, we now have a lookup from TKey to IEnumerable<TInner>.

Second, we iterate over the "outer" collection, exactly once.

For each element of the outer collection, we extract its key, and do a lookup in the multi-dictionary for that key, again, using the supplied key comparison.

We then do a nested loop on each matching element of the inner collection, call the projection on the outer/inner pair, and yield the result.

That is, Join behaves like this pseudocode implementation:

static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>
  (IEnumerable<TOuter> outer, 
  IEnumerable<TInner> inner, 
  Func<TOuter, TKey> outerKeySelector, 
  Func<TInner, TKey> innerKeySelector, 
  Func<TOuter, TInner, TResult> resultSelector, 
  IEqualityComparer<TKey> comparer) 
{
  var lookup = new SomeMultiDictionary<TKey, TInner>(comparer);
  foreach(TInner innerItem in inner)
  {
    TKey innerKey = innerKeySelector(innerItem);
    lookup.Add(innerItem, innerKey);
  }
  foreach (TOuter outerItem in outer) 
  {
    TKey outerKey = outerKeySelector(outerItem);
    foreach(TInner innerItem in lookup[outerKey])
    {
      TResult result = resultSelector(outerItem, innerItem);
      yield return result;
    }
  }
}

Some suggestions:

Replace all your GetHashCode implementations so that they return 0, and run all your tests. They should pass! It is always legal to return zero from GetHashCode. Doing so will almost certainly wreck your performance, but it must not wreck your correctness. If you are in a situation where you require a particular non-zero value of GetHashCode, then you have a bug.
Test your key comparison to ensure that it is a valid comparison. It must obey the three rules of equality: (1) reflexivity: a thing always equals itself, (2) symmetry: the equality of A and B must be the same as B and A, (3) transitivity: if A equals B and B equals C then A must equal C. If these rules are not met then Join can behave weirdly.
Replace your Join with a SelectMany and a Where. That is:

from o in outer join i in inner on getOuterKey(o) equals getInnerKey(i) select getResult(o, i)

can be rewritten as

from o in outer
from i in inner
where keyEquality(getOuterKey(o), getInnerKey(i))
select getResult(o, i)

That query is slower than the join version, but it is logically exactly the same. Again, run your tests. Do you get the same result? If not, you have a bug somewhere in your logic.

Again, I cannot emphasize strongly enough that your attitude that "Join is probably broken when given a hash table" is what is holding you back from finding your bug. Join isn't broken. That code hasn't changed in a decade, it is very simple, and it was right when we wrote it the first time. Much more likely is that your complicated and arcane key comparison logic is broken somewhere.

Thanks for this amazing clear answer Eric -- it shows how my answer was wrong since I assumed you did not iterate over the inner selector. I do have one question -- HashSet (at https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.hashset-1?view=netframework-4.7.2) says HashSet cannot contain duplicate elements. Are duplicate elements determined by equity operator? Could that have an effect? Or an effect in the `lookup[outerKey]` in your code? — Hogan, Jan 03 '19 at 19:02
@Hogan: We *require* that if two elements are *equal* then their hash codes must be equal. A hash set detects an attempt to add a duplicate by first hashing the proposed new element and checking to see if there are any elements already in the hash set that have that hash code. If the answer is "no", then there are no duplicates because unequal hashes imply "not equal". If there are hash collisions then each possible collision is checked for equality with the proposed new item, and only if none are equal is the new item added. — Eric Lippert, Jan 03 '19 at 19:09
@Hogan: The lookup does something similar to the keys and values. When you add a new [key, value] pair to the lookup, the first thing it does is hashes the new key and checks to see if the lookup key set already contains a key with that hash. If so, then it checks for equality; that's how it knows which value set to insert the new value into. Then a similar process happens on the values. — Eric Lippert, Jan 03 '19 at 19:11
@EricLippert So the multi-dictionary maps from key to unique values? The `IEnumerable` won't have duplicates? I don't see where `Grouping.Add` does that? — NetMage, Jan 03 '19 at 19:37
@NetMage: Oh, good catch; that's a really good question. You know, I just recently wrote my own multidictionary that had that property and so it was on my mind. Offhand I do not recall what the built-in lookup does but I believe it allows duplicates in the values set. — Eric Lippert, Jan 03 '19 at 19:39
@NetMage: I checked. It allows duplicates in the values, as you'd expect. — Eric Lippert, Jan 03 '19 at 19:44
@EricLippert, I now realize I didn't read your initial answer closely enough. The bug was in my IEqualityComparer, which I wrongly insisted was correct. The need for partial matches caused transitivity to fail, which meant the initial the loop over the inner hashset returned false when it should have returned true. Rule 1: Never swear the problem isn't in a particular chunk of code, because it will always be in that code. I couldn't see a way to fix the comparer to handle partial matches against a set of searchkeys, so I refactored code to avoid join. Thanks for your help. — RB Davidson, Jan 07 '19 at 19:06
@RBDavidson: You're welcome. "Select isn't broken" can be a hard lesson to learn, and I've re-learned it many times. (And in my case it is exacerbated because I cannot rely on "the compiler probably isn't broken"! :-) ) — Eric Lippert, Jan 07 '19 at 19:15
@EricLippert Regarding my claim that "Either I don't understand how the Join method is supposed to work or...." Well, I *didn't* understand how the Join method worked. I assumed Join only compared items in the outer collection against those in the inner. I didn't know it looped over the inner collection first, comparing those elements against each other. When I confirmed the behavior matched your explanation, I realized there would be cases where ISearchKey would equal two different MyClass instances which would not equal each other. Thanks for enlightening me on the Join method's logic. — RB Davidson, Jan 07 '19 at 19:21
@RBDavidson: If you don't build an inner-key-lookup table then the join operation is `O(inner * outer)` in the sizes of the inner and outer sequences. If you do build the key lookup then the join operation is `O(inner + outer)` which is almost always much better. But making that optimization requires that the key set be partitioned into equivalence sets, which requires an equivalence relation, which requires transitivity. — Eric Lippert, Jan 07 '19 at 19:25
@RBDavidson: Further reading: If the subject of "ways people implement comparisons wrong" interests you, see https://ericlippert.com/2011/01/20/bad-comparisons-part-one/. If the subject "ways people implement GetHashCode wrong" interests you, see https://blogs.msdn.microsoft.com/ericlippert/2011/02/28/guidelines-and-rules-for-gethashcode/ — Eric Lippert, Jan 07 '19 at 19:33

Need help understanding unexpected behavior using LINQ Join with HashSet

Given that

I expected a join between the two collections should result in something like:

But when I do a join from searchKeys to _mySet:

However if I reverse the join, go from _mySet to searchKeys:

1 Answers1

Linked