Linq Distinct not being very Distinct

Question

Is this linq statement correct for getting unique values of 'Name' from my newSpecs collection?

distinctSpecialties is showing duplicates.

IEnumerable<SpecialtyData> distinctSpecialties = newSpecs
    .Select(s => s).Distinct()
    .OrderBy(s => s.Name);

How is equality defined for `SpecialtyData`? – Austin Salonen Dec 10 '13 at 17:45 — Austin Salonen, Dec 10 '13 at 17:45

D Stanley · Answer 1 · 2013-12-10T17:47:58.640

4

You need to define "distinct" - by default reference types are distinct if they are separate instances (ReferenceEquals). You can override the default by either:

Implementing IEquatable<SpecialtyData> in SpecialtyData
Overriding Equals and GetHashCode in SpecialtyData
defining an IEqualityComparer<SpecialtyData> and passing that to Distinct

A simpler way to get the "first" distinct item by name is to use GroupBy:

IEnumerable<SpecialtyData> distinctSpecialties = newSpecs.GroupBy(s => s.Name)
                                                         .Select(g => g.First());

edited Dec 10 '13 at 17:47

answered Dec 10 '13 at 17:46

D Stanley

149,601
11
178
240

Excellent answer. Distinct’s default behavior is a common source of confusion and it’s important to know how to override it. – Jason Enochs Dec 10 '13 at 18:16

score 4 · Accepted Answer · edited Dec 10 '13 at 17:52

You're asking about getting unique values of 'Name' from my newSpecs collection, but your code returns IEnumerable<SpecialtyData>. Something is wrong: either the question or the return type.

If you need only names, you need IEnumerable<string> instead:

IEnumerable<string> distinctNames = newSpecs
    .Select(s => s.Name)
    .Distinct().OrderBy(x => x);

If you need entire SpecialtyData instances, you can either use GroupBy+First:

IEnumerable<SpecialtyData> distinctSpecialties = newSpecs
    .GroupBy(s => s.Name)
    .Select(g=> g.First())

or override Equals and GetHashCode methods within SpecialtyData class to make Distinct work.

evanmcdonnal · Answer 3 · 2013-12-10T17:50:59.650

That's because the default comparison is a reference comparison. It's not looking at the data itself, it's giving you distinct references. Either use your Select() to get some value worth comparing or implement IEqualityComparer<T> for your type and pass it to Distinct.

Below is a sample implementation of IEqualityComparer<T>;

class SpecialtyDataEqualityComparer : IEqualityComparer<SpecialtyData>
{
    public bool Equals(SpecialtyData lhs, SpecialtyData rhs)
    {
        return lhs.Name == rhs.Name;
    }

    public int GetHashCode(SpecialtyData p)
    {
        return p.Name.GetHashCode();
    }
}

If you did .Distinct(new SpecialtyDataEqualitComparer()) then you would get items with distinct names. If you want some other definition of equality then change the logic in Equals to do the comparisons you require (make related changes to GetHashCode so they are consistent, if you do equality by name you shouldn't take the hashcode for some other property).

score 2 · Answer 4 · answered Dec 10 '13 at 17:58

2

Instead of doing custom equality comparers and all that, there is a really nice extension method from MoreLinq - https://code.google.com/p/morelinq/ (by Jon Skeet).

Either download the whole library of MoreLinq, or just add this code below:

public static IEnumerable<TSource> DistinctBy<TSource, TKey>(
                            this IEnumerable<TSource> source, Func<TSource, TKey> selector)
{
    var set = new HashSet<TKey>();
    return source.Where(element => set.Add(selector(element)));
}

And use it like so:

IEnumerable<SpecialtyData> distinctSpecialties = newSpecs
    .Select(s => s).DistinctBy(s => s.Name)
    .OrderBy(s => s.Name);

answered Dec 10 '13 at 17:58

EkoostikMartin

6,831
2
33
62

@RobertHarvey - I use this A LOT, get very solid performance and don't have to write equality comparers for the 599323 entities I need to run `Distinct` on. – EkoostikMartin Dec 10 '13 at 18:02
Tbh - i dont think you even need the select part. Just .DistinctBy. Just a thought tho, should the Func<> be Expression> to leverage server side expression tree?. – jim tollan Dec 10 '13 at 18:32
Oh and make source IQueryable of course :) (oh and +1 into the bargain!) – jim tollan Dec 10 '13 at 18:42

score 1 · Answer 5 · answered Dec 10 '13 at 17:46

1

You can use GroupBy as below

IEnumerable<SpecialtyData> distinctSpecialties = 
               newSpecs.GroupBy(s => s.Name).Select(g=> g.First()).OrderBy(s => s.Name);

answered Dec 10 '13 at 17:46

Damith

62,401
13
102
153

score 0 · Answer 6 · answered Dec 10 '13 at 17:47

In order to use Distinct() on a custom object (SpecialtyData), you will need to override the two methods Distinct() uses to determine equality.

    public override bool Equals(object other) {

    }

    public override int GetHashCode() {

    }

Each of these two should return values that define the SpecialtyData objects as unique. In your case it may be as simple as evaluating the 'Name' property.

score 0 · Answer 7 · edited May 23 '17 at 12:29

0

There's a very nice post from James Michael Hare that explains why you're getting this results, and I'll quote it here:

For Distinct() (and many other LINQ features) to work, the class being compared (SpecialtyData in your example) must implement Equals() and GetHashCode(), or alternatively provide a separate IEqualityComparer<T> as an argument to Distinct().

Many LINQ methods take advantage of GetHashCode() for performance because internally they will use things like a Set<T> to hold the unique items, which uses hashing for O(1) lookups. Also, GetHashCode() can quickly tell you if two objects may be equivalent and which ones are definitely not - as long as GetHashCode() is properly implemented of course.

So you should make all your classes you intend to compare in LINQ implement Equals() and GetHashCode() for completeness, or create a separate IEqualityComparer<T> implementation.

Original post: This code returns distinct values. However, what I want is to return a strongly typed collection as opposed to an anonymous type

edited May 23 '17 at 12:29

Community

1
1

answered Dec 10 '13 at 17:48

OnoSendai

3,960
2
22
46

2

Yeah, so why not mark this as a duplicate...? – BartoszKP Dec 10 '13 at 17:49
If you see a duplicate question you should flag/vote to close as a duplicate. Copying an entire answer is not appropriate. – Servy Dec 10 '13 at 17:50
@BartoszKP: Unfortunately, that question is not a duplicate. Although it has the correct answer for this question, the original question is a very specific problem that doesn't much resemble this one. – Robert Harvey Dec 10 '13 at 17:51
@RobertHarvey Indeed, perhaps it's rather the other question should be marked as a duplicate of this one then? – BartoszKP Dec 10 '13 at 17:54
@BartoszKP: That's not true either. – Robert Harvey Dec 10 '13 at 17:55
If there was a way to allow for direct inline mentions of specific (but not exactly duplicated) answers, I'd gladly do so, @Servy. Or, is there one and I'm not aware of? In any case, I appreciate any comments on best practices, so thank you all. – OnoSendai Dec 10 '13 at 17:59
@OnoSendai: You can post a link to the answer below the question. – Robert Harvey Dec 10 '13 at 17:59
1

To add to Robert's comment, you should generally prefix the link with something, i.e. "related" or "possible duplicate" or whatever, rather than just a bare link. If a question *isn't* a duplicate, then it's a sign that the answer of the other question, by itself, probably shouldn't be the answer to this question. Even if it has most of the information; if the questions aren't dups there's a *very* high probability that it means the answer should be tailored to the new question. – Servy Dec 10 '13 at 18:00
@RobertHarvey, will do so from now on, and with that prefix, Servy. Thanks once again. – OnoSendai Dec 10 '13 at 18:02

Linq Distinct not being very Distinct

7 Answers7