105

I want to remove duplicates from list, without changing order of unique elements in the list.

Jon Skeet & others have suggested to use the following:

list = list.Distinct().ToList();

Reference:

Is it guaranteed that the order of unique elements would be same as before? If yes, please give a reference that confirms this as I couldn't find anything on it in documentation.

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
Nitesh
  • 2,681
  • 4
  • 27
  • 45
  • 7
    @ColonelPanic - official documentation here https://msdn.microsoft.com/en-us/library/bb348436(v=vs.110).aspx states explicitly "Distinct() method returns an **unordered** sequence that contains no duplicate values". – Evk Nov 22 '17 at 20:51
  • @Evk 'Unordered sequence' is not same as 'original ordering of sequence'. – Nitesh Nov 23 '17 at 11:19
  • 4
    I consider "unoredered" to mean "in no particular order", which also implies "not necessary in original order of sequence". – Evk Nov 26 '17 at 09:55
  • I just had a problem regarding distinct with oracle12 Entity Framework 6. In my case I had orderby before disinct in my linq clause and the order was gone. select().OrderBy().Distinct().ToList() did not work while select().OrderBy().Distinct().ToList() worked. – Karl Apr 05 '18 at 06:36
  • 5
    @Karl, these expressions are the same. :) – pvgoran Sep 24 '18 at 04:38

7 Answers7

86

It's not guaranteed, but it's the most obvious implementation. It would be hard to implement in a streaming manner (i.e. such that it returned results as soon as it could, having read as little as it could) without returning them in order.

You might want to read my blog post on the Edulinq implementation of Distinct().

Note that even if this were guaranteed for LINQ to Objects (which personally I think it should be) that wouldn't mean anything for other LINQ providers such as LINQ to SQL.

The level of guarantees provided within LINQ to Objects is a little inconsistent sometimes, IMO. Some optimizations are documented, others not. Heck, some of the documentation is flat out wrong.

Neil Vass
  • 5,251
  • 2
  • 22
  • 25
Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • I am accepting it because 1) It clearly answers my concern whether its guaranteed or not 2) The linked post delves deeper into undocumented aspects of Distinct 3) The linked post also has a sample implementation that can be used as reference to implement a Distinct on Lists with that guarantee. – Nitesh Jan 19 '11 at 13:28
  • I note that if an input collection is already known to be sorted then `.Distinct()` could operate in a streaming manner (by simply skipping the current value if it's the same as the last value). This approach could also work for any finite, in-memory `IReadOnlyCollection` by iterating through it once to check if it is already sorted. don't believe that `.Distinct()` checks for `IOrderedEnumerable`, grrr. Despite Linq having been around for 13 years now, I'm surprised it isn't smarter about applying these kinds of optimizations. – Dai Mar 05 '21 at 10:57
  • It would be nice if `List` had a `bool IsSorted` property which is `true` after `List.Sort()` runs and turns `false` if any reordering happens - that way `.Distinct()` could be optimized for that. Oh well.... – Dai Mar 05 '21 at 10:59
  • 1
    @Dai: That sounds generally infeasible - just because something *was* sorted doesn't mean that the data inside the relevant objects is the same as it was at the time of sorting. – Jon Skeet Mar 05 '21 at 11:02
  • @JonSkeet Sorry, I've been working with immutable types a lot lately, I forgot about how mutability of data means we can't have nice things. Though it could still work for record-types in C# 9.0. – Dai Mar 05 '21 at 11:04
  • @Dai: For very specific situations, that's fine - I just wouldn't want or expect it to be part of `List` – Jon Skeet Mar 05 '21 at 11:18
  • Well, if the comparisons are made with a hashset (makes sense for avoiding iterating the unique list over and over), then returning the hashset values should return an unordered sequence. – Daniel Möller Jun 26 '21 at 01:26
31

In the .NET Framework 3.5, disassembling the CIL of the Linq-to-Objects implementation of Distinct() shows that the order of elements is preserved - however this is not documented behavior.

I did a little investigation with Reflector. After disassembling System.Core.dll, Version=3.5.0.0 you can see that Distinct() is an extension method, which looks like this:

public static class Emunmerable
{
    public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null)
            throw new ArgumentNullException("source");

        return DistinctIterator<TSource>(source, null);
    }
}

So, interesting here is DistinctIterator, which implements IEnumerable and IEnumerator. Here is simplified (goto and lables removed) implementation of this IEnumerator:

private sealed class DistinctIterator<TSource> : IEnumerable<TSource>, IEnumerable, IEnumerator<TSource>, IEnumerator, IDisposable
{
    private bool _enumeratingStarted;
    private IEnumerator<TSource> _sourceListEnumerator;
    public IEnumerable<TSource> _source;
    private HashSet<TSource> _hashSet;    
    private TSource _current;

    private bool MoveNext()
    {
        if (!_enumeratingStarted)
        {
            _sourceListEnumerator = _source.GetEnumerator();
            _hashSet = new HashSet<TSource>();
            _enumeratingStarted = true;
        }

        while(_sourceListEnumerator.MoveNext())
        {
            TSource element = _sourceListEnumerator.Current;

             if (!_hashSet.Add(element))
                 continue;

             _current = element;
             return true;
        }

        return false;
    }

    void IEnumerator.Reset()
    {
        throw new NotSupportedException();
    }

    TSource IEnumerator<TSource>.Current
    {
        get { return _current; }
    }

    object IEnumerator.Current
    {        
        get { return _current; }
    }
}

As you can see - enumerating goes in order provided by source enumerable (list, on which we are calling Distinct). Hashset is used only for determining whether we already returned such element or not. If not, we are returning it, else - continue enumerating on source.

So, it is guaranteed, that Distinct() will return elements exactly in same order, which are provided by collection to which Distinct was applied.

Dai
  • 141,631
  • 28
  • 261
  • 374
Sergey Berezovskiy
  • 232,247
  • 41
  • 429
  • 459
  • 9
    Is it a well documented behavior? – abatishchev Jan 19 '11 at 11:49
  • 3
    I mean, could you please give a link or reference that confirms this? The documentation doesn't say anything about it. – Nitesh Jan 19 '11 at 11:49
  • Yes, sure - see the link above – Sergey Berezovskiy Jan 19 '11 at 11:53
  • 5
    The linked answer contains a reference to documentation that says: "The result sequence is unordered." – mgronber Jan 19 '11 at 12:22
  • There's more than one way to interpret the docs. Of course it cannot guarantee an ordered result, it doesn't control the ordering of the source iterator. Assuming a *sane* implementation from the team that worked on Linq is commonly a good guess. Finding out some day that the guess was wrong is just one of thousands of things that can go wrong when the runtime environment for your program is no longer the one you tested against. But go ahead, re-implement the sane implementation if you like to be sure, lazy showed how. – Hans Passant Jan 19 '11 at 13:21
  • Guys, what docs are you talking about - it's disassembled System.Core.dll, Version=3.5.0.0 BTW no tags like Mono, or Silverlight or whatever. Just c#, so we are taliking about common implementation. – Sergey Berezovskiy Jan 19 '11 at 13:30
  • 6
    @lazyberezovsky: The question asks about *guarantees*, not *common implementation*. (As I already said, I'd be surprised if the implementation ever does change across platforms/versions, but that doesn't amount to a guarantee.) – LukeH Jan 19 '11 at 13:36
  • Ok, give me an example of waht is *guaranteed* in .Net Framework. Or we should just answer 'nothing is guaraneed', because we have Mono and everything could change in next service pack? All I said above is following: in version 3.5.0.0 it is guaranteed that distinct keeps order of source enumerable. – Sergey Berezovskiy Jan 19 '11 at 13:40
  • 5
    @lazyberezovsky: I am from C\C++ where a lot of things are undefined and its very common to ask for if something is guaranteed. Also I am using Distinct() in a Silverlight application, which is on both Mac and Windows thats why we cannot settle on 'common implementation' it must be guaranteed. – Nitesh Jan 19 '11 at 13:55
  • 45
    @lazyberezovsky: When people talk about guarantees, they normally mean **documented behaviour** which is reasonable to rely upon. For example, the docs for GroupBy *do* specify behaviour, but the docs for Distinct *don't*. – Jon Skeet Jan 19 '11 at 13:57
15

According to the documentation the sequence is unordered.

mgronber
  • 3,399
  • 16
  • 20
  • 3
    Additional info to find it: In the link, refer to the "Remarks" section. "The result sequence is unordered." – Curtis Yallop May 06 '14 at 23:46
  • Link refers to Enumerable.Distinct rather than the .Distinct extension method – Ryan Williams Nov 19 '21 at 01:07
  • 1
    Those are the extension methods as can be seen on the method signatures. The Enumerable class _provides a set of static methods for querying objects that implement IEnumerable_. :-) – mgronber Dec 08 '21 at 14:15
8

Yes, Enumerable.Distinct preserves order. Assuming the method to be lazy "yields distinct values are soon as they are seen", it follows automatically. Think about it.

The .NET Reference source confirms. It returns a subsequence, the first element in each equivalence class.

foreach (TSource element in source)
    if (set.Add(element)) yield return element;

The .NET Core implementation is similar.

Frustratingly, the documentation for Enumerable.Distinct is confused on this point:

The result sequence is unordered.

I can only imagine they mean "the result sequence is not sorted." You could implement Distinct by presorting then comparing each element to the previous, but this would not be lazy as defined above.

Colonel Panic
  • 132,665
  • 89
  • 401
  • 465
  • 10
    The source isn't the specification. What you found is a coincidence and could be invalid after the next update. – H H Nov 24 '17 at 13:08
  • 1
    @HenkHolterman In general, I'd agree, implementations can change. For example, .NET 4.5 [changed the sorting algorithm](https://stackoverflow.com/a/12184039/284795) behind Array.Sort. In this particular case however, any sensible implementation of Enumerable.Distinct will surely be lazy ("yields distinct values are soon as they are seen"), and the order-preserving property follows from that. Lazy evaluation is a core tenet of LINQ to Objects; to rescind it would be unthinkable. – Colonel Panic Nov 30 '17 at 12:17
  • 1
    I've seen implementations using .net 4.6 where calling `dbQuery.OrderBy(...).Distinct().ToList()` does not return a list in the order specified by the order by predicate - removing the Distinct (which happened to be redundant) fixed the bug in my case – Rowland Shaw Jun 20 '18 at 16:28
  • 1
    @RowlandShawMight Queryable is different from Enumerable. You should check the query that is generated. – Wouter Jul 01 '21 at 10:30
3

A bit late to the party, but no one really posted the best complete code to accomplish this IMO, so let me offer this (which is essentially identical to what .NET Framework does with Distinct())*:

    public static IEnumerable<T> DistinctOrdered<T>(this IEnumerable<T> items)
    {
        HashSet<T> returnedItems = new HashSet<T>();
        foreach (var item in items)
        {
            if (returnedItems.Add(item))
                yield return item;
        }                       
    }

This guarantees the original order without reliance on undocumented or assumed behavior. I also believe this is more efficient than using multiple LINQ methods though I'm open to being corrected here.

(*) The .NET Framework source uses an internal Set class, which appears to be substantively identical to HashSet.

Emperor Eto
  • 2,456
  • 2
  • 18
  • 32
  • 3
    It doesn't matter because the order will be dictated by the enumerable being provided. The HashSet just tells us whether the item has been returned yet and if so skips it. – Emperor Eto Nov 20 '21 at 11:31
1

By default when use Distinct linq operator uses Equals method but you can use your own IEqualityComparer<T> object to specify when two objects are equals with a custom logic implementing GetHashCode and Equals method. Remember that:

GetHashCode should not used heavy cpu comparision ( eg. use only some obvious basic checks ) and its used as first to state if two object are surely different ( if different hash code are returned ) or potentially the same ( same hash code ). In this latest case when two object have the same hashcode the framework will step to check using the Equals method as a final decision about equality of given objects.

After you have MyType and a MyTypeEqualityComparer classes follow code not ensure the sequence maintain its order:

var cmp = new MyTypeEqualityComparer();
var lst = new List<MyType>();
// add some to lst
var q = lst.Distinct(cmp);

In follow sci library I implemented an extension method to ensure Vector3D set maintain the order when use a specific extension method DistinctKeepOrder:

relevant code follows:

/// <summary>
/// support class for DistinctKeepOrder extension
/// </summary>
public class Vector3DWithOrder
{
    public int Order { get; private set; }
    public Vector3D Vector { get; private set; }
    public Vector3DWithOrder(Vector3D v, int order)
    {
        Vector = v;
        Order = order;
    }
}

public class Vector3DWithOrderEqualityComparer : IEqualityComparer<Vector3DWithOrder>
{
    Vector3DEqualityComparer cmp;

    public Vector3DWithOrderEqualityComparer(Vector3DEqualityComparer _cmp)
    {
        cmp = _cmp;
    }

    public bool Equals(Vector3DWithOrder x, Vector3DWithOrder y)
    {
        return cmp.Equals(x.Vector, y.Vector);
    }

    public int GetHashCode(Vector3DWithOrder obj)
    {
        return cmp.GetHashCode(obj.Vector);
    }
}

In short Vector3DWithOrder encapsulate the type and an order integer, while Vector3DWithOrderEqualityComparer encapsulates original type comparer.

and this is the method helper to ensure order maintained

/// <summary>
/// retrieve distinct of given vector set ensuring to maintain given order
/// </summary>        
public static IEnumerable<Vector3D> DistinctKeepOrder(this IEnumerable<Vector3D> vectors, Vector3DEqualityComparer cmp)
{
    var ocmp = new Vector3DWithOrderEqualityComparer(cmp);

    return vectors
        .Select((w, i) => new Vector3DWithOrder(w, i))
        .Distinct(ocmp)
        .OrderBy(w => w.Order)
        .Select(w => w.Vector);
}

Note : further research could allow to find a more general ( uses of interfaces ) and optimized way ( without encapsulate the object ).

Lorenzo Delana
  • 532
  • 5
  • 11
1

This highly depends on your linq-provider. On Linq2Objects you can stay on the internal source-code for Distinct, which makes one assume the original order is preserved.

However for other providers that resolve to some kind of SQL for example, that isn´t neccessarily the case, as an ORDER BY-statement usually comes after any aggregation (such as Distinct). So if your code is this:

myArray.OrderBy(x => anothercol).GroupBy(x => y.mycol);

this is translated to something similar to the following in SQL:

SELECT * FROM mytable GROUP BY mycol ORDER BY anothercol;

This obviously first groups your data and sorts it afterwards. Now you´re stuck on the DBMS own logic of how to execute that. On some DBMS this isn´t even allowed. Imagine the following data:

mycol anothercol
1     2
1     1
1     3
2     1
2     3

when executing myArr.OrderBy(x => x.anothercol).GroupBy(x => x.mycol) we assume the following result:

mycol anothercol
1     1
2     1

But the DBMS may aggregate the anothercol-column so, that allways the value of the first row is used, resulting in the following data:

mycol anothercol
1    2
2    1

which after ordering will result in this:

mycol anothercol
2    1
1    2

This is similar to the following:

SELECT mycol, First(anothercol) from mytable group by mycol order by anothercol;

which is the completely reverse order than what you expected.

You see the execution-plan may vary depending on what the underlying provider is. This is why there´s no guarantee about that in the docs.

MakePeaceGreatAgain
  • 35,491
  • 6
  • 60
  • 111