24

Let's say I have a List with duplicate values and I want to remove the duplicates.

List<int> myList = new List<int>(Enumerable.Range(0, 10000));

// adding a few duplicates here
myList.Add(1); 
myList.Add(2);
myList.Add(3);

I have found 3 approaches to solve this:

List<int> result1 = new HashSet<int>(myList).ToList(); //3700 ticks
List<int> result2 = myList.Distinct().ToList(); //4700 ticks
List<int> result3 = myList.GroupBy(x => x).Select(grp => grp.First()).ToList(); //18800 ticks
//referring to pinturic's comment:
List<int> result4 = new SortedSet<int>(myList).ToList(); //18000 ticks

In most answers here on SO, the Distinct approach is shown as the "correct one", yet the HashSet is always faster!

My question: is there anything I have to be aware of when I use the HashSet approach and is there another more efficient way?

fubo
  • 44,811
  • 17
  • 103
  • 137

1 Answers1

25

There is a big difference between these two approaches:

List<int> Result1 = new HashSet<int>(myList).ToList(); //3700 ticks
List<int> Result2 = myList.Distinct().ToList(); //4700 ticks

The first one can (will probably) change the order of the elements of the returned List<>: Result1 elements won't be in the same order of myList's ones. The second maintains the original ordering.

There is probably no faster way than the first one.

There is probably no "more correct" (for a certain definition of "correct" based on ordering) than the second one.

(the third one is similar to the second one, only slower)

Just out of curiousity, the Distinct() is:

// Reference source http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,712
public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source) {
    if (source == null) throw Error.ArgumentNull("source");
    return DistinctIterator<TSource>(source, null);
}

// Reference source http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,722
static IEnumerable<TSource> DistinctIterator<TSource>(IEnumerable<TSource> source, IEqualityComparer<TSource> comparer) {
    Set<TSource> set = new Set<TSource>(comparer);
    foreach (TSource element in source)
        if (set.Add(element)) yield return element;
}

So in the end the Distinct() simply uses an internal implementation of an HashSet<> (called Set<>) to check for the uniqueness of items.

For completeness sake, I'll add a link to the question Does C# Distinct() method keep original ordering of sequence intact?

Community
  • 1
  • 1
xanatos
  • 109,618
  • 12
  • 197
  • 280
  • thank you for confirming my assumption. with "correct" i referred to the accepted answers. – fubo May 21 '15 at 07:12
  • 3
    @fubo In the end it all depends if you want to maintain ordering or not. – xanatos May 21 '15 at 07:12
  • @fubo [`Distinct`](https://msdn.microsoft.com/library/vstudio/bb338049.aspx) uses an `IEqualityComparer<>` , and [`HashSet<>`](https://msdn.microsoft.com/library/bb359100.aspx) uses the same – xanatos May 21 '15 at 07:16
  • If you need to keep the order you can ever relay on SortedSet<> that has the same interface as HashSet – pinturic May 21 '15 at 07:47
  • @pinturic `SortedSet` "gives" an order. 1, 2, 3... perhaps my items where 1, 3, 2 and I *like* to have them 1, 3, 2 – xanatos May 21 '15 at 07:54
  • Yes you are right so you can use OrderedDictionary<> to mimic the behaviour of a set using values equal to keys for example. – pinturic May 21 '15 at 07:57
  • 1
    @pinturic - the SortedSet approach takes about 18000 ticks in my benchmark and is as slow as the 3rd approach – fubo May 21 '15 at 08:27
  • 1
    -1 `Distinct()` returns an unordered set. The fact that the specific implementation you're looking at preserves order is coincidental - have you checked that it works the same on all the Mono/Unity/Xamarin/XBox CLRs? – BlueRaja - Danny Pflughoeft Jul 05 '15 at 19:29