14

I'm having trouble coming up with the most efficient algorithm to remove duplicates from List<List<int>>, for example (I know this looks like a list of int[], but just doing it that way for visual purposes:

my_list[0]= {1, 2, 3};
my_list[1]= {1, 2, 3};
my_list[2]= {9, 10, 11};
my_list[3]= {1, 2, 3};

So the output would just be

new_list[0]= {1, 2, 3};
new_list[1]= {9, 10, 11};

Let me know if you have any ideas. I would really appreciate it.

Leniel Maccaferri
  • 100,159
  • 46
  • 371
  • 480
marseilles84
  • 396
  • 2
  • 7
  • 21

6 Answers6

13

Build custom of EqualityComparer<List<int>>:

public class CusComparer : IEqualityComparer<List<int>>
{
    public bool Equals(List<int> x, List<int> y)
    {
        return x.SequenceEqual(y);
    }

    public int GetHashCode(List<int> obj)
    {
        int hashCode = 0;

        for (var index = 0; index < obj.Count; index++)
        {
            hashCode ^= new {Index = index, Item = obj[index]}.GetHashCode();
        }

        return hashCode;
    }
}

Then you can get the result by using Distinct with custom comparer method:

var result = my_list.Distinct(new CusComparer());

Edit:

Include the index into method GetHashCode to make sure different orders will not be equal

cuongle
  • 74,024
  • 28
  • 151
  • 206
  • 3
    That hashcode will lead to a lot of collisions - e.g. `{a,a}` will collide for any `a`, `{a,b}` will collide with `{b,a}` as will any permutation... (Although it may be that you _want_ permutations to collide, in which case, nice answer!) – Rawling Oct 08 '12 at 15:42
  • @Rawling: you are correct, waiting the answer form Tim's comment, I am trying to fix – cuongle Oct 08 '12 at 15:46
  • Here is a hash code generator I feel is better: `return obj.Take(5).Aggregate(1, (current, item) => (current * 37) + item.GetHashCode());` First off, I wouldn't iterate the whole sequence. Hashes are only efficient if they are generated quickly; iterating the whole list defeats that purpose. The first 5 or so (edit that number as needed) should be good. If the first several are the same the lists are likely different. Next, a good general algorithm form combining N different hashes into one is to go through each, multiply the current a prime number, and then add the next hash. – Servy Oct 08 '12 at 16:09
10

This simple program does what you want:

using System;
using System.Collections.Generic;
using System.Linq;

namespace ConsoleApplication6
{
    class Program
    {
        static void Main(string[] args)
        {
            List<List<int>> lists = new List<List<int>>();

            lists.Add(new List<int> { 1, 2, 3 });
            lists.Add(new List<int> { 1, 2, 3 });
            lists.Add(new List<int> { 9, 10, 11 });
            lists.Add(new List<int> { 1, 2, 3 });

            var distinct = lists.Select(x => new HashSet<int>(x))
                    .Distinct(HashSet<int>.CreateSetComparer());

            foreach (var list in distinct)
            {
                foreach (var v in list)
                {
                    Console.Write(v + " ");
                }

                Console.WriteLine();
            }
        }
    }
}
Leniel Maccaferri
  • 100,159
  • 46
  • 371
  • 480
10
    var finalList = lists.GroupBy(x => String.Join(",", x))
                         .Select(x => x.First().ToList())
                         .ToList();
L.B
  • 114,136
  • 19
  • 178
  • 224
6

You can use the LINQ Distinct overload that takes a comparer. The comparer should see if the lists are equal. Note that the default equals operations of lists won't do what you're really looking for, so the comparer will need to loop through each for you. Here's an example of such a comparer:

public class SequenceComparer<T> : IEqualityComparer<IEnumerable<T>>
{
    IEqualityComparer<T> itemComparer;
    public SequenceComparer()
    {
        this.itemComparer = EqualityComparer<T>.Default;
    }

    public SequenceComparer(IEqualityComparer<T> itemComparer)
    {
        this.itemComparer = itemComparer;
    }

    public bool Equals(IEnumerable<T> x, IEnumerable<T> y)
    {
        if (object.Equals(x, y))
            return true;
        if (x == null || y == null)
            return false;
        return x.SequenceEqual(y, itemComparer);
    }

    public int GetHashCode(IEnumerable<T> obj)
    {
        if (obj == null)
            return -1;
        int i = 0;
        return obj.Aggregate(0, (x, y) => x ^ new { Index = i++, ItemHash = itemComparer.GetHashCode(y) }.GetHashCode());
    }
}

Update: I got the idea of using an anonymous type to make a better hash from Cuong Le's answer, and I LINQ-ified it and made it work in my class.

Tim S.
  • 55,448
  • 7
  • 96
  • 122
  • Note that T in List has to implement IComparable. If T is a custom type, you'll have to do this yourself. – Michael Sallmen Oct 08 '12 at 15:32
  • @MichaelSallmen I've shown an implementation that optionally takes an `IEqualityComparer` to specify how to compare `T` object. Good point, though. (the default equality comparer on a type without appropriate comparison interfaces defined will only check for reference equality) – Tim S. Oct 08 '12 at 15:47
  • @Servy yes, indeed. It's just an example implementation, though. Writing a good one is not trivial (e.g. see Coung Le's better implementation, that still has issues). Perhaps multiplying each item's hash by the next larger prime number before XORing would be better? – Tim S. Oct 08 '12 at 15:49
  • @TimS. I would just leave a `//TODO generate hash` rather than putting what you have; at least that way readers will know they need to find a good algorithm on their own without thinking that this is fine. – Servy Oct 08 '12 at 15:56
  • @Servy I've replaced it with a good implementation. Otherwise, I'd agree with you. – Tim S. Oct 08 '12 at 16:06
1

For small sets of data, a comparer could be useful, but if you have 1000 or more List> then trying to compare them all could begin to take a long amount of time.

I suggest that you instead use your data to build a distinct tree. The building of the tree will be much faster and when you are done you can always bring your data back into your old data structure.

Jastill
  • 656
  • 3
  • 15
0

I wanted to compare the performance of the answers of @Leniel Macaferi and @L.B as I wasn't sure which would be more performant, or if the difference would be significant. It turns out that the difference is very significant:

Method 1: 00:00:00.0976649 @Leniel Macaferi
Method 2: 00:00:32.0961650 @L.B

Here is the code I used to benchmark them:

public static void Main(string[] args)
        {
            var list = new List<List<int>> {new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,}, new List<int> {1, 2, 31, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 6}, new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 9, 10, 11, 1}, new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 9}, new List<int> {1, 2, 31, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 6, 7}, new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 9, 10, 11}, new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,}, new List<int> {1, 2, 31, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 6}, new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 9, 10, 11}};

            var sw1 = new Stopwatch();
            sw1.Start();

            for (var i = 0; i < 1_000_000; i++)
            {
                var distinct = list.Select(x => new HashSet<int>(x)).Distinct(HashSet<int>.CreateSetComparer());
            }

            sw1.Stop();
            Console.WriteLine($"Method 1: {sw1.Elapsed}");

            var sw2 = new Stopwatch();
            sw2.Start();
            for (var i = 0; i < 1_000_000; i++)
            {
                var distinct = list.GroupBy(a => string.Join(",", a)).Select(a => a.First()).ToList();

            }
            sw2.Stop();
            Console.WriteLine($"Method 2: {sw2.Elapsed}");

            Console.ReadKey();
        }
Aalawlx
  • 593
  • 5
  • 25