
What is the most efficient way to remove duplicates in an IList in C# without Linq

I have the following code from another SO question [1]:

IList<IList<int>> output = new List<IList<int>>();
var lists = output;
for (int i = 0; i < lists.Count; ++i)
{
    // Since we want to compare sequences, ensure the same order of the items
    var item = lists[i].OrderBy(x => x).ToArray();
    for (int j = lists.Count - 1; j > i; --j)
        if (item.SequenceEqual(lists[j].OrderBy(x => x)))
            lists.RemoveAt(j);
}

I am using this in a bigger coding challenge, and without LINQ or syntactic sugar I am trying to see if there is an elegant/fast solution.

I am thinking of just using a hash, but I am not sure what kind of hash function to use to identify that a list is already present.

More clearly, for an input like

{{1,2,4, 4}, {3,4,5}, {4,2,1,4} }

The intermediate output is [sorted input/output is fine]

{{1,2,4,4}, {3,4,5}, {1,2,4,4} }

Output:

{{1,2,4,4}, {3,4,5}}
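To make the hashing idea concrete, here is a rough sketch of what I mean: sort a copy of each inner list into a canonical form and use that as a key in a `HashSet` (the string key is only for illustration; a proper hash function is what I am asking about):

```csharp
using System;
using System.Collections.Generic;

// Sketch: build a canonical key for each inner list by sorting a copy
// (List<T>.Sort, no LINQ) and joining it into a string, then let a
// HashSet<string> reject keys it has already seen.
var input = new List<List<int>>
{
    new List<int> { 1, 2, 4, 4 },
    new List<int> { 3, 4, 5 },
    new List<int> { 4, 2, 1, 4 },
};

var seen = new HashSet<string>();
var deduped = new List<List<int>>();
foreach (var list in input)
{
    var copy = new List<int>(list);
    copy.Sort();                          // canonical order
    var key = string.Join(",", copy);     // e.g. "1,2,4,4"
    if (seen.Add(key))                    // Add returns false for a duplicate
        deduped.Add(list);
}

foreach (var list in deduped)
    Console.WriteLine(string.Join(", ", list));
```

This keeps the first occurrence of each multiset, so the output here is `1, 2, 4, 4` and `3, 4, 5`, but string keys allocate heavily, which is why I am asking about a real hash.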

Dexters
  • HashSet has a method `SetEquals` which will check two sets much faster, but then you can't have duplicate numbers inside a HashSet. Is that a problem? – M.kazem Akhgary Jan 16 '17 at 23:22
  • @M.kazemAkhgary My lists inside can contain duplicates, I edited the sample now – Dexters Jan 16 '17 at 23:25
  • 1
    It's really hard without LINQ. What I was thinking is to group numbers, then put keys with their values inside a dictionary to provide fast checking of numbers; that will use much more space but it will be much faster. – M.kazem Akhgary Jan 16 '17 at 23:30
  • It's hard without linq for sure. See my updated answer, I think I have caught up now. – Stuart Jan 16 '17 at 23:45
  • 2
    Obviously the people who implemented the LINQ `Distinct` sequence operator did so without being able to use LINQ's `Distinct` operator. If you want to know how they did it, read the source code. But it looks like you are doing some operation other than straight up `Distinct`. Nevertheless, it will probably give you some insights. – Eric Lippert Jan 16 '17 at 23:55
  • 1
    Put another way: can you write a solution that uses LINQ sequence operators? If so then you can write a solution without using LINQ sequence operators, because none of those operators are magic. They were all implemented by people like you. – Eric Lippert Jan 16 '17 at 23:57
  • Thanks, I have some idea now. I was using this in another problem to quickly check my correctness, and then I realized this part may not be as trivial. I had some doubts, especially whether such a hash can be constructed, and most answers had LINQ in some way. – Dexters Jan 17 '17 at 00:16
  • Why the output you provided doesn't include duplicated items inside lists? Do you want them also be removed? – Yarik Jan 17 '17 at 00:46
  • @Yarik Thanks that was a mistake – Dexters Jan 17 '17 at 01:19

2 Answers


I have used a modified version of the internals of CollectionAssert.AreEquivalent from Microsoft:

using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        var lists = new List<List<int>>
        {
            new List<int> {1, 4, 2},
            new List<int> {3, 4, 5},
            new List<int> {1, 2, 4}
        };

        var dedupe =
            new List<List<int>>(new HashSet<List<int>>(lists, new MultiSetComparer<int>()));
    }

    // Equal if sequence contains the same number of items, in any order
    public class MultiSetComparer<T> : IEqualityComparer<IEnumerable<T>>
    {
        public bool Equals(IEnumerable<T> first, IEnumerable<T> second)
        {
            if (first == null)
                return second == null;

            if (second == null)
                return false;

            if (ReferenceEquals(first, second))
                return true;

            // Shortcut when we can cheaply look at counts
            var firstCollection = first as ICollection<T>;
            var secondCollection = second as ICollection<T>;
            if (firstCollection != null && secondCollection != null)
            {
                if (firstCollection.Count != secondCollection.Count)
                    return false;

                if (firstCollection.Count == 0)
                    return true;
            }

            // Now compare elements
            return !HaveMismatchedElement(first, second);
        }

        private static bool HaveMismatchedElement(IEnumerable<T> first, IEnumerable<T> second)
        {
            int firstNullCount;
            int secondNullCount;

            // Create dictionary of unique elements with their counts
            var firstElementCounts = GetElementCounts(first, out firstNullCount);
            var secondElementCounts = GetElementCounts(second, out secondNullCount);

            if (firstNullCount != secondNullCount || firstElementCounts.Count != secondElementCounts.Count)
                return true;

            // make sure the counts for each element are equal, exiting early as soon as they differ
            foreach (var kvp in firstElementCounts)
            {
                var firstElementCount = kvp.Value;
                int secondElementCount;
                secondElementCounts.TryGetValue(kvp.Key, out secondElementCount);

                if (firstElementCount != secondElementCount)
                    return true;
            }

            return false;
        }

        private static Dictionary<T, int> GetElementCounts(IEnumerable<T> enumerable, out int nullCount)
        {
            var dictionary = new Dictionary<T, int>();
            nullCount = 0;

            foreach (T element in enumerable)
            {
                if (element == null)
                {
                    nullCount++;
                }
                else
                {
                    int num;
                    dictionary.TryGetValue(element, out num);
                    num++;
                    dictionary[element] = num;
                }
            }

            return dictionary;
        }

        public int GetHashCode(IEnumerable<T> enumerable)
        {
            int hash = 17;
            // Create and sort list in-place, rather than OrderBy(x=>x), because linq is forbidden in this question
            var list = new List<T>(enumerable);
            list.Sort();
            foreach (T val in list)
                hash = hash * 23 + (val == null ? 42 : val.GetHashCode());

            return hash;
        }
    }
}

This uses `HashSet<T>`; adding to this collection automatically ignores duplicates.

The last line could read:

var dedupe = new HashSet<List<int>>(lists, new MultiSetComparer<int>()).ToList();

Technically that uses the System.Linq namespace, but I don't think that is your concern with LINQ.

I will echo what Eric Lippert has said: you are asking us to show you the raw workings of LINQ, but the framework internals are not a closed box. Also, if you are thinking that reading the source code of these methods will reveal obvious inefficiencies and opportunities to optimize, I find that quite often they are not easy to spot; you are better off reading the docs and measuring.
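As a side note on `GetHashCode`: since combining per-element hashes with a commutative operation (addition or XOR) is order-insensitive by itself, the sort can in principle be skipped entirely. The trade-off is a weaker distribution (for example `{1,3}` and `{2,2}` collide under plain addition), which `Equals` then has to resolve. A minimal, unbenchmarked sketch:

```csharp
using System;
using System.Collections.Generic;

// Sketch of an order-insensitive hash: sum the element hashes, so
// {1,2,4,4} and {4,2,1,4} produce the same value without sorting first.
// unchecked lets the sum overflow silently, which is fine for hashing.
static int OrderInsensitiveHash(IEnumerable<int> items)
{
    unchecked
    {
        int hash = 17;
        foreach (var item in items)
            hash += item.GetHashCode();
        return hash;
    }
}

var a = OrderInsensitiveHash(new[] { 1, 2, 4, 4 });
var b = OrderInsensitiveHash(new[] { 4, 2, 1, 4 });
var c = OrderInsensitiveHash(new[] { 3, 4, 5 });
Console.WriteLine(a == b); // True  (same multiset, same hash)
Console.WriteLine(a == c); // False
```

Whether the extra collisions cost more than the sort-and-multiply scheme saves is something you would have to measure on your data.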

Stuart
  • OP wants to remove duplicate lists not duplicate items inside lists – M.kazem Akhgary Jan 16 '17 at 23:20
  • Apologies, it wasn't clear from the title, but was obvious from the example. I have now updated =) – Stuart Jan 16 '17 at 23:24
  • 1
    @Stuart Nice pattern to break functions from initial checks, can you add some comments, especially GetHashCode would be helpful. – Dexters Jan 17 '17 at 00:12
  • Comments added, the `GetHashCode` logic is a common pattern, [you can read more here](http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode) – Stuart Jan 17 '17 at 00:25
  • Why do you need to create a sorted list in `GetHashCode` when the order of the additions does not matter? I think that creating unnecessary allocations in the methods like `GetHashCode` is really bad idea. – Yarik Jan 17 '17 at 14:58
  • Exactly, so 2 different lists in different orders should produce the same hash code, that's why it's sorted. The original implementation uses `OrderBy`, this allocates too. Your implementation while simpler does not include this optimization Microsoft have put in, or the other fast paths. That being said, what was asked for was "elegant/fast", but it turns out `elegant != fast` – Stuart Jan 17 '17 at 15:08
  • @Stuart You don't need `OrderBy`, since arithmetic addition is commutative and associative. All 'fast path' optimizations are already implemented by HashSet's comparer (which I use), so there is no need to reinvent the wheel. – Yarik Jan 17 '17 at 15:22
  • arithmetic addition is, but it's addition with multiplication. – Stuart Jan 17 '17 at 15:26
  • @Stuart My fault - I haven't notice multiplication. I'm not sure why it's needed (HashSet's comparer uses XOR, which is also commutative and associative, and that's why they don't need to do sorting). – Yarik Jan 17 '17 at 15:39
  • I have benchmarked (very crudely) and side-by-side mine allocates less, and for the data in your code takes half the time. I only allocate twice, once in the sort and once for the Dictionary, which turns out to be more efficient, although I agree, I would have expected allocations in `GetHashCode` to be bad for perf *shrug*. – Stuart Jan 17 '17 at 15:42

I think this is much simpler than the accepted answer, and it doesn't use the System.Linq namespace at all.

using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        IList<IList<int>> lists = new List<IList<int>>
        {
            new List<int> {1, 2, 4, 4},
            new List<int> {3, 4, 5},
            new List<int> {4, 2, 1, 4},
            new List<int> {1, 2, 2},
            new List<int> {1, 2},
        };

        // There is no Multiset data structure in C#, but we can represent it as a set of tuples,
        // where each tuple contains an item and the number of its occurrences.

        // The dictionary below will not allow adding the same multiset twice, while keeping track of the original lists.
        var multisets = new Dictionary<HashSet<Tuple<int, int>>, IList<int>>(HashSet<Tuple<int, int>>.CreateSetComparer());
        foreach (var list in lists)
        {
            // Count the number of occurrences of each item in the list.
            var set = new Dictionary<int, int>();
            foreach (var item in list)
            {
                int occurrences;
                set[item] = set.TryGetValue(item, out occurrences) ? occurrences + 1 : 1;
            }

            // Create a set of tuples that we could compare.
            var multiset = new HashSet<Tuple<int, int>>();
            foreach (var kv in set)
            {
                multiset.Add(Tuple.Create(kv.Key, kv.Value));
            }

            if (!multisets.ContainsKey(multiset))
            {
                multisets.Add(multiset, list);
            }
        }

        // Print results.
        foreach (var list in multisets.Values)
        {
            Console.WriteLine(string.Join(", ", list));
        }
    }
}

And the output will be:

1, 2, 4, 4
3, 4, 5
1, 2, 2
1, 2
Yarik