Best way to find distinct items in a big list

Question

I have the following collection, which has more than 500,000 items in it.

List<Item> MyCollection = new List<Item>();

and type:

class Item
{
   public string Name { get; set; }
   public string Description { get; set; }
}

I want to return a list of items that have distinct Name values, i.e. to find the distinct items based on name.

What are the possible ways, and which would be best in terms of time and memory? Both are important, but lower running time has priority over memory.

Does [`Enumerable.Distinct()`](http://msdn.microsoft.com/en-us/library/system.linq.enumerable.distinct.aspx) not do what you want? Or do you want a list of just the items that were unique in the list (which is different from what `Distinct()` does)? — Matthew Watson, Jul 24 '13 at 08:03
possible duplicate of [Faster alternatives to .Distinct()](http://stackoverflow.com/questions/5970983/faster-alternatives-to-distinct) — George Duckett, Aug 01 '13 at 11:00

score 4 · Answer 1 · answered Jul 24 '13 at 08:04

I would opt for LINQ, unless or until the performance turns out to be insufficient:

var considered = from i in MyCollection
                 group i by i.Name into g
                 select new { Name = g.Key, Cnt = g.Count(), Instance = g.First() };
var result = from c in considered where c.Cnt == 1 select c.Instance;

(Assuming I've interpreted your question correctly as "return those items whose Name only appears once in the list")
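
To make that interpretation concrete, here is a minimal sketch of the same query in method syntax. The `UniqueNames` class and the `ItemsWithUniqueName` method are names made up for this illustration, and `Item` is repeated only so that the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Item
{
    public string Name { get; set; }
    public string Description { get; set; }
}

// Hypothetical helper, not part of the original answer.
public static class UniqueNames
{
    // Returns only the items whose Name occurs exactly once in the input.
    public static List<Item> ItemsWithUniqueName(IEnumerable<Item> items)
    {
        return items
            .GroupBy(i => i.Name)          // one group per distinct name
            .Where(g => g.Count() == 1)    // keep names that appear once
            .Select(g => g.First())        // the single item in that group
            .ToList();
    }
}
```

Called as `UniqueNames.ItemsWithUniqueName(MyCollection)` on items named a, b, a, this returns only the b item, because a occurs twice. If one item per name is wanted instead, dropping the `Where` clause keeps the first item of every group.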

answered Jul 24 '13 at 08:04

Damien_The_Unbeliever

That's how I interpreted the question, but it's a little unclear! – Matthew Watson Jul 24 '13 at 08:07

score 2 · Answer 2 · answered Jul 24 '13 at 08:08

I have a Java version of the code.

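Override equals and hashCode in the Item class as below (a HashSet relies on these two methods, not on a Comparator):

@Override
public boolean equals(Object o)
{
   // Items are equal when their names are equal (assumes a String field called name)
   return o instanceof Item && java.util.Objects.equals(name, ((Item) o).name);
}
@Override
public int hashCode() { return java.util.Objects.hashCode(name); }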

Then use the below code in the class in which MyCollection is defined

   HashSet<Item> set1 = new HashSet<Item>();
   set1.addAll(MyCollection);
   MyCollection.clear();
   MyCollection.addAll(set1);

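This leaves MyCollection with one item per name. Note that a HashSet is neither sorted nor guaranteed to keep the original order; a LinkedHashSet keeps insertion order if that matters.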

score 1 · Accepted Answer · edited Jul 24 '13 at 08:10

You can sort your list and then delete all repeated items, but it seems that storing all the data in a Dictionary<string, string> would be better for this task. Or maybe even put the whole list in a HashSet.
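
As a rough sketch of the first two ideas: the class and method names below (`DistinctByNameSketch`, `ViaDictionary`, `ViaSort`) are invented for illustration, and the dictionary maps each name to the whole `Item` instead of to its description, which is a deviation from the `Dictionary<string, string>` suggested above. The dictionary variant assumes `Name` is never null, since a null key would throw.

```csharp
using System;
using System.Collections.Generic;

public class Item
{
    public string Name { get; set; }
    public string Description { get; set; }
}

// Hypothetical helpers, not part of the original answer.
public static class DistinctByNameSketch
{
    // Dictionary variant: expected O(n) time, extra memory for the dictionary.
    // The first item seen for each name is the one that is kept.
    public static List<Item> ViaDictionary(IEnumerable<Item> items)
    {
        var byName = new Dictionary<string, Item>();
        foreach (var item in items)
        {
            if (!byName.ContainsKey(item.Name))
            {
                byName.Add(item.Name, item);
            }
        }
        return new List<Item>(byName.Values);
    }

    // Sort variant: O(n log n) time. Sorts a copy by name, then skips every
    // item whose name equals the previously kept one. List<T>.Sort is not
    // stable, so which of several duplicates survives is unspecified.
    public static List<Item> ViaSort(IEnumerable<Item> items)
    {
        var sorted = new List<Item>(items);
        sorted.Sort((a, b) => string.CompareOrdinal(a.Name, b.Name));

        var result = new List<Item>();
        foreach (var item in sorted)
        {
            bool sameAsLast = result.Count > 0 &&
                string.Equals(result[result.Count - 1].Name, item.Name, StringComparison.Ordinal);
            if (!sameAsLast)
            {
                result.Add(item);
            }
        }
        return result;
    }
}
```

Since the question puts time ahead of memory, the dictionary variant is probably the more natural fit of the two; measuring both on the real data would settle it.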

edited Jul 24 '13 at 08:10

Sergey Berezovskiy

answered Jul 24 '13 at 08:01

Sergio

@lazyberezovsky why not? class item contains two string fields. `Name` could be key and `Description` is a value, just fits this case – Sergio Jul 24 '13 at 08:53
Actually, the problem was about distinct items, so I assumed you had several items with the same name, in which case the appropriate type would be a dictionary that maps each name to a collection of items (or a Lookup). But if the answer solved the problem, then it is of course correct. +1 – Sergey Berezovskiy Jul 24 '13 at 09:12

score 1 · Answer 4 · answered Jul 24 '13 at 08:05

MoreLinq has a DistinctBy extension that is great for this sort of thing. It's open source and just a few lines of code, so it is easy to add to your own code.

var results = MyCollection.DistinctBy(p => p.Name);

answered Jul 24 '13 at 08:05

sa_ddam213

score 1 · Answer 5 · answered Jul 24 '13 at 08:43

I can see you found your answer, but you can also do it fairly simply using Distinct:

internal class NameComparer : IEqualityComparer<Item> {
    public bool Equals(Item x, Item y) { return x.Name == y.Name; }
    public int GetHashCode(Item obj) { return obj.Name.GetHashCode(); }
}

var distinctItems = MyCollection.Distinct(new NameComparer());

score 0 · Answer 6 · answered Jul 24 '13 at 08:04

First solution:

public static IEnumerable<T> DistinctBy<T, TKey>(this IEnumerable<T> sequence, Func<T, TKey> keySelector)
{
    var alreadyUsed = new HashSet<TKey>();            
    foreach (var item in sequence)
    {
        var key = keySelector(item);
        if (alreadyUsed.Add(key))
        {
            yield return item;
        }
    }
}

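The second is to use .Distinct() and override both Equals and GetHashCode in your Item class so that they are based on Name. Distinct() compares hash codes first, so overriding Equals alone is not enough.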
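
A minimal sketch of that second option might look like the following. It assumes ordinal (case-sensitive) name comparison, and the `EqualsOverrideSketch` class with its `DistinctItems` method is a made-up wrapper for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Item
{
    public string Name { get; set; }
    public string Description { get; set; }

    // Two items are considered equal when their names match (ordinal comparison).
    public override bool Equals(object obj)
    {
        var other = obj as Item;
        return other != null && string.Equals(Name, other.Name, StringComparison.Ordinal);
    }

    // Must agree with Equals: items with equal names return equal hash codes.
    public override int GetHashCode()
    {
        return Name == null ? 0 : Name.GetHashCode();
    }
}

// Hypothetical wrapper, not part of the original answer.
public static class EqualsOverrideSketch
{
    // Distinct() with no arguments uses the default comparer, which picks up
    // the Equals and GetHashCode overrides defined on Item.
    public static List<Item> DistinctItems(IEnumerable<Item> items)
    {
        return items.Distinct().ToList();
    }
}
```

One design caveat: this changes what equality means for `Item` everywhere in the program, and because `Name` is mutable, changing it while an item sits in a hash-based collection can make the item impossible to find. A separate comparer or a key selector avoids both issues.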
