1

After reading this very interesting thread on duplicate removal, I ended up with this:

    public static IEnumerable<T> deDuplicateCollection<T>(IEnumerable<T> input)
    {
        var hs = new HashSet<T>();
        foreach (T t in input)
            if (hs.Add(t))
                yield return t;
    }        

By the way, as I'm brand new to C# and coming from Python, I'm a bit lost between casting and this kind of thing. I was able to compile and build with:

            foreach (KeyValuePair<long, List<string>> kvp in d)
            {
                d[kvp.Key] = (List<string>) deDuplicateCollection(kvp.Value);
            }

But I must have missed something here, as I get a "System.InvalidCastException" at runtime. Could you point out what I should know about casting and where I'm going wrong? Thank you in advance.

lctv31

2 Answers

3

First, about the usage of the method.

Drop the cast and call ToList() on the result of the method. The method returns an IEnumerable<string>, which is not a List<string>. The fact that the source was originally a List<string> is irrelevant: you don't return that list, you yield return a sequence, and the object the compiler generates for that sequence cannot be cast to List<string>.

d[kvp.Key] = deDuplicateCollection(kvp.Value).ToList();
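
To make the difference concrete, here is a minimal sketch (the sample data and the `CastDemo` class name are made up for illustration) showing that the object returned by the iterator method is not a `List<string>`, and that `ToList()` copies the sequence into a real list:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CastDemo
{
    // Same iterator method as in the question.
    public static IEnumerable<T> deDuplicateCollection<T>(IEnumerable<T> input)
    {
        var hs = new HashSet<T>();
        foreach (T t in input)
            if (hs.Add(t))
                yield return t;
    }

    public static void Main()
    {
        var source = new List<string> { "a", "b", "a", "c", "b" };
        IEnumerable<string> result = deDuplicateCollection(source);

        // The runtime object is a compiler-generated iterator, not a List<string>,
        // so "as" yields null (a direct cast would throw InvalidCastException).
        Console.WriteLine(result is List<string>);            // False
        Console.WriteLine((result as List<string>) == null);  // True

        // ToList() enumerates the sequence and copies it into a new List<string>.
        List<string> list = result.ToList();
        Console.WriteLine(string.Join(",", list.ToArray()));  // a,b,c
    }
}
```

Coming from Python, a rough analogy is that the method behaves like a generator function: you get something you can iterate over, and you have to materialize it yourself if you want a list.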

Second, your deDuplicateCollection method is redundant: Distinct() already exists in the library and does the same job.

d[kvp.Key] = kvp.Value.Distinct().ToList();

Just be sure you have a using System.Linq; in the directives so you can use these Distinct() and ToList() extension methods.
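
As a small sketch of `Distinct()` on its own, including the overload that takes an equality comparer: the sample data and the `DistinctIgnoreCase` helper are hypothetical. In LINQ to Objects the first occurrence of each element is kept in its original position in practice, although the documentation does not promise any ordering, so treat the order shown in the comments as observed behavior rather than a guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctDemo
{
    // Hypothetical helper: removes duplicates, treating "b" and "B" as equal.
    public static List<string> DistinctIgnoreCase(IEnumerable<string> input)
    {
        return input.Distinct(StringComparer.OrdinalIgnoreCase).ToList();
    }

    public static void Main()
    {
        var values = new List<string> { "b", "a", "B", "c", "a" };

        // Default equality: "b" and "B" are different strings.
        Console.WriteLine(string.Join(",", values.Distinct().ToArray()));           // b,a,B,c

        // Case-insensitive comparer: "B" counts as a duplicate of "b".
        Console.WriteLine(string.Join(",", DistinctIgnoreCase(values).ToArray()));  // b,a,c
    }
}
```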

Finally, you'll notice that with this change alone you run into a new exception (an InvalidOperationException) when you assign to the dictionary inside the loop. You cannot modify a collection while a foreach is enumerating it. The simplest way to do what you want is to omit the explicit loop entirely. Consider

d = d.ToDictionary(kvp => kvp.Key, kvp => kvp.Value.Distinct().ToList());
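
For reference, a self-contained sketch of that one-liner with the types from the question. The sample data, the `Dedupe` wrapper, and the `original` variable are made up for illustration; `original` is there only to show what happens to a second reference to the old dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToDictionaryDemo
{
    // Hypothetical wrapper: returns a NEW dictionary whose lists have no duplicates.
    public static Dictionary<long, List<string>> Dedupe(Dictionary<long, List<string>> d)
    {
        return d.ToDictionary(kvp => kvp.Key, kvp => kvp.Value.Distinct().ToList());
    }

    public static void Main()
    {
        var d = new Dictionary<long, List<string>>
        {
            { 1L, new List<string> { "x", "y", "x" } },
            { 2L, new List<string> { "z", "z" } }
        };
        var original = d;   // second reference to the same dictionary object

        d = Dedupe(d);      // d now refers to a different dictionary

        Console.WriteLine(d[1L].Count);                         // 2
        Console.WriteLine(original[1L].Count);                  // 3
        Console.WriteLine(object.ReferenceEquals(d, original)); // False
    }
}
```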

This uses another LINQ extension method, ToDictionary(). Note: this creates a new dictionary in memory and updates d to reference it. If you need to keep the original dictionary object that d refers to (for example, because other code holds a reference to it), you need a different approach. A simple option is to build a second dictionary that shadows d, and then update d from it.

var shadow = new Dictionary<long, List<string>>();
foreach (var kvp in d)
{
    shadow[kvp.Key] = kvp.Value.Distinct().ToList();
}

foreach (var kvp in shadow)
{
    d[kvp.Key] = kvp.Value;
}

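These two loops are safe because neither one modifies the collection it is enumerating: the first writes only to shadow, and the second enumerates shadow while writing to d. The price is that you loop twice in order to keep the original dictionary object while never updating it during its own enumeration.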
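
A variation on the same idea, not taken from the answer above: instead of shadowing the whole dictionary, copy only the keys and loop over that copy. This is a sketch under the assumption that the dictionary has the `Dictionary<long, List<string>>` shape from the question; the `DedupeInPlace` name and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InPlaceDemo
{
    // Hypothetical helper: de-duplicates every list while keeping the same dictionary object.
    public static void DedupeInPlace(Dictionary<long, List<string>> d)
    {
        // d.Keys.ToList() takes a snapshot of the keys, so the foreach enumerates
        // the snapshot and the dictionary itself is free to be updated.
        foreach (long key in d.Keys.ToList())
        {
            d[key] = d[key].Distinct().ToList();
        }
    }

    public static void Main()
    {
        var d = new Dictionary<long, List<string>>
        {
            { 1L, new List<string> { "x", "y", "x" } },
            { 2L, new List<string> { "z", "z" } }
        };
        var alias = d;   // second reference to the same dictionary object

        DedupeInPlace(d);

        Console.WriteLine(string.Join(",", alias[1L].ToArray())); // x,y
        Console.WriteLine(alias[2L].Count);                       // 1
    }
}
```

The trade-off is a temporary list of keys instead of a temporary dictionary of values; both approaches leave every existing reference to the dictionary pointing at the updated data.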

Anthony Pegram
  • Is this really supposed to assign a list as the key? – Welton v3.62 Oct 18 '11 at 13:31
  • @Weltonv3.51, this does not assign a list *as* the key, it associates the newly distinct list as the `.Value` *for* the pair containing the `.Key`, as in the original code snippet in the question. – Anthony Pegram Oct 18 '11 at 13:32
  • Thank you, I think I get the point now. If I read correctly, Distinct came with recent .NET releases, and I was looking for something generic as a first step. I also wanted to keep the order (I forgot to mention this in my question). By the way, I get another exception, as I cannot change a dictionary while walking/iterating over it. – lctv31 Oct 18 '11 at 13:33
  • @user1001170, `Distinct()` will preserve the order, it will be a near carbon-copy of the method you wrote, although it has some overloads that allow you to specify a comparer if default equality is insufficient. Also, I thought about specifying you need .NET 3.5+ to use these methods, but the same is true for `HashSet`, it is also new with 3.5. – Anthony Pegram Oct 18 '11 at 13:34
  • As for your exception, I'll update. It's true, you'll run into that because of the foreach. – Anthony Pegram Oct 18 '11 at 13:35
  • @user1001170 "recent" as in "2007-11-19" if the wiki is right. It was introduced in 3.5 – xanatos Oct 18 '11 at 13:35
  • OK, so as an alternative to HashSet I should have used a dictionary with ContainsKey and Add (as `var` also looks like a 3.5 feature): `Dictionary<T, bool> d = new Dictionary<T, bool>(); foreach (T t in input) if (!d.ContainsKey(t)) { d.Add(t, true); yield return t; }` – lctv31 Oct 18 '11 at 13:43
  • `var` is a keyword for type inference; the variable is still strongly typed as always, it just saves you keystrokes. At compile time there's no difference, and you can still see the type in IntelliSense by hovering. – Anthony Pegram Oct 18 '11 at 13:45
  • As for your new approach, there's nothing necessarily wrong with exploring what you can do with code, but try not to reinvent the wheels the library already provides for you. If you're interested in seeing how LINQ could work under the covers, and maybe picking up something for your own code, consider reading Jon Skeet's "Edulinq" series. Search for it on Google. – Anthony Pegram Oct 18 '11 at 13:46
2
d[kvp.Key] = kvp.Value.Distinct().ToList();

There is already a Distinct extension method to remove duplicates!

xanatos