5

What is the best algorithm for removing duplicate values from a list? I've tried this:

for (int i = 0; i < AuthorCounter-1; i++)
{
    for (int j = 0; j < AuthorCounter-1; j++)
    {
        if (i != j)
        {
            if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[j].Text)
            {
                AuthorGroupNode.Nodes[j].Remove();
                AuthorCounter--;
            }
        }
    }
}

Here, AuthorGroupNode.Nodes is a list of nodes. It does the job to some extent, but not perfectly. Does anyone have a better solution?

nawfal
Itz.Irshad
  • Best based on what criteria? Lowest CPU, lowest memory, ease of coding? – Eric J. Jul 17 '12 at 04:25
  • Do you need order preserved or not? There are many other StackOverflow questions on removing duplicates; most mention the criteria. OBTW for ease of coding: `list.Distinct().ToList()`, but that's probably not the fastest. – Ray Toal Jul 17 '12 at 04:26
  • Possible duplicate: http://stackoverflow.com/questions/1388361/getting-all-unique-items-in-a-c-sharp-list – Ray Toal Jul 17 '12 at 04:30
  • The .NET version is 3.5, and best means having the lowest CPU processing + memory. – Itz.Irshad Jul 17 '12 at 04:34
  • Lowest CPU and lowest memory are sometimes at odds with each other. – Eric J. Jul 17 '12 at 04:38
  • "Best" can be anything, but "Best Algorithm" always have to be about the fastest way to get a thing done. Surprised by OP's comment – nawfal Jul 17 '12 at 04:38
  • how many members in the list? I do not know if you would really need to worry on spec bottleneck here, rather speed should be the concern, provided u have tens of thousands of items in that list – nawfal Jul 17 '12 at 04:40
  • @nawfal: If you have a memory-constrained environment, "best" could be an algorithm that gets the job done in-place, without allocating more memory, even if the CPU cost is higher. – Eric J. Jul 17 '12 at 05:13
  • @Ray: Actually .Distinct().ToList() is O(n). See my answer for details and for Jon Skeet's re-implementation of .Distinct. – Eric J. Jul 17 '12 at 05:13
  • Was there a correct answer here? – Glenn Ferrie Mar 05 '13 at 04:30

4 Answers

6

Your current algorithm is O(N²), which will perform quite poorly for a large list.

If space is not an issue, you could keep a HashSet<int> of hashes of nodes. Traverse the list once. If the hash of the node is in the HashSet, you know this is a duplicate node. Skip it. If the hash is not in the HashSet, add this node to a new list, and add the hash of the node to the HashSet.

This runs in O(N) and requires memory for the original list, for a copy of the list less any duplicates, and for the HashSet. The algorithm is non-destructive.
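A minimal sketch of that approach, assuming the nodes are WinForms TreeNodes compared by their Text (note that two distinct strings can share a hash, so keying on hashes alone could, in a rare collision, drop a non-duplicate):

// Sketch only: keep the first node seen for each distinct Text hash.
var seenHashes = new HashSet<int>();
var uniqueNodes = new List<TreeNode>();

foreach (TreeNode node in AuthorGroupNode.Nodes)
{
    int hash = node.Text.GetHashCode();
    if (seenHashes.Add(hash))      // Add returns false if the hash was already seen
    {
        uniqueNodes.Add(node);     // first occurrence: keep it
    }
    // else: duplicate (or hash collision) - skip it
}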

If you can use Linq, simply do

var distinctList = originalList.Distinct().ToList();

UPDATE

Discovered that's pretty much exactly how Jon Skeet re-implemented Distinct.

public static IEnumerable<TSource> Distinct<TSource>( 
    this IEnumerable<TSource> source) 
{ 
    return source.Distinct(EqualityComparer<TSource>.Default); 
} 

public static IEnumerable<TSource> Distinct<TSource>( 
    this IEnumerable<TSource> source, 
    IEqualityComparer<TSource> comparer) 
{ 
    if (source == null)  
    { 
        throw new ArgumentNullException("source"); 
    } 
    return DistinctImpl(source, comparer ?? EqualityComparer<TSource>.Default); 
} 

private static IEnumerable<TSource> DistinctImpl<TSource>( 
    IEnumerable<TSource> source, 
    IEqualityComparer<TSource> comparer) 
{ 
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer); 
    foreach (TSource item in source) 
    { 
        if (seenElements.Add(item)) 
        { 
            yield return item; 
        } 
    } 
}

https://codeblog.jonskeet.uk/2010/12/30/reimplementing-linq-to-objects-part-14-distinct/

SiHa
Eric J.
4

This works like a treat:

var xs = new []
{
    2, 3, 2, 4, 3, 3, 5, 6,
};

var ys = xs
    .ToLookup(z => z, z => z)
    .Select(x => x.First());

Here `ys` yields 2, 3, 4, 5, 6: the lookup groups equal values together (in first-seen order) and `First()` takes one representative from each group. For your code it would look something like this:

var nodes = AuthorGroupNode.Nodes
    .ToLookup(z => z.Text, z => z)
    .Select(x => x.First())
    .ToArray();

Can't be much simpler than that. :-)

Enigmativity
  • @EricJ. - `Distinct` won't work on custom objects unless they implement `Equals` & `GetHashCode` which we don't know for these node objects. – Enigmativity Jul 17 '12 at 06:38
  • @Enigmativity What does a `ToLookup` return? Is this going to be faster than the hashset approach? – nawfal Jul 17 '12 at 08:42
  • @nawfal - It returns a `System.Linq.ILookup<T1, T2>` - which is essentially an `IEnumerable<IGrouping<T1, T2>>` but keyed by `T1`. – Enigmativity Jul 17 '12 at 10:41
  • @nawfal - It also uses a hashset internally (from memory) so it is quite fast. What matters though is if it makes the code clear and if it is fast enough. – Enigmativity Jul 17 '12 at 11:23
  • @Enigmativity thanks, I get it. I was more interested in speed than code clarity as was my requirement. – nawfal Jul 17 '12 at 11:33
  • @nawfal - You asked for "best", not "fastest". Depends how you approach "best" doesn't it. :-) – Enigmativity Jul 17 '12 at 11:35
3

Piggybacking off of Eric J.'s answer... You'll want to implement an EqualityComparer to have complete control over how distinct items are identified.

class Program
{
    static void Main(string[] args)
    {
        var list = new List<SampleClass>();
        // add some items

        var distinctItems = list.Distinct(new SampleClass());
    }
}

public class SampleClass : EqualityComparer<SampleClass>
{
    public string Text { get; set; }

    public override bool Equals(SampleClass x, SampleClass y)
    {
        if (x == null || y == null) return false;
        return x.Text == y.Text;
    }

    public override int GetHashCode(SampleClass obj)
    {
        if (obj == null) return 0;
        if (obj.Text == null) return 0;
        return obj.Text.GetHashCode();
    }
}

more info: http://msdn.microsoft.com/en-us/library/bb338049
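If the nodes in question are WinForms TreeNodes (an assumption; the question doesn't say), the same idea could be a standalone comparer keyed on Text. TreeNodeCollection only implements the non-generic IEnumerable, hence the Cast:

public class NodeTextComparer : IEqualityComparer<TreeNode>
{
    public bool Equals(TreeNode x, TreeNode y)
    {
        if (x == null || y == null) return false;
        return x.Text == y.Text;
    }

    public int GetHashCode(TreeNode obj)
    {
        if (obj == null || obj.Text == null) return 0;
        return obj.Text.GetHashCode();
    }
}

// TreeNodeCollection is non-generic, so Cast<TreeNode>() is needed:
var distinctNodes = AuthorGroupNode.Nodes
    .Cast<TreeNode>()
    .Distinct(new NodeTextComparer())
    .ToList();

Keeping the comparer separate from the data class also avoids having to instantiate a data object just to get a comparer.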

Glenn Ferrie
2

You never check the last element of the list; your second for loop needs to be changed to this to work:

for (int j = 0; j < AuthorCounter; j++)

You are also checking each pair of nodes twice: first when i = 0 and j = 1, then again when i = 1 and j = 0. There's no need to start j before or equal to i. When i = 0, your inner loop removes all duplicates of that element, so you know AuthorGroupNode.Nodes[0] is unique; the next time through the outer loop you can be sure AuthorGroupNode.Nodes[1] is unique, and so on. You can therefore start j at i + 1 and drop the check for i == j.

Also, when you remove the node at j, the node after it shifts into position j, and incrementing j would skip it. So only increment j when you don't remove a node:

for (int j = i + 1; j < AuthorCounter;)
{
    if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[j].Text)
    {
        AuthorGroupNode.Nodes[j].Remove();
        AuthorCounter--;
    }
    else
    {
        j++;
    }
}

You say that works, but not perfectly, so I'm assuming you're not using a standard List and that your nodes handle their own removal from the list with the Remove() method.

If the list is sorted by the field you're comparing, you could remove the inner for loop altogether and remove any duplicates of your current element until you find a different one:

for (int i = 0; i < AuthorCounter-1;)
{
    if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[i + 1].Text)
    {
        AuthorGroupNode.Nodes[i].Remove();
        AuthorCounter--;
    }
    else
    {
        i++;
    }
}
Jason Goemaat