3

I know there are tons of similar questions on SO regarding this subject, but I couldn't quite find the answer I was looking for. Here's my requirement.

I have a long list of strings (easily upwards of 50,000 or even 100K items) in which I need to find the duplicate items. But just finding duplicates won't do; what I really want to do is go through the list and add an increment index at the end of each item to indicate the number of times an item repeats. To better illustrate let me take an example. My list actually contains paths, so the example roughly resembles that.

My original List:

AAA\BBB
AAA\CCC
AAA\CCC
BBB\XXX
BBB
BBB\XXX
BBB\XXX

My adjusted list with indices added:

AAA\BBB[1]
AAA\CCC[1]
AAA\CCC[2]
BBB\XXX[1]
BBB[1]
BBB\XXX[2]
BBB\XXX[3]

First I tried the following method using Linq:

List<string> originalList = new List<string>();
List<string> duplicateItems = new List<string>();

// pathList is a simple List<string> that contains my paths.
foreach (string item in pathList)
{
    // Do some stuff here and pick 'item' only if it fits some criteria.
    if (IsValid(item))
    {
        originalList.Add(item);
        int occurrences = originalList.Count(x => x.Equals(item));
        duplicateItems.Add(item + "[" + occurrences + "]");
    }
}

This works just fine and gives me the desired result. The problem is that it's painfully slow: the `Count` call rescans the growing list for every item, which makes the loop quadratic, and my list can contain 100K items. So I looked around and learned that `HashSet` could be a possible alternative that's potentially more efficient. But I can't quite figure out how I would get my exact desired result using it.

I could try something like this, I guess:

HashSet<string> originalList = new HashSet<string>();
List<string> duplicateItems = new List<string>();

foreach (string item in pathList)
{
    // Do some stuff here and pick 'item' only if it fits some criteria.
    if (IsValid(item))
    {
        if (!originalList.Add(item))
        {
            duplicateItems.Add(item + "[" + ??? + "]");
        }
    }
}

Later I could add "[1]" to all items in the HashSet, but how do I get the indices right (marked by the universal sign of confusion, ???, above) when adding an item to my duplicate list? I can't keep a single reference int to pass to my method, as there could be hundreds of different repeating items, each repeating a different number of times, as in my example.

Could I still use HashSet, or is there a better way of accomplishing my goal? Even a slight pointer in the right direction would be a great help.

Sach
  • Do you want them all smashed into one list at the end? – maccettura Jul 13 '17 at 19:18
  • Preferably, yes. But I'd consider other alternatives too if it's not too slow and not too many lists. – Sach Jul 13 '17 at 19:19
  • Does the order of the elements in the result list matter? – NineBerry Jul 13 '17 at 19:20
  • You can't use a `HashSet` for your original list because `HashSet` doesn't store duplicates. – itsme86 Jul 13 '17 at 19:21
  • @NineBerry no it doesn't. – Sach Jul 13 '17 at 19:22
  • Can't you just group by and count? – Trioj Jul 13 '17 at 19:23
  • @itsme86 Maybe there's a confusion about my choice of variable names. The list that contains duplicates is 'pathList'. The HashSet is 'originalList' and what I tried to do with HashSet is put unique items in it, and put duplicates in 'duplicateItems', and then later combine them. If I can get the indices right that is. – Sach Jul 13 '17 at 19:25
  • @Trioj No, that's exactly what I did as shown in my first code example. It works but it's too slow so I'm looking at possible efficient alternatives. – Sach Jul 13 '17 at 19:26
  • That's not even remotely what you did in the code example I see. – Trioj Jul 13 '17 at 19:27
  • @Trioj Yea sorry, my bad. I focused on the 'count' part and missed the 'groupby' part. Would it be considerably faster? Let me give it a try. – Sach Jul 13 '17 at 19:29
  • In all honesty, I have no idea how fast it will be -- I usually don't expect LINQ to be lightning, but I suspect this will be faster than the original code. I don't have a large enough record set to test on at the moment and verify. – Trioj Jul 13 '17 at 19:31
  • This question might give you some inspiration: https://stackoverflow.com/questions/365615/in-net-which-loop-runs-faster-for-or-foreach – budi Jul 13 '17 at 21:34

7 Answers

10

Since you ask for the fastest, the best IMO would be to use a plain foreach loop with a counting Dictionary<string, int>. It has the same lookup complexity as HashSet and uses much less memory than LINQ GroupBy:

var counts = new Dictionary<string, int>(pathList.Count); // specify max capacity to avoid rehashing
var duplicateItems = new List<string>(pathList.Count);    // collects the indexed output
foreach (string item in pathList)
{
    // Do some stuff here and pick 'item' only if it fits some criteria.
    if (IsValid(item))
    {
        int count;
        counts.TryGetValue(item, out count);
        counts[item] = ++count;
        duplicateItems.Add(item + "[" + count + "]");
    }
}
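A quick way to exercise this approach, using the sample paths from the question; `AddIndices` and the always-true validity stub are illustrative names for this sketch, not part of the original code:

```csharp
using System;
using System.Collections.Generic;

static class CountingDemo
{
    // Same single-pass counting idea as the snippet above, wrapped in a
    // method so it can be called with sample data. The validity check is
    // passed in as a delegate; here it is stubbed to accept everything.
    public static List<string> AddIndices(IEnumerable<string> pathList, Func<string, bool> isValid)
    {
        var counts = new Dictionary<string, int>();
        var duplicateItems = new List<string>();
        foreach (string item in pathList)
        {
            if (isValid(item))
            {
                int count;
                counts.TryGetValue(item, out count); // leaves count at 0 if absent
                counts[item] = ++count;
                duplicateItems.Add(item + "[" + count + "]");
            }
        }
        return duplicateItems;
    }

    static void Main()
    {
        var pathList = new List<string>
        {
            @"AAA\BBB", @"AAA\CCC", @"AAA\CCC",
            @"BBB\XXX", @"BBB", @"BBB\XXX", @"BBB\XXX"
        };
        foreach (var s in AddIndices(pathList, _ => true))
            Console.WriteLine(s);
        // AAA\BBB[1]
        // AAA\CCC[1]
        // AAA\CCC[2]
        // BBB\XXX[1]
        // BBB[1]
        // BBB\XXX[2]
        // BBB\XXX[3]
    }
}
```

This reproduces the adjusted list from the question while touching each item only once.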
Ivan Stoev
  • Thanks, let me try this as well and compare to the other solution. – Sach Jul 13 '17 at 19:45
  • I ran the three answers by maccettura, Ivan Stoev, and Markus on a set of 116 lists, 10 times each. The averages come out as follows in milliseconds: maccettura = 13819, Ivan Stoev = 13809, Markus = 12966. So it appears that they are all more or less the same. – Sach Jul 13 '17 at 20:55
  • The answer by Markus is just a duplicate of mine. As for the LINQ `GroupBy` based solution, as I mentioned in the answer, the difference is in the memory used and GC pressure - if you have 100K unique items, `GroupBy` will allocate an additional 100K arrays. – Ivan Stoev Jul 13 '17 at 22:00
  • @IvanStoev @Sach it is interesting that the difference between the two approaches is almost a second though both use a dictionary. I thought yours should be faster because you initialized the size of the dictionary. But there are still differences between our answers. Maybe it's the ordinal string comparison I used in my approach, the iterator or the `string.Format`. – Markus Jul 14 '17 at 05:16
  • @Markus please do note though that the time measures above are quick and dirty .NET DateTime value measures done hastily and may not be the most accurate. So please do consider those values with this caveat. – Sach Jul 14 '17 at 19:53
4

You could try this, although I have not performance tested it yet:

List<string> originalList = new List<string>()
{
    @"AAA\BBB",
    @"AAA\CCC",
    @"AAA\CCC",
    @"BBB\XXX",
    @"BBB",
    @"BBB\XXX",
    @"BBB\XXX"
};
List<string> outputList = new List<string>();

foreach (var g in originalList.GroupBy(x => x))
{   
    var index = 1;  
    foreach(var item in g)
    {
        outputList.Add(string.Format("{0}[{1}]", item, index++));
    }
}


maccettura
  • Yeah, this looks closer to what I'd expect. – Trioj Jul 13 '17 at 19:26
  • Let me give it a go and report back. – Sach Jul 13 '17 at 19:30
  • Yeah as far as I can tell the OP wanted the count to increment for every duplicate item. If I just added `Count()`, not only would it iterate through every group, but it would only give the total, not an increment. – maccettura Jul 13 '17 at 19:40
  • *IF* OP does *NOT* need the incrementing counts, then I'd go with something like var groups = list.GroupBy(x => x).Select(g => string.Format("{0}[{1}]", g.Key, g.Count())); – Trioj Jul 13 '17 at 19:41
  • Yeah, I do need the incrementing count. I did try @maccettura's solution and it works pretty well; a list of 76131 items is done in under 1 second, whereas my LINQ attempt takes far longer. – Sach Jul 13 '17 at 19:41
  • Nice. I like this implementation given the requirements. – Trioj Jul 13 '17 at 19:42
  • I ran the three answers by maccettura, Ivan Stoev, and Markus on a set of 116 lists, 10 times each. The averages come out as follows in milliseconds: maccettura = 13819, Ivan Stoev = 13809, Markus = 12966. So it appears that they are all more or less the same. – Sach Jul 13 '17 at 20:56
1

What about this?

static IEnumerable<string> MyCounter(IEnumerable<string> data)
{
    var myDic = new Dictionary<string, int>();
    foreach (var d in data)
    {
        if (!myDic.ContainsKey(d))
            myDic[d] = 1;
        else
            myDic[d] = myDic[d] + 1;
        yield return d + "[" + myDic[d] + "]";
    }
}
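For completeness, a sketch of how this might be called; the iterator is copied into a runnable shell so the block is self-contained, and the sample data is trimmed from the question:

```csharp
using System;
using System.Collections.Generic;

static class MyCounterDemo
{
    // Copy of the answer's iterator so the sketch compiles on its own.
    public static IEnumerable<string> MyCounter(IEnumerable<string> data)
    {
        var myDic = new Dictionary<string, int>();
        foreach (var d in data)
        {
            if (!myDic.ContainsKey(d))
                myDic[d] = 1;
            else
                myDic[d] = myDic[d] + 1;
            yield return d + "[" + myDic[d] + "]";
        }
    }

    static void Main()
    {
        var pathList = new List<string> { @"AAA\CCC", @"AAA\CCC", @"BBB" };
        foreach (var s in MyCounter(pathList))
            Console.WriteLine(s);
        // AAA\CCC[1]
        // AAA\CCC[2]
        // BBB[1]
    }
}
```

One thing to keep in mind: as an iterator method this is lazy, and each fresh enumeration builds a new dictionary and recounts from scratch, so materialize the result (e.g. with ToList()) if you need to read it more than once.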
Jonathan Nappee
1

You could iterate over the list and use a dictionary to get the count, like this:

private int GetCount(IDictionary<string, int> counts, string item)
{
  int count;
  if (!counts.TryGetValue(item, out count))
    count = 0;
  count++;
  counts[item] = count;
  return count;
}

private IEnumerable<string> GetItems(IEnumerable<string> items)
{
  // Initialize dict for counts with appropriate comparison
  var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
  foreach(var item in items)
    yield return string.Format("{0}[{1}]", item, GetCount(counts, item));
}
Markus
  • I ran the three answers by maccettura, Ivan Stoev, and Markus on a set of 116 lists, 10 times each. The averages come out as follows in milliseconds: maccettura = 13819, Ivan Stoev = 13809, Markus = 12966. So it appears that they are all more or less the same. – Sach Jul 13 '17 at 20:56
0

You could just use GroupBy() to pull the strings together and then project those groups into a combination of value and count.

Given your list of strings:

var listOfStrings = new List<string>(); // populate with your paths
var grouped = listOfStrings.GroupBy(x => x);
var groupedCount = grouped.Select(x => new {key = x.Key, count = x.Count()});
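Run against the question's sample data, that projection yields one total per distinct path rather than the running indices the question asks for; a self-contained sketch (names illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GroupCountDemo
{
    static void Main()
    {
        var listOfStrings = new List<string>
        {
            @"AAA\BBB", @"AAA\CCC", @"AAA\CCC",
            @"BBB\XXX", @"BBB", @"BBB\XXX", @"BBB\XXX"
        };

        // GroupBy preserves the order in which keys first appear.
        var groupedCount = listOfStrings
            .GroupBy(x => x)
            .Select(g => new { Key = g.Key, Count = g.Count() });

        foreach (var g in groupedCount)
            Console.WriteLine("{0}[{1}]", g.Key, g.Count);
        // AAA\BBB[1]
        // AAA\CCC[2]
        // BBB\XXX[3]
        // BBB[1]
    }
}
```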
Jesse Carter
0

You can use this crisp and crunchy code:

public static void Main()
{
    var originalList  = new List<string>()
    {
        @"AAA\BBB",
        @"AAA\CCC",
        @"AAA\CCC",
        @"BBB\XXX",
        @"BBB",
        @"BBB\XXX",
        @"BBB\XXX"
    };

    var outputList = originalList.GroupBy(x => x).SelectMany(x => x.Select((y, i) => string.Format("{0}[{1}]", y, i + 1)));     

    Console.WriteLine(string.Join("\n", outputList));
}
Prateek Pandey
0

Using a HashSet

Note: Dump() is a LinqPad method that prints the results to the screen — substitute as necessary.

void Main()
{
    var list = new List<string> {"hello", "doctor", "name", "continue", "yesterday", "tomorrow", "HELLO"};
    
    //case-insensitive string compare
    list.HasDuplicates(StringComparer.OrdinalIgnoreCase).Dump();

    //case-sensitive string compare
    list.HasDuplicates().Dump();

    //integer compare
    var list2 = new List<int> { 1,2,3,4,5,2 };
    list2.HasDuplicates().Dump();
}

public static class Test
{
    public static bool HasDuplicates<T>(this IList<T> list, StringComparer stringComparer = null)
    {
        if (typeof(T) == typeof(string))
        {
            var hash = new HashSet<string>(list.Count, stringComparer);
            foreach (var val in list) if (!hash.Add(val?.ToString())) break;
            return hash.Count != list.Count;
        }
        else
        {
            var hash = new HashSet<T>(list.Count);
            foreach (var val in list) if (!hash.Add(val)) break;
            return hash.Count != list.Count;
        }
    }
}

/*

output:

True
False
True
*/
  • They don't ask for *a* method, but the fastest method. A new answer should show a comparison with the existing answers otherwise it doesn't answer the question. – Gert Arnold Mar 21 '22 at 19:10