I know there are tons of similar questions on SO regarding this subject, but I couldn't quite find the answer I was looking for. Here's my requirement.
I have a long list of strings (easily upwards of 50,000 or even 100K items) in which I need to find the duplicate items. But just finding duplicates won't do; what I really want is to go through the list and append an incrementing index to each item, indicating which occurrence of that item it is. To illustrate, here's an example. My list actually contains paths, so the example roughly resembles that.
My original list:
AAA\BBB
AAA\CCC
AAA\CCC
BBB\XXX
BBB
BBB\XXX
BBB\XXX
My adjusted list with indices added:
AAA\BBB[1]
AAA\CCC[1]
AAA\CCC[2]
BBB\XXX[1]
BBB[1]
BBB\XXX[2]
BBB\XXX[3]
First I tried the following method using LINQ:
    List<string> originalList = new List<string>();
    List<string> duplicateItems = new List<string>();
    // pathList is a simple List<string> that contains my paths.
    foreach (string item in pathList)
    {
        // Do some stuff here and pick 'item' only if it fits some criteria.
        if (IsValid(item))
        {
            originalList.Add(item);
            // Count how many times 'item' has been seen so far (rescans the whole list).
            int occurrences = originalList.Where(x => x.Equals(item)).Count();
            duplicateItems.Add(item + "[" + occurrences + "]");
        }
    }
This works just fine and gives me the desired result. The problem is that it's painfully slow, because the Where(...).Count() call rescans originalList for every item, and my list can contain 100K items. So I looked around and learned that HashSet could be a more efficient alternative. But I can't quite figure out how to get my exact desired result using it.
I could try something like this, I guess:
    HashSet<string> originalList = new HashSet<string>();
    List<string> duplicateItems = new List<string>();
    foreach (string item in pathList)
    {
        // Do some stuff here and pick 'item' only if it fits some criteria.
        if (IsValid(item))
        {
            if (!originalList.Add(item))
            {
                duplicateItems.Add(item + "[" + ??? + "]");
            }
        }
    }
Later I could add "[1]" to all items in the HashSet, but how do I get the indices right (marked by the universal sign of confusion, ???, above) when adding an item to my duplicate list? I can't keep a single reference int to pass to my method, as there could be hundreds of different repeating items, each repeating a different number of times, as in my example.
Could I still use HashSet, or is there a better way of accomplishing my goal? Even a slight pointer in the right direction would be a great help.
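For reference, one possible direction is to keep a running count per string in a Dictionary<string, int> instead of a HashSet, so each item's index is a single lookup rather than a rescan of the list. This is only a minimal sketch under assumptions: the names PathIndexer and AddOccurrenceIndices are hypothetical, and the IsValid filtering from the snippets above is left to the caller.

```csharp
using System;
using System.Collections.Generic;

public static class PathIndexer
{
    // Hypothetical helper: appends "[n]" to each item, where n is the
    // number of times that exact string has been seen so far (1-based).
    public static List<string> AddOccurrenceIndices(IEnumerable<string> items)
    {
        var counts = new Dictionary<string, int>();
        var result = new List<string>();

        foreach (string item in items)
        {
            // TryGetValue leaves 'seen' at 0 when the key is not present yet.
            int seen;
            counts.TryGetValue(item, out seen);
            seen++;
            counts[item] = seen;

            result.Add(item + "[" + seen + "]");
        }

        return result;
    }
}
```

Called with the seven paths from the example at the top, this should produce the adjusted list shown there, in the same order. Since each item costs one dictionary lookup and one update, the total work grows roughly linearly with the list size.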