2

I'm having a hard time deduping a list based on a specific delimiter.

For example I have 4 strings like below:

apple|pear|fruit|basket
orange|mango|fruit|turtle
purple|red|black|green
hero|thor|ironman|hulk

In this example I want my list to contain only unique values in column 3, so the result would be a list that looks like this:

apple|pear|fruit|basket
purple|red|black|green
hero|thor|ironman|hulk

In the above example I would have gotten rid of line 2 because line 1 had the same value in column 3. Any help would be awesome; deduping is tough in C#.

How I'm testing this:

    static void Main(string[] args)
    {
        BeginListSet = new List<string>();
        startHashSet();
    }


    public static List<string> BeginListSet { get; set; }

    public static void startHashSet()
    {
        string[] BeginFileLine = File.ReadAllLines(@"C:\testit.txt");
        foreach (string begLine in BeginFileLine)
        {

            BeginListSet.Add(begLine);
        }

    }

    public static IEnumerable<string> Dedupe(IEnumerable<string> list, char seperator, int keyIndex)
    {
        var hashset = new HashSet<string>();
        foreach (string item in list)
        {
            var array = item.Split(seperator);
            if (hashset.Add(array[keyIndex]))
                yield return item;
        }
    }
Anthony Pegram

5 Answers

6

Something like this should work for you

static IEnumerable<string> Dedupe(this IEnumerable<string> input, char seperator, int keyIndex)
{
    var hashset = new HashSet<string>();
    foreach (string item in input)
    {
        var array = item.Split(seperator);
        if (hashset.Add(array[keyIndex]))
            yield return item;
    }
}

...

var list = new string[] 
{
    "apple|pear|fruit|basket", 
    "orange|mango|fruit|turtle",
    "purple|red|black|green",
    "hero|thor|ironman|hulk"
};

foreach (string item in list.Dedupe('|', 2))
    Console.WriteLine(item);

Edit: In the linked question Distinct() with Lambda, Jon Skeet presents the idea in a much better fashion, in the form of a DistinctBy custom method. While similar, his is far more reusable than the idea presented here.

Using his method, you could write

var deduped = list.DistinctBy(item => item.Split('|')[2]);

And you could later reuse the same method to "dedupe" another list of objects of a different type by a key of possibly yet another type.
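
For readers who do not have such a method at hand, the following is a minimal sketch of a generic "distinct by key" extension in the spirit of that idea. It is not Jon Skeet's code; the class name `EnumerableExtensions` and the method name `DistinctByKey` are hypothetical, chosen so the method does not clash with the `Enumerable.DistinctBy` that ships with .NET 6 and later.

```csharp
using System;
using System.Collections.Generic;

public static class EnumerableExtensions
{
    // Hypothetical name, chosen to avoid clashing with the built-in
    // Enumerable.DistinctBy available in .NET 6 and later.
    public static IEnumerable<TSource> DistinctByKey<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        var seenKeys = new HashSet<TKey>();
        foreach (TSource element in source)
        {
            // HashSet.Add returns false when the key is already present,
            // so only the first element seen for each key is yielded.
            if (seenKeys.Add(keySelector(element)))
                yield return element;
        }
    }
}
```

With this in scope, `list.DistinctByKey(item => item.Split('|')[2])` should keep the first line for each distinct third column. Like `Dedupe`, it is an iterator block, so nothing runs until the result is enumerated.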

Anthony Pegram
  • Nice. was just typing that up as you posted. – dlev Aug 12 '11 at 02:58
  • When I try to type list.Dedupe, IntelliSense doesn't detect Dedupe. I tried Dedupe(BeginHashSet, '|', 2); but then I get "Extension method must be defined in a non-generic static class". In my test I have everything being read into a HashSet first. –  Aug 12 '11 at 12:52
  • @Mike, read up on [extension methods](http://msdn.microsoft.com/en-us/library/bb383977.aspx) if you are not familiar with them, but the quick takeaway is that extension methods must be static methods defined in a static class. You can also convert this to a "regular" instance or static method: simply drop the `this` modifier before the first parameter and optionally drop the `static` modifier before the method return type. – Anthony Pegram Aug 12 '11 at 13:32
  • You also do not need to load your strings into a `HashSet` first, unless that just happens to be how you're originally storing them. You can use an array, generic list, etc. – Anthony Pegram Aug 12 '11 at 13:33
  • HashSet is how I am originally storing it, so when I do Dedupe(BeginHashSet, '|', 0); it never actually enters the code when I debug. Same with Dedupe(BeginHashSet, ',', 0); and if I try list.Dedupe it says it does not contain .Dedupe. –  Aug 12 '11 at 14:55
  • I've made an edit to the original question, maybe you can see why I'm having issues. –  Aug 12 '11 at 15:00
  • @Mike, 1) The code in your question is storing the lines in a `List`, not a HashSet (not that it matters). 2) The code never calls Dedupe. You may have posted the wrong snippet. – Anthony Pegram Aug 12 '11 at 15:11
  • Sorry for the confusion Anthony, I've tried BeginListSet.Dedupe as in your example and it won't compile. Then I tried Dedupe(BeginListSet, '|', 2) and it never actually enters Dedupe when I have my breakpoints in. –  Aug 12 '11 at 17:23
  • @Mike - When you call `Dedupe`, do you try to iterate over the result? This is an example of an iterator block, it is lazily evaluated. Try to loop over the result of the method or optionally invoke `ToList()` on the result. – Anthony Pegram Aug 12 '11 at 18:32
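
The two issues raised in these comments (the "non-generic static class" compiler error and the breakpoint that is never hit) can be illustrated with a small sketch. The class names `StringListExtensions` and `DedupeDemo` and the method `Run` are hypothetical and exist only for illustration; `Run` could be called from `Main`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Extension methods must be declared in a non-generic, non-nested static class.
public static class StringListExtensions
{
    public static IEnumerable<string> Dedupe(
        this IEnumerable<string> input, char separator, int keyIndex)
    {
        var seenKeys = new HashSet<string>();
        foreach (string item in input)
        {
            string[] columns = item.Split(separator);
            // Yield only the first line seen for each key column value.
            if (seenKeys.Add(columns[keyIndex]))
                yield return item;
        }
    }
}

public static class DedupeDemo
{
    // Hypothetical helper for illustration only.
    public static List<string> Run()
    {
        var lines = new List<string>
        {
            "apple|pear|fruit|basket",
            "orange|mango|fruit|turtle",
            "purple|red|black|green",
            "hero|thor|ironman|hulk"
        };

        // No code inside Dedupe has run yet: this call only creates the iterator.
        IEnumerable<string> pending = lines.Dedupe('|', 2);

        // ToList() enumerates the iterator, so a breakpoint inside Dedupe is hit here.
        return pending.ToList();
    }
}
```

Calling `Dedupe` and discarding the result, as in `Dedupe(BeginListSet, '|', 2);`, never executes the method body, which would explain breakpoints that are never reached.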
0

Can you use a HashSet instead? That will eliminate dupes automatically for you as they are added.

kprobst
  • I would have used a HashSet, but I didn't think it would work in this situation because the entire entry in the HashSet would be different than the key I was using. –  Aug 12 '11 at 03:16
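
One way a `HashSet<string>` could still hold the full lines while comparing only the key column is to give it a custom `IEqualityComparer<string>`. The following is a rough sketch under that assumption; `ColumnComparer` is a hypothetical name, not something from the answers here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: two lines count as equal when the column
// at the configured index has the same value.
public sealed class ColumnComparer : IEqualityComparer<string>
{
    private readonly char _separator;
    private readonly int _keyIndex;

    public ColumnComparer(char separator, int keyIndex)
    {
        _separator = separator;
        _keyIndex = keyIndex;
    }

    // Extracts the key column from a delimited line.
    private string KeyOf(string line) => line.Split(_separator)[_keyIndex];

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return KeyOf(x) == KeyOf(y);
    }

    // Equal keys must produce equal hash codes, so hash the key, not the line.
    public int GetHashCode(string line) => KeyOf(line).GetHashCode();
}
```

A set created with `new HashSet<string>(new ColumnComparer('|', 2))` should then reject `"orange|mango|fruit|turtle"` once `"apple|pear|fruit|basket"` has been added. Two caveats: this splits a line every time it is hashed or compared, and `HashSet<T>` does not promise any particular enumeration order, so the iterator approach is probably the better fit when order matters.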
0

Maybe you can sort the lines alphabetically by their |-delimited words, then store them in a grid (columns). Then, when you try to insert a line, just check whether a column already contains a word starting with the same character.

Zenwalker
0

Try this:

var list = new string[]
{
    "apple|pear|fruit|basket",
    "orange|mango|fruit|turtle",
    "purple|red|black|green",
    "hero|thor|ironman|hulk"
};

var dedup = new List<string>();     // key values (third column) seen so far
var filtered = new List<string>();  // lines kept after deduping
foreach (var s in list)
{
    var filter = s.Split('|')[2];
    if (dedup.Contains(filter)) continue;  // key already seen, skip this line
    filtered.Add(s);
    dedup.Add(filter);
}

// Print the surviving lines, one per line.
Console.WriteLine(string.Join(Environment.NewLine, filtered));
Mrchief
0

If LINQ is an option, you can do something like this:

// assume strings is a collection of strings
List<string> list = strings.Select(a => a.Split('|')) // split each line by '|'
   .GroupBy(a => a[2])  // group by third column
   .Select(a => a.First()) // select first line from each group
   .Select(a => string.Join("|", a))
   .ToList(); // convert to list of strings

Edit (per Jeff Mercado's comment), this can be simplified further:

List<string> list = 
   strings.GroupBy(a => a.Split('|')[2])  // group by third column
   .Select(a => a.First()) // select first line from each group
   .ToList(); // convert to list of strings
drf
  • The first projection isn't really necessary at all. Just do the grouping performing the split and taking the third column at once. – Jeff Mercado Aug 12 '11 at 03:06
  • Thanks, that simplifies it a lot. I edited the answer to show it without the redundant projection. – drf Aug 12 '11 at 03:22