4

I have a list of objects. These objects are instances of a custom class that basically contains two string fields, String1 and String2.

What I need to know is whether any of these strings are duplicated in that list. So I want to know if ObjectA.String1 == ObjectB.String1, or ObjectA.String2 == ObjectB.String2, or ObjectA.String1 == ObjectB.String2, or ObjectA.String2 == ObjectB.String1.

Also, I want to mark each object that contains a duplicate string as having a duplicate string (with a bool HasDuplicate on the object).

So when the duplication detection has run I want to simply foreach over the list like so:

foreach (var item in duplicationList)
    if (item.HasDuplicate)
        Console.WriteLine("Duplicate detected!");

This seemed like a nice problem to solve with LINQ, but I cannot for the life of me figure out a good query. So I've solved it using a 'good-old' foreach, but I'm still interested in a LINQ version.

AustinWBryan
Jeroen-bart Engelen

5 Answers

12

Here's a complete code sample which should work for your case.

class A
{
    public string Foo   { get; set; }
    public string Bar   { get; set; }
    public bool HasDupe { get; set; }
}

var list = new List<A> 
          { 
              new A{ Foo="abc", Bar="xyz"}, 
              new A{ Foo="def", Bar="ghi"}, 
              new A{ Foo="123", Bar="abc"}  
          };

var dupes = list.Where(a => list
          .Except(new List<A>{a})
          .Any(x => x.Foo == a.Foo || x.Bar == a.Bar || x.Foo == a.Bar || x.Bar == a.Foo))
          .ToList();

dupes.ForEach(a => a.HasDupe = true);
AustinWBryan
Winston Smith
  • LINQPad is a great tool for figuring out problems like this - every C# developer should have a copy. – Winston Smith Dec 16 '09 at 11:05
  • Nice. Just one point: I would think that moving the logic from Except into the Any method would be a little more efficient, because it avoids creating a List for every element being checked, e.g. var dupes = list.Where(a => list.Any(x => a != x && (x.Foo == a.Foo || x.Bar == a.Bar || x.Foo == a.Bar || x.Bar == a.Foo))).ToList(); – Mike Goatly Aug 28 '12 at 08:42
  • Keep in mind that empty strings are also equal to each other so test for that if you want to ignore empties. – CAD bloke May 05 '15 at 02:21
  • Please, remove the ForEach extension method. It's discouraged and it no longer exists. – SuperJMN Nov 06 '15 at 19:56
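
The suggestions in the comments above can be combined into one sketch: the self-check moves into Any, empty strings are ignored, and a plain foreach loop sets the flag. The names DupeMarker, MarkDupes and Same are hypothetical, and treating null or empty strings as "no value" is an assumption rather than something the question asks for.

```csharp
using System.Collections.Generic;
using System.Linq;

public class A
{
    public string Foo { get; set; }
    public string Bar { get; set; }
    public bool HasDupe { get; set; }
}

// Hypothetical helper class, not part of the original answer.
public static class DupeMarker
{
    // Flags every item that shares a non-empty string with another item.
    public static void MarkDupes(IReadOnlyList<A> list)
    {
        var dupes = list
            .Where(a => list.Any(x =>
                !ReferenceEquals(x, a) &&   // skip the item itself, no Except needed
                (Same(x.Foo, a.Foo) || Same(x.Bar, a.Bar) ||
                 Same(x.Foo, a.Bar) || Same(x.Bar, a.Foo))))
            .ToList();                      // run the query once, then set the flags

        foreach (var a in dupes)
            a.HasDupe = true;
    }

    // Assumption: null and empty strings never count as a match.
    private static bool Same(string s, string t) =>
        !string.IsNullOrEmpty(s) && s == t;
}
```

Calling DupeMarker.MarkDupes(list) on the sample list from the answer should flag the first and third items, since both contain "abc". This is still a nested scan over the list, so it remains quadratic in the number of items.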
5

This should work:

public class Foo
{
    public string Bar;
    public string Baz;
    public bool HasDuplicates;
}

public static void SetHasDuplicate(IEnumerable<Foo> foos)
{
    var dupes = foos
        .SelectMany(f => new[] { new { Foo = f, Str = f.Bar }, new { Foo = f, Str = f.Baz } })
        .Distinct() // Eliminates double entries where Foo.Bar == Foo.Baz
        .GroupBy(x => x.Str)
        .Where(g => g.Count() > 1)
        .SelectMany(g => g.Select(x => x.Foo))
        .Distinct()
        .ToList();

    dupes.ForEach(d => d.HasDuplicates = true);    
}

What you are basically doing is

  1. SelectMany : create a list of all the strings, with their accompanying Foo
  2. Distinct : Remove double entries for the same instance of Foo (Foo.Bar == Foo.Baz)
  3. GroupBy : Group by string
  4. Where : Filter the groups with more than one item in them. These contain the duplicates.
  5. SelectMany : Get the foos back from the groups.
  6. Distinct : Remove double occurrences of foo from the list.
  7. ForEach : Set the HasDuplicates property.

Some advantages of this solution over Winston Smith's solution are:

  1. Easier to extend to more string properties. Suppose there were 5 properties. In his solution, you would have to write 25 comparisons (5 × 5) in the Any clause to check for duplicates. In this solution, it's just a matter of adding the property in the first SelectMany call.
  2. Performance should be much better for large lists. Winston's solution scans the whole list for each item in the list, while this solution makes a single pass and groups the strings by hash. (Winston's solution is O(n²), while this one is roughly O(n).)
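
As a rough illustration of the first point, here is a sketch of the same grouping approach with a third string field. The Item class, its Qux field and the DuplicateFinder name are hypothetical, and g.Skip(1).Any() is used in place of g.Count() > 1 as an optional variation.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical class with one more string field than Foo above.
public class Item
{
    public string Bar;
    public string Baz;
    public string Qux;
    public bool HasDuplicates;
}

public static class DuplicateFinder
{
    public static void SetHasDuplicate(IEnumerable<Item> items)
    {
        var dupes = items
            // Adding a property only means adding it to this array.
            .SelectMany(i => new[] { i.Bar, i.Baz, i.Qux }
                .Distinct()   // a string repeated inside one item counts once
                .Select(s => new { Item = i, Str = s }))
            .GroupBy(x => x.Str)
            .Where(g => g.Skip(1).Any())   // groups with at least two entries
            .SelectMany(g => g.Select(x => x.Item))
            .Distinct()
            .ToList();

        foreach (var d in dupes)
            d.HasDuplicates = true;
    }
}
```

With this shape, an item whose own fields repeat a value is not flagged unless another item also contains that value, which matches the intent of the Distinct step in the answer.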
Geert Baeyaert
  • Does the grouping lazily evaluate its group members? g.Skip(1).Any() might be an improvement over g.Count() > 1 – Jimmy Dec 17 '09 at 00:18
  • @Jimmy It doesn't really matter in this case, because the groups are not lazily evaluated. I do like the Skip(1).Any() trick though. For my own projects, I always have extension methods CountIs(int expected), CountIsGreaterThan(int expected)... which stop evaluating as soon as they know the answer. – Geert Baeyaert Dec 17 '09 at 09:14
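
The comment above does not show those extension methods, so the following is only a guess at what CountIsGreaterThan might look like. The method name comes from the comment; the body and the CountExtensions class name are assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class CountExtensions
{
    // Returns true as soon as more than `expected` elements have been seen,
    // without enumerating the rest of the sequence.
    public static bool CountIsGreaterThan<T>(this IEnumerable<T> source, int expected)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (expected < 0) return true;   // every count is greater than a negative number

        int seen = 0;
        foreach (var unused in source)
        {
            seen++;
            if (seen > expected) return true;   // stop early
        }
        return false;
    }
}
```

Under these assumptions, g.CountIsGreaterThan(1) would be a drop-in replacement for g.Count() > 1 in the query above.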
0

First, if your object doesn't have the HasDuplicate property yet, declare an extension method that implements HasDuplicateProperties:

public static bool HasDuplicateProperties<T>(this T instance)
    where T : SomeClass 
    // where is optional, but might be useful when you want to enforce
    // a base class/interface
{
    // use reflection or something else to determine whether this instance
    // has duplicate properties
    return false;
}

You can use that extension method in queries:

var itemsWithDuplicates = from item in duplicationList
                          where item.HasDuplicateProperties()
                          select item;

The same works with a normal property:

var itemsWithDuplicates = from item in duplicationList
                          where item.HasDuplicate
                          select item;

or

var itemsWithDuplicates = duplicationList.Where(x => x.HasDuplicateProperties());
Sander Rijken
  • That's not my question. I wanted to know how to determine when I have a duplicate so I can set the bool. When the bool is set I know how to get all the objects from the list that have it set. – Jeroen-bart Engelen Dec 16 '09 at 11:02
0

Hat tip to https://stackoverflow.com/a/807816/492

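var duplicates = duplicationList
                .GroupBy(l => l) // relies on the element type's Equals/GetHashCode
                .Where(g => g.Count() > 1)
                .Select(g => {foreach (var x in g)
                                 {x.HasDuplicate = true;}
                             return g;
                })
                .ToList(); // forces the deferred query to run

duplicates is a throwaway, but it gets you there in fewer enumerations. Note that LINQ queries are deferred, so the flags are only set once the query is enumerated; hence the ToList() call.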

Community
CAD bloke
-2
var dups = duplicationList.GroupBy(x => x).Where(y => y.Count() > 1).Select(y => y.Key);

foreach (var d in dups)
    Console.WriteLine(d);
mjsabby
  • I've tested your code in LINQPad using the following program: void Main() { var duplicationList = new List<TestObject> { new TestObject("1", "2"), new TestObject("3", "4"), new TestObject("1", "6") }; var dups = duplicationList.GroupBy(x => x).Where(y => y.Count() > 1).Select(y => y.Key); dups.Dump("Duplicate dump: " + dups.Count()); } public class TestObject { public TestObject(string s1, string s2) { String1 = s1; String2 = s2; IsDuplicate = false; } public string String1; public string String2; public bool IsDuplicate; } It doesn't work: dups contains 0 values. – Jeroen-bart Engelen Dec 16 '09 at 10:56
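
A likely explanation for the empty result, assuming TestObject does not override Equals and GetHashCode: GroupBy(x => x) then compares objects by reference, so every object ends up in its own group. The sketch below contrasts that with grouping by the string values; GroupingDemo and its two methods are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public class TestObject
{
    public TestObject(string s1, string s2) { String1 = s1; String2 = s2; }
    public string String1;
    public string String2;
    public bool IsDuplicate;
}

public static class GroupingDemo
{
    // Groups by the object itself. With default (reference) equality,
    // distinct instances never land in the same group.
    public static int GroupsWithDuplicatesByReference(IEnumerable<TestObject> items) =>
        items.GroupBy(x => x).Count(g => g.Count() > 1);

    // Groups by string value instead and returns the strings that
    // occur in more than one object.
    public static List<string> DuplicatedStrings(IEnumerable<TestObject> items) =>
        items.SelectMany(x => new[] { x.String1, x.String2 }.Distinct())
             .GroupBy(s => s)
             .Where(g => g.Count() > 1)
             .Select(g => g.Key)
             .ToList();
}
```

For the three test objects in the comment, the first method should find no groups with more than one member, while the second should report the string "1".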