1

I'm trying to develop a LINQ query that will identify objects that have duplicate values. I only need the objects where a string in a multivalued attribute matches a string in the same attribute on another object AND the "name" values don't match.

I am trying to use the following code, but it does not work because it doesn't seem possible to use the "o" variable in a subquery.

myList.Where(o => myList.Any(a => a.name != o.name && a.multival.Any(p => o.multival.Contains(p))))
Trevor
  • 1
    Please post a more detailed question with code. It's hard to answer without knowing what your object looks like. – Sach Apr 22 '19 at 21:33
  • 2
    Readability is king, @App-Devon is spot on. Don't try to write this in LINQ unless it's just for learning purposes – reggaeguitar Apr 22 '19 at 22:25
  • 1
    Using the o variable in the subquery should work just fine. What is the actual error that you are getting? – Jeffrey L Whitledge Apr 22 '19 at 22:47
  • 1
    Your query seems fine. If I write this class `public class X { public string name; public IEnumerable<string> multival; }` then your code runs just fine. You need to provide a [mcve] that demonstrates the problem. – Enigmativity Apr 23 '19 at 23:54
  • 1
    Have you checked that myList is not null, or not being populated? – App-Devon Apr 24 '19 at 05:07

2 Answers

2

Why even use LINQ for this? It will be convoluted and difficult to read. I would solve this problem with a nested foreach loop:

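// An interface cannot be instantiated, so collect the results in a List<T>.
var listOfDuplicates = new List<YourObjectType>();
foreach (var a in myList)
{
     foreach (var b in myList)
     {
         // Intersect(...).Any() (needs "using System.Linq;") checks for a shared string;
         // comparing the collections with == would only compare references.
         if (a.name != b.name && a.multival.Intersect(b.multival).Any())
         {
             listOfDuplicates.Add(a);
             break; // one match is enough, and this avoids adding a more than once
         }
     }
}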

In response to the comments, this is how one could implement a method that exits early, similar to LINQ's FirstOrDefault() and other methods that stop after a given number of matches:

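public IEnumerable<YourObjectType> FindDuplicates(IEnumerable<YourObjectType> myList, int maxDupes)
{
    var listOfDuplicates = new List<YourObjectType>();
    foreach (var a in myList)
    {
        foreach (var b in myList)
        {
            // A shared string in multival plus a different name marks a as a duplicate.
            if (a.name != b.name && a.multival.Intersect(b.multival).Any())
            {
                listOfDuplicates.Add(a);
                // Stop as soon as the requested number of duplicates has been found.
                if (listOfDuplicates.Count == maxDupes)
                    return listOfDuplicates;
                break; // move on to the next a so it is not added twice
            }
        }
    }
    return listOfDuplicates;
}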
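
If the caller should decide how many results to consume, as suggested in the comments, an iterator method is one option. The following is only a sketch: the `Item` class, the `DuplicateFinder` name, and the sample data are assumptions made for illustration, not the OP's actual types. Because of `yield return`, nothing executes until the caller enumerates, so `Take(n)` or `FirstOrDefault()` stops the loops early.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's object type.
public class Item
{
    public string name;
    public string[] multival;
}

public static class DuplicateFinder
{
    // Lazily yields each item that shares at least one multival string
    // with a differently named item. Work happens only while enumerating.
    public static IEnumerable<Item> FindDuplicates(IReadOnlyList<Item> items)
    {
        foreach (var a in items)
        {
            foreach (var b in items)
            {
                if (a.name != b.name && a.multival.Intersect(b.multival).Any())
                {
                    yield return a;
                    break; // report each item at most once
                }
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var items = new List<Item>
        {
            new Item { name = "Alpha", multival = new[] { "A", "B" } },
            new Item { name = "Bravo", multival = new[] { "C" } },
            new Item { name = "Charlie", multival = new[] { "A" } },
        };

        // Take(1) ends the enumeration after the first match.
        foreach (var item in DuplicateFinder.FindDuplicates(items).Take(1))
            Console.WriteLine(item.name); // prints "Alpha"
    }
}
```

This is still a pairwise comparison, so it does not change the cost of finding all duplicates; it only avoids unnecessary work when the caller wants a few.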
App-Devon
  • 2
    This was my initial approach, but I'm working with a large dataset and it was taking a long time to execute. In my experience a LINQ query is much faster. – Trevor Apr 23 '19 at 00:49
  • 1
    If the dataset is from a database, I would highly recommend doing this on the database end through a view or stored procedure. If not, then there are more optimal algorithms that you could use to accomplish this (I just went for the fast & dirty implementation). I believe that your assumption about LINQ's superior speed may not be correct, though (see this [question](https://stackoverflow.com/questions/3156059/is-a-linq-statement-faster-than-a-foreach-loop)) – App-Devon Apr 23 '19 at 02:14
  • 2
    One of the reasons to use LINQ is that you don't know what your caller wants to do with your sequence. If he only wants the FirstOrDefault, or only wants to Take(3), it would be a total waste to create the full list. So to make your method reusable, instead of only usable for this one and only use case, it is better to return an IEnumerable – Harald Coppoolse Apr 23 '19 at 19:46
  • @HaraldCoppoolse I edited my response to create an IEnumerable based on your feedback. Of course, if you wanted to stop after X matches, this for loop structure could be rewritten to be just as efficient as LINQ (I'll update my answer to illustrate how in just a second). – App-Devon Apr 23 '19 at 20:48
  • The OP posed this question because their query did not run fine. Whether it runs in your environment or not is irrelevant; whatever they are doing in theirs is not working. I still feel this is a better option, and the question is about how to find duplicates in List in C# at its base. @Enigmativity you could yourself post an answer saying "Nothing is wrong with the query. Problem solved." – App-Devon Apr 24 '19 at 04:56
0

Your query should actually "work," but it's not going to be very efficient if your list is particularly large. If you're having trouble compiling, check that you do not have any typos. If you're having problems at runtime, add some null checks on your variables and properties. The rest of this answer shows how you might use LINQ to make your query better.

Given the query you have attempted to write, I am going to infer that the following closely approximates the relevant parts of your class structure, though I'm using a different name for what you have as "multival."

class Foo 
{
    public string Name { get; set; }
    public string[] Attributes { get; set; }
}

And then given an object list looking roughly like this

var mylist = new List<Foo>
{
    new Foo { Name = "Alpha", Attributes = new[] { "A", "B", "C" } },
    new Foo { Name = "Bravo", Attributes = new[] { "D", "E", "F" } },
    new Foo { Name = "Charlie", Attributes = new[] { "G", "H", "A" } }
};

To find objects that match any other object on at least one attribute value, this is how I would approach it using LINQ:

var part1 = from item in mylist 
            from value in item.Attributes 
            select new { item, value };

var query = (from pairA in part1
            join pairB in part1 on pairA.value equals pairB.value
            where pairA.item.Name != pairB.item.Name
            select pairA.item)
            .Distinct(); // ToList() to materialize, as necessary

If you were to run that through your editor of choice and explore the contents of query, you would expect to see objects "Alpha" and "Charlie" based on the shared attribute of "A".

Your initial approach is effectively a nested foreach, comparing every object against every other object. The join-based approach should scale much better than that when the size of your initial list is significant (for example, 10,000 elements instead of 3).
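
To illustrate why matching on the attribute value scales better than comparing every pair, here is a hand-written sketch of the same idea using a dictionary as an index. It is not what the join above compiles to; it only mirrors the hash-based matching that LINQ to Objects uses for joins. The `SharedAttributeFinder` class and its method name are hypothetical, and the sketch assumes `Attributes` is never null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Foo
{
    public string Name { get; set; }
    public string[] Attributes { get; set; }
}

public static class SharedAttributeFinder
{
    // Returns the items that share at least one attribute value
    // with an item that has a different name.
    public static List<Foo> FindItemsSharingAttributes(IEnumerable<Foo> items)
    {
        var list = items.ToList();

        // Index: attribute value -> set of names of the items carrying it.
        var namesByValue = new Dictionary<string, HashSet<string>>();
        foreach (var item in list)
        {
            foreach (var value in item.Attributes)
            {
                if (!namesByValue.TryGetValue(value, out var names))
                {
                    names = new HashSet<string>();
                    namesByValue[value] = names;
                }
                names.Add(item.Name);
            }
        }

        // A value carried by two or more distinct names means every item
        // holding that value has a match under a different name.
        return list
            .Where(item => item.Attributes.Any(v => namesByValue[v].Count > 1))
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var mylist = new List<Foo>
        {
            new Foo { Name = "Alpha", Attributes = new[] { "A", "B", "C" } },
            new Foo { Name = "Bravo", Attributes = new[] { "D", "E", "F" } },
            new Foo { Name = "Charlie", Attributes = new[] { "G", "H", "A" } }
        };

        foreach (var item in SharedAttributeFinder.FindItemsSharingAttributes(mylist))
            Console.WriteLine(item.Name); // prints "Alpha", then "Charlie"
    }
}
```

Each attribute value is looked up in the dictionary instead of being compared against every other object's attributes, which is where the savings come from on large lists.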

Anthony Pegram
  • 1
    The OP's query runs fine. How does this answer help? – Enigmativity Apr 23 '19 at 23:55
  • 1
    Again, put a large number of items in the list and reevaluate. It's going to be *bad bad bad*. – Anthony Pegram Apr 23 '19 at 23:59
  • @Enigmativity, say you loaded the list with `for (int i = 0; i < 20000; i++) { var name = Guid.NewGuid().ToString(); mylist.Add(new Foo { Name = name, Attributes = new[] { name.Substring(0,4) } }); }` Run that using the original code versus using the LINQ SelectMany into a Join approach. It becomes obvious which approach you would want if the list was large. – Anthony Pegram Apr 24 '19 at 00:05
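
The comparison suggested in the comment above can be tried with a rough harness like the one below. It is a sketch rather than a proper benchmark: the `QueryComparison` class, its method names, and the list size of 2000 are assumptions chosen for illustration (the comment uses 20000), and `Stopwatch` timings of a single run are only indicative.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public class Foo
{
    public string Name { get; set; }
    public string[] Attributes { get; set; }
}

public static class QueryComparison
{
    // Test data as described in the comment: a GUID as the name and
    // its first four characters as the only attribute.
    public static List<Foo> BuildList(int count)
    {
        var mylist = new List<Foo>();
        for (int i = 0; i < count; i++)
        {
            var name = Guid.NewGuid().ToString();
            mylist.Add(new Foo { Name = name, Attributes = new[] { name.Substring(0, 4) } });
        }
        return mylist;
    }

    // The query from the question, adapted to the Foo property names.
    public static List<Foo> NestedAny(List<Foo> mylist)
    {
        return mylist
            .Where(o => mylist.Any(a => a.Name != o.Name
                                        && a.Attributes.Any(p => o.Attributes.Contains(p))))
            .ToList();
    }

    // The SelectMany-into-Join query from the answer.
    public static List<Foo> SelectManyJoin(List<Foo> mylist)
    {
        var part1 = from item in mylist
                    from value in item.Attributes
                    select new { item, value };

        return (from pairA in part1
                join pairB in part1 on pairA.value equals pairB.value
                where pairA.item.Name != pairB.item.Name
                select pairA.item)
               .Distinct()
               .ToList();
    }

    public static void Main()
    {
        var mylist = BuildList(2000); // assumed size; increase it to compare on larger lists

        var watch = Stopwatch.StartNew();
        var nested = NestedAny(mylist);
        Console.WriteLine($"Nested Any:      {nested.Count} matches in {watch.ElapsedMilliseconds} ms");

        watch.Restart();
        var joined = SelectManyJoin(mylist);
        Console.WriteLine($"SelectMany/Join: {joined.Count} matches in {watch.ElapsedMilliseconds} ms");
    }
}
```

Both methods should report the same number of matches for the same list; only the elapsed time is expected to differ.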