1

I am new to Linq and still trying to get familiar with it. I have below linq query to find the duplicates and it works perfectly like below:

//"MergedName" is the Data column I am querying and finding the duplicate Names.

 var duplicates = result.AsEnumerable()
            `.Select(dr => dr.Field<string("MergedName").Replace("'", "''"))
            .GroupBy(x => x)
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToList();

  foreach (string duplicate in duplicates.ToArray())
            {
                // Logic to keep one and delete another duplicate.
            }

Now, I want to find the Similar Names on the same column "MergedName". For e.g : John Smith and John Smith Jr. I wrote something with .Where clause but something is wrong with my syntax

 var duplicates = result.AsEnumerable()
                .Select(dr => dr.Field<string>("MergedName").Replace("'", "''"))
                .Where(C => C.Field<string>("MergedName").ToLower().IndexOf(C.Field<string>("MergedName").ToLower().Trim()) != 1)
                .Select(g => g.Key)
                .ToList();

foreach (string duplicate in duplicates.ToArray())
            {
               // Logic to keep one and delete another duplicate.
            }

Error : On Where Statement - "String does not contain a definition for 'Field' and the best extension method overload 'System.Data.DatarowExtensions.Field has some invalid arguments'.

Can you please help me with this code ? or any other way where I can find the similar names using contains.

2 Answers2

0

Its hard to say without you posting the type of original collection, but the problem seems to be exactly what is stated in your error message.

LINQ works in iterative steps and after call to Select(dr => dr.Field<string>("MergedName").Replace("'", "''")) you next expresion works with collection of strings. And there in no method .Field on String type

I guess you could try simplifying C.Field<string>("MergedName").ToLower()in your where clause to C.ToLower()

You are also not doing any grouping in your second statement so Select(g => g.Key) wont work, because there is no Key property on String type either.

This addresses only syntax, your Where clause still looks pretty weird. You are comparing each string to itself.

You could try something like

var names = result.AsEnumerable()
            .Select(dr => dr.Field<string("MergedName").Replace("'", "''").ToLower().Trim())
            .ToList(); //ToList not necessary here, but could prevent multiple executions of the expresion


var duplicates = names.Where(n => names.Any(m => n.IndexOf(m) != -1)) //quadratic complexity
            .ToList();

Just put your own condition into Any part of the last statement, you have two strings m and n there and can compare them however you wish.
This is definitely not the best performing solution of your problem, but it uses LINQ as in your questinon and is easy to write and understand.

After clarification:

var enumerableResult = result.AsEnumerable();
var duplicates = enumerableResult.
                .Where(dr => enumerableResult.Any(dr2 => /*your comparison*/)
                .ToList();

comparison could be something like:

dr.Field<string>("MergedName").Replace("'", "''").Trim().ToLower().IndexOf(dr2.Field<string>("MergedName").Replace("'", "''").Trim().ToLower()) != -1

This condition is based on the one in your question, not the one in your comment. But you dont need to use this inline sytax and can call some custom method, so it would look like .Any(dr2 => AreSamePerson(dr, dr2))

This has again quadratic complexity, it is problem only if you have a lot of records to compare.

Now you get collection of your person objects instead of just strings. Keep in mind that you cant just delete members of duplicate collection from original one but some rather complex logic is needed.

So the best solution seems to be:

var duplicates = result.AsEnumerable()
            .GroupBy(x => x, new PersonyComparer() )
            .Where(g => g.Count() > 1)

class PersonyComparer : IEqualityComparer<Person>//person is the type of objects that are in starting collection
    {
        public bool Equals(Person b1, Person b2)
        {
            if (b2 == null && b1 == null)
                return true;
            else if (b1 == null | b2 == null)
                return false;


            if(/*your condition*/)
                return true;
            else
                return false;
        }

        public int GetHashCode(Person bx)
        {
            return 0; //you must make sure that objects that are equal have same hashcode
        }
    }

This can lead to problems, so make sure that your equal function is symetric (if a==b then b == a) and transitive (if a==b and b==c then a==c). Otherwise your groupping could be screwed a bit.

Then you can iterate over the objects of duplicate collection

foreach(var pgroup in duplicates)
{
    foreach(var person in pgroup .Skip(1))
    {
        //remove from original collection
    }
}
Noxor
  • 61
  • 1
  • 7
  • Noxor - Thank you for the above solution, it helps in proper direction. I have two Names which is same person in the collection : "John Mat Smith" and another records is "John Matthew Smith", I know they are same person, so want to delete one of them. that's the reason, I was looking to use Contains instead of Any.Referred to this link : "https://stackoverflow.com/questions/23526773/what-is-the-difference-between-contains-and-any-in-linq". Is there a way I can first query the similar names and put them in the list ? Apologies for not clarifying this earlier. – Mukund Kulkarni Sep 23 '17 at 16:21
  • I guess contains use equal method of the object, so you could override it, so objects with similar names would return true. This however seems like a really bad idea. If you want to delete "duplicate" records from the original collection you shold try to keep collection of duplicate objects and not just of strings. See my edit of the answer. – Noxor Sep 23 '17 at 17:23
  • How do you know they are the same person. If you have some other property, like social security number, it would be better use this for groupping instead of name. – Noxor Sep 23 '17 at 18:25
  • As you mentioned we are comparing the SSNs. However, this code was written by someone else and the way it was written was they were grouping by Exact Names .Group by ( g=> g.count() > 1) to find duplicates, but then I found even with exact names, the SSNs were mismatching by few digits, so business realized this and they kept a threshold of digit mismatching upto 3 digits. Right now, when I get duplicate names, based on those duplicates, I compare each SSN with each other and remove 2 names and keep one something like latest date . – Mukund Kulkarni Sep 23 '17 at 23:54
  • 123456789 - John Smith, 122345667 - John Matt Smith, 123456778- John Matthew Smith, I have these records so there are differences in SSNs as well as Names, but I want to treat them as duplicates and remove 2 and keep one. I was thinking to play with Names first and then compare with SSNs, but I am ok with changing the approach completely,as you suggested by grouping using SSNs. I would prefer using linq based on existing solution, but I am ok to handle this in SQL or C# or any other way. Any ideas are definitely welcome. – Mukund Kulkarni Sep 23 '17 at 23:57
0

Let me show you by example why you shouldn't want this. As Noxor correctly stated, a viable approach would be to use an IEqualityComparer. But now the question is: what is equal? Your "contains equality" introduces a fuzziness you can't solve.

Let me explain this in the most basic way, forgetting case and string replacements. See this little Linqpad program:

void Main()
{
    var dt = new DataTable();
    dt.Columns.Add("MergedName", typeof(string));

    dt.Rows.Add("Abby Kelley Foster");
    dt.Rows.Add("Kelley Foster");
    dt.Rows.Add("Abby Kelley");

    dt.AsEnumerable()
        .Select(r => r.Field<string>("MergedName"))
        .GroupBy(s => s, new SubstringComparer())
        .Select(g => new { g.Key, Count = g.Count() })
        .Dump();

}

public class SubstringComparer : IEqualityComparer<string>
{
    public bool Equals(string left, string right)
    {
        return left.Contains(right) || right.Contains(left);
    }

    public int GetHashCode(string value)
    {
        return 0; // Just return 0; There is no hashing mechanism implemented that gives "Abby Kelley Foster" and "Abby Kelley" the same hashcode.
    }
}

What's the output? Right:

Abby Kelley Foster  3

But let's change the order of the data rows now:

    dt.Rows.Add("Abby Kelley");
    dt.Rows.Add("Kelley Foster");
    dt.Rows.Add("Abby Kelley Foster");

Can you deduct the output? Here it is:

Abby Kelley 1
Kelley Foster   2

What happened to Abby Kelley Foster?

The comparer first encountered the two first unequal rows, counted 1 for Abby Kelley and went on comparing Kelley Foster and Abby Kelley Foster: bingo! "Equal". However, at this point it never returns to the first row to compare it to the third row.

You could try a more sophisticated (but still simple) algorithm that compares all rows, but then you'll get

Abby Kelley Foster  3

Still wrong. Only Abby Kelley and Abby Kelley Foster are the same person. Kelley Foster is someone else entirely. In other words: you can't resolve this by any automated algorithm. Only exact equalities can be determined by a simple algorithm.

To hit this home with a contrived example: suppose one entry is nothing but "Jr.". Now all names with "Jr." will be seen as duplicates!

Gert Arnold
  • 105,341
  • 31
  • 202
  • 291
  • Gert Arnold - Agreed. The example you provided are the ones I have encountered and thought of in my data. – Mukund Kulkarni Sep 23 '17 at 23:50
  • Alright, you may have thought of it, but to me the only conclusion is that you can't go on like this. I didn't yet mention the fact that the first output (3) isn't right logically either, because it contains two names that wouldn't have been grouped when offered without the third. – Gert Arnold Sep 24 '17 at 07:52