
I have a list of 300k persons that contains some exact duplicates and, above all, some approximate duplicates.

E.g. (Id, LastName, FirstName, BirthDate):

  • 1 KENNEDY John 01/01/2000
  • 2 KENNEDY John Fitzgerald 01/01/2000

I would like to find these duplicates and treat them apart. I've found some examples with LINQ's GroupBy, but I cannot find a solution that handles these 2 subtleties:

  1. Match the FirstName with a StartsWith
  2. Keep the whole object entirely (not only the LastName via a Select new)

For the moment, I've got the following. It does the job, but it's very, very slow and I'm pretty sure it can be done better:

var dictionary = new Dictionary<int, List<Person>>();
int key = 1; // the Key could be a string built with LastName, first letters of FirstName... but finally this integer is enough
foreach (var c in ListPersons)
{
    List<Person> doubles = ListPersons
        .Where(x => x.Id != c.Id
        && x.LastName == c.LastName
        && (x.FirstName.StartsWith(c.FirstName) || c.FirstName.StartsWith(x.FirstName)) // cause dupe A could be "John" and B "John F". Or... dupe A could be "John F" and B "John"
        && x.BirthDate == c.BirthDate 
        ).ToList();

    if (doubles.Any())
    {
        doubles.Add(c); // add the current guy
        dictionary.Add(key++, doubles);
    }

    // Ugly hack to remove the doubles already found
    ListPersons = ListPersons.Except(doubles).ToList();
}

// Later I will read my dictionary and treat Value by Value, Person by Person (duplicate by duplicate)

Finally:

With the kind help below and an IEqualityComparer, here's the result:

// Speedup x1000!
var listDuplicates = ListPersons
    .GroupBy(x => x, new PersonComparer())
    .Where(g => g.Count() > 1) // I want to keep the duplicates
    .ToList();

// Then, I treat the duplicates in my own way using all properties of the Person I need
foreach (var listC in listDuplicates)
{
    foreach (Person c in listC)
    {
        // Some treatment
    }
}
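
For instance, the treatment could look like this (just a sketch of one hypothetical policy, not necessarily the one I use: keep the record with the lowest Id as the master and flag the others for deletion):

var toDelete = new List<Person>();
foreach (var listC in listDuplicates)
{
    // Hypothetical policy: the oldest record (lowest Id) is kept as the master
    Person master = listC.OrderBy(p => p.Id).First();
    toDelete.AddRange(listC.Where(p => p.Id != master.Id));
}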
Grimness
  • See this question for string comparison with a tolerance: [Comparing strings with tolerance](https://stackoverflow.com/q/2344320/880511). You can try and implement this in your solution. – Abbas Jun 15 '21 at 08:25
  • Good idea for a v2 with a deeper search :)! Thank you :) – Grimness Jun 15 '21 at 09:29

1 Answer


You can always build your own IEqualityComparer<T>:

public class PersonComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.LastName == y.LastName
            && x.BirthDate == y.BirthDate
            && x.FirstName != null && y.FirstName != null
            // dupe A could be "John" and B "John F", or the other way around
            && (x.FirstName.StartsWith(y.FirstName) || y.FirstName.StartsWith(x.FirstName));
    }

    public int GetHashCode(Person obj)
    {
        // FirstName is deliberately left out of the hash: persons whose FirstNames
        // only prefix-match must still land in the same hash bucket, otherwise
        // Equals would never be called for them.
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + (obj?.LastName?.GetHashCode() ?? 0);
            hash = hash * 23 + (obj?.BirthDate.GetHashCode() ?? 0);
            return hash;
        }
    }
}
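
To make the behaviour concrete, here is a minimal sanity check (the Person class below is an assumption, reduced to the four columns shown in the question) demonstrating that the two KENNEDY rows end up in the same group:

using System;
using System.Collections.Generic;
using System.Linq;

public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var ListPersons = new List<Person>
        {
            new Person { Id = 1, LastName = "KENNEDY", FirstName = "John", BirthDate = new DateTime(2000, 1, 1) },
            new Person { Id = 2, LastName = "KENNEDY", FirstName = "John Fitzgerald", BirthDate = new DateTime(2000, 1, 1) },
            new Person { Id = 3, LastName = "SMITH", FirstName = "Anna", BirthDate = new DateTime(1990, 5, 5) }
        };

        // Uses the PersonComparer defined above
        var duplicates = ListPersons
            .GroupBy(x => x, new PersonComparer())
            .Where(g => g.Count() > 1);

        foreach (var g in duplicates)
            Console.WriteLine(string.Join(", ", g.Select(p => p.Id))); // prints: 1, 2
    }
}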

If you just want to keep the first occurrence and remove the other duplicates:

ListPersons = ListPersons
    .GroupBy(x => x, new PersonComparer())
    .Select(g => g.First())
    .ToList();
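
The GroupBy variant also makes it easy to decide which duplicate survives: replace First() with whatever ordering expresses your preference. For example (one possible policy, not the only one), keep the entry with the longest FirstName, assuming it is the most complete record:

ListPersons = ListPersons
    .GroupBy(x => x, new PersonComparer())
    .Select(g => g.OrderByDescending(p => p.FirstName?.Length ?? 0).First()) // "John Fitzgerald" wins over "John"
    .ToList();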

You can use this comparer with many other LINQ methods, or even with a Dictionary or HashSet<T>. For example, you could also remove duplicates this way:

HashSet<Person> persons = new HashSet<Person>(ListPersons, new PersonComparer());
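
And as a sketch of the dictionary case: since the comparer hashes only LastName and BirthDate, a Dictionary<Person, List<Person>> built with it buckets the prefix-matching entries together:

var buckets = new Dictionary<Person, List<Person>>(new PersonComparer());
foreach (var p in ListPersons)
{
    if (!buckets.TryGetValue(p, out var group))
        buckets.Add(p, group = new List<Person>());
    group.Add(p);
}
// Buckets whose list has more than one entry hold the approximate duplicates
var duplicateGroups = buckets.Values.Where(g => g.Count > 1).ToList();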

Another way, with pure LINQ:

ListPersons = ListPersons.Distinct(new PersonComparer()).ToList();
Tim Schmelter
  • Thank you very much! Indeed, it's way more efficient! :) I'll complete my original post with the final code. – Grimness Jun 15 '21 at 09:25
  • @Grimness: I guess that the HashSet approach is the most efficient, or similar: `ListPersons.Distinct(new PersonComparer())`. Added it to the answer. The `GroupBy` has the advantage that you can easily add logic for which duplicate you want to keep: use `g.OrderBy(logic).First()`. – Tim Schmelter Jun 15 '21 at 09:29
  • Maybe you've seen my edit of the original post since then: indeed, I want to "keep" the duplicates :). I have no doubt Distinct is more efficient, but the GroupBy is already so efficient that it's enough for me :). It can help others anyway! Thanks again. – Grimness Jun 15 '21 at 09:51