I have a list of 300k persons containing some exact duplicates and, above all, some approximate duplicates.
E.g.: Id LastName FirstName BirthDate
- 1 KENNEDY John 01/01/2000
- 2 KENNEDY John Fitzgerald 01/01/2000
I would like to find these duplicates and handle them separately. I've found some examples using LINQ's GroupBy, but I cannot find a solution that covers these two subtleties:
- Match the FirstName with a StartsWith
- Keep the whole Person object (not only the LastName projected with a select new)
For the moment, I've got the following. It does the job, but it's very slow, and I'm pretty sure it can be done more cleanly:
var dictionary = new Dictionary<int, List<Person>>();
int key = 1; // the key could be a string built from LastName, the first letters of FirstName... but this integer is enough
foreach (var c in ListPersons)
{
    List<Person> doubles = ListPersons
        .Where(x => x.Id != c.Id
            && x.LastName == c.LastName
            // dupe A could be "John" and B "John F", or A could be "John F" and B "John"
            && (x.FirstName.StartsWith(c.FirstName) || c.FirstName.StartsWith(x.FirstName))
            && x.BirthDate == c.BirthDate
        ).ToList();
    if (doubles.Any())
    {
        doubles.Add(c); // add the current person
        dictionary.Add(key++, doubles);
    }
    // Ugly hack to remove the doubles already found
    ListPersons = ListPersons.Except(doubles).ToList();
}
// Later I will read my dictionary and process it Value by Value, Person by Person (duplicate by duplicate)
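The loop above rescans the whole list for every person, so with 300k entries it does on the order of 300k × 300k comparisons. A minimal sketch of one way to avoid that, assuming a `Person` class shaped like the example rows (the class definition, `DuplicateFinder`, and `FindDuplicateGroups` are hypothetical names, not from the original code): bucket on the fields that must match exactly, then run the StartsWith check only inside each small bucket. It also assumes FirstName is never null and uses ordinal comparison, whereas the loop above uses the culture-sensitive default.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of Person, inferred from the example rows.
public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

public static class DuplicateFinder
{
    // True when one first name is a prefix of the other ("John" vs "John Fitzgerald").
    static bool FirstNamesMatch(Person a, Person b) =>
        a.FirstName.StartsWith(b.FirstName, StringComparison.Ordinal) ||
        b.FirstName.StartsWith(a.FirstName, StringComparison.Ordinal);

    public static List<List<Person>> FindDuplicateGroups(IEnumerable<Person> persons)
    {
        var result = new List<List<Person>>();

        // Bucket on the exact-match fields; the groups still hold whole Person objects.
        foreach (var bucket in persons.GroupBy(p => (p.LastName, p.BirthDate)))
        {
            // Shortest first name first, so a prefix is seen before the longer names it matches.
            var remaining = bucket.OrderBy(p => p.FirstName.Length).ToList();
            while (remaining.Count > 0)
            {
                var head = remaining[0];
                // The head matches itself, so it is always part of its own group.
                var group = remaining.Where(p => FirstNamesMatch(head, p)).ToList();
                remaining = remaining.Except(group).ToList();
                if (group.Count > 1)
                    result.Add(group); // keep only groups that contain duplicates
            }
        }
        return result;
    }
}
```

Calling `DuplicateFinder.FindDuplicateGroups(ListPersons)` would return one list per set of suspected duplicates, each holding complete `Person` objects. The quadratic work is now confined to people sharing the same LastName and BirthDate, which should be a handful at most.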
Finally:
With the kind help below and an IEqualityComparer:
// Speed-up x1000!
var listDuplicates = ListPersons
    .GroupBy(x => x, new PersonComparer())
    .Where(g => g.Count() > 1) // keep only the groups that contain duplicates
    .ToList();
// Then I process the duplicates in my own way, using whichever Person properties I need
foreach (var listC in listDuplicates)
{
    foreach (Person c in listC)
    {
        // Some processing
    }
}
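The `PersonComparer` used above is not shown here. A minimal sketch of what such a comparer might look like, assuming a `Person` class shaped like the example rows and a non-null FirstName (both the class and this comparer body are guesses for illustration, not the code from the answer):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of Person, inferred from the example rows.
public class Person
{
    public int Id { get; set; }
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public DateTime BirthDate { get; set; }
}

public class PersonComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;

        // Same last name and birth date, and one first name is a prefix of the other.
        return x.LastName == y.LastName
            && x.BirthDate == y.BirthDate
            && (x.FirstName.StartsWith(y.FirstName, StringComparison.Ordinal)
                || y.FirstName.StartsWith(x.FirstName, StringComparison.Ordinal));
    }

    // Two persons that Equals treats as equal must return the same hash,
    // so FirstName is left out: "John" and "John Fitzgerald" would hash differently.
    public int GetHashCode(Person p)
    {
        if (p is null) return 0;
        unchecked
        {
            return ((p.LastName?.GetHashCode() ?? 0) * 397) ^ p.BirthDate.GetHashCode();
        }
    }
}
```

The speed-up most likely comes from GroupBy hashing each person once and comparing only persons whose hashes collide, instead of scanning the full list every time. One caveat with this kind of comparer: prefix matching is not transitive ("John" matches both "John F" and "John G", which do not match each other), so in such cases the group a person ends up in can depend on the order of the input.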