I have a simple PowerShell script I would like to port to C# to see if I can get better performance. The script takes a large CSV file (200,000+ lines) and looks for duplicates based on the FirstName, LastName, and DOB properties, then exports those rows to another CSV file for review.
This is the PowerShell code:
$CSV = Import-Csv $SourceFile
$CSV | Group-Object -Property FirstName, LastName, DOB |
    Where-Object { $_.Count -ge 2 } |
    ForEach-Object { $_.Group } |
    Export-Csv $DestinationFile -NoTypeInformation
This works perfectly, but against a large file it takes 2+ hours on my machine and pegs a CPU core the whole time. I'm self-taught in C# and not sure where to start with this one. From my research, I think LINQ is the answer here, but I'm not familiar with it at all.
I'd appreciate any help pointing me in the right direction. Thanks in advance!
EDIT:
I think I'm on the right path, but I can't figure this part out. I've got the CSV importing into a List of a custom type with these fields (a rough sketch of the type and my loading code follows the list):
string Acct;
string FirstName;
string LastName;
string DOB;
string ID;
string OpenDate;
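In case it's relevant, here is roughly how I define the type and load the file. This is a simplified sketch: it splits on commas, which assumes none of my fields contain quoted commas, and it assumes the columns appear in the order listed above.

using System.Collections.Generic;
using System.IO;
using System.Linq;

public class Record
{
    public string Acct { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string DOB { get; set; }
    public string ID { get; set; }
    public string OpenDate { get; set; }
}

// Inside Main; SourceFile holds the path to the input CSV.
// Naive parse: skip the header row and split each line on commas
// (assumes no quoted fields containing embedded commas).
List<Record> CSV = File.ReadLines(SourceFile)
    .Skip(1)
    .Select(line => line.Split(','))
    .Select(f => new Record
    {
        Acct = f[0],
        FirstName = f[1],
        LastName = f[2],
        DOB = f[3],
        ID = f[4],
        OpenDate = f[5]
    })
    .ToList();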
So the following code finds the duplicates, but it only returns the properties I'm grouping by, one entry per group, rather than every matching record (including the first occurrence):
var duplicates = CSV
.GroupBy(i => new { i.FirstName, i.LastName, i.DOB })
.Where(g => g.Count() > 1)
.Select(g => g.Key);
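From what I've read, I'm guessing I need to flatten each group back into its records instead of selecting just the key, maybe with SelectMany. Something like this (untested, and I don't know if it performs well):

var duplicates = CSV
    .GroupBy(i => new { i.FirstName, i.LastName, i.DOB })
    .Where(g => g.Count() > 1)
    .SelectMany(g => g); // flatten: keep every record in each duplicate group, including the first

If that's right, I assume I can then loop over duplicates and write each record's fields out with a StreamWriter. Is this the correct direction?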