
I have a simple PowerShell script I would like to move to C# to see if I can get better performance. The script takes a large CSV file (200,000+ lines) and looks for duplicates based on the FirstName, LastName, and DOB properties. It then exports the findings to another CSV file for review.

This is the code in PowerShell.

    $CSV = Import-Csv $SourceFile
    $CSV |
        Group-Object -Property FirstName, LastName, DOB |
        Where-Object { $_.Count -ge 2 } |
        ForEach-Object { $_.Group } |
        Export-Csv $DestinationFile -NoTypeInformation

This works perfectly. However, when run against a large file it takes 2+ hours on my machine and maxes out a CPU core. I'm self-taught in C# and not sure where to start with this one. From my research I think LINQ would be the answer here, but I'm not familiar with it at all.

I'd appreciate any help pointing me in the right direction. Thanks in advance!

EDIT:

I think I'm on the right path, but I can't figure out this last part. I've got the CSV importing into a `List` of a custom type with these fields:

    string Acct;
    string FirstName;
    string LastName;
    string DOB;
    string ID;
    string OpenDate;

The following code finds the duplicates, but it only returns the properties I'm grouping by, not the full records, and I don't get every item in each duplicate set back, including the original first one.

    var duplicates = CSV
        .GroupBy(i => new { i.FirstName, i.LastName, i.DOB })
        .Where(g => g.Count() > 1)
        .Select(g => g.Key);
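
Per Gert Arnold's comment below, replacing the final `Select(g => g.Key)` with `SelectMany(g => g)` flattens each duplicate group back into its full records, the original first item included. A minimal sketch, assuming the custom type above is named `Person` (the question doesn't give its name):

    // CSV is the List<Person> loaded from the source file; "Person" is a
    // placeholder name for the custom type with the fields listed above.
    var duplicates = CSV
        .GroupBy(p => new { p.FirstName, p.LastName, p.DOB })
        .Where(g => g.Count() > 1)   // keep only groups with 2+ matching records
        .SelectMany(g => g);         // flatten the groups back into the records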
– AtomicFireball
  • Welcome to Stack Overflow. Please read [ask]. As currently stated, this question reads as _"Please show me the code to do X"_, which is too broad and it's unclear with what part exactly you need help. If you want to learn to import CSV in C#, read [Parsing CSV files in C#, with header](http://stackoverflow.com/questions/2081418/parsing-csv-files-in-c-with-header). If you then want to group the resulting records, see [MSDN: How to: Group Query Results (C# Programming Guide)](https://msdn.microsoft.com/en-us/library/bb545971.aspx). – CodeCaster Oct 19 '16 at 12:54
  • If you're just looking to do it in C#, and compare performance, I would start with a C# console application which reads the whole file into your data structures (Person with FirstName, LastName, DOB fields). After all the data is read in, you can definitely use LINQ to check for duplicates etc. and do whatever you want with the results. – GEEF Oct 19 '16 at 12:54
  • Just use `SelectMany(g => g)` and you'll be done technically. Whether you've got the *best* solution is debatable though. You could possibly gain performance by parallelization, using dictionaries, or better file readers, but that's beyond the present scope of your question. – Gert Arnold Oct 19 '16 at 19:34
  • Thank you Gert! That worked. That knocked out a 2 hour powershell process down to less than 10 seconds. I think that's pretty good. I'm sure there's a ton of ways to do this very thing. Thanks again! – AtomicFireball Oct 19 '16 at 19:51
  • You can probably speed it up 10 times further by [reading the CSV file efficiently](https://www.dotnetperls.com/textfieldparser) and processing it while importing to reduce the number of iterations. – wOxxOm Oct 19 '16 at 20:33
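
Putting the last two comments together: stream the file with `Microsoft.VisualBasic.FileIO.TextFieldParser` (usable from C# by referencing the Microsoft.VisualBasic assembly) and bucket records into a `Dictionary` during the read, so the file is only walked once. A rough sketch under those assumptions; the path, class name, and column order (taken from the field list above) are placeholders:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.VisualBasic.FileIO; // reference the Microsoft.VisualBasic assembly

    class Person
    {
        public string Acct, FirstName, LastName, DOB, ID, OpenDate;
    }

    class FindDuplicates
    {
        static void Main()
        {
            // Bucket rows by (FirstName, LastName, DOB) while the file streams in.
            // Tuple<> compares by value, so it works as a dictionary key.
            var groups = new Dictionary<Tuple<string, string, string>, List<Person>>();

            // Path is a placeholder; columns are assumed to match the field list above.
            using (var parser = new TextFieldParser(@"C:\temp\source.csv"))
            {
                parser.SetDelimiters(",");
                parser.HasFieldsEnclosedInQuotes = true;
                parser.ReadFields(); // skip the header row

                while (!parser.EndOfData)
                {
                    string[] f = parser.ReadFields();
                    var p = new Person
                    {
                        Acct = f[0], FirstName = f[1], LastName = f[2],
                        DOB = f[3], ID = f[4], OpenDate = f[5]
                    };

                    var key = Tuple.Create(p.FirstName, p.LastName, p.DOB);
                    List<Person> bucket;
                    if (!groups.TryGetValue(key, out bucket))
                    {
                        bucket = new List<Person>();
                        groups[key] = bucket;
                    }
                    bucket.Add(p);
                }
            }

            // Every record belonging to a duplicate set, first items included.
            var duplicates = groups.Values.Where(g => g.Count > 1).SelectMany(g => g);
            Console.WriteLine(duplicates.Count()); // write out with a CSV writer of choice
        }
    }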
