I have a large file (currently CSV) with roughly 10 million rows. I need to read through the file and remove duplicate entries based on multiple columns.
An example line of data would look something like:
ComputerName, IPAddress, MacAddress, CurrentDate, FirstSeenDate
I want to check MacAddress and ComputerName for duplicates, and when a duplicate is found, keep only the entry with the oldest FirstSeenDate.
I have read the CSV into a variable using Import-Csv and then processed it with Group-Object, Sort-Object, etc., but it's horribly slow:
```powershell
$data | Group-Object -Property ComputerName, MacAddress | ForEach-Object { $_.Group | Sort-Object -Property FirstSeenDate | Select-Object -First 1 }
```
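For context, the complete pipeline looks roughly like this (inventory.csv and deduplicated.csv are placeholder names; the [datetime] cast assumes FirstSeenDate parses as a date, since sorting date strings alphabetically only gives the oldest entry if they happen to be in a sortable format):

```powershell
# Placeholder paths; same Group-Object approach as above.
$data = Import-Csv -Path 'C:\temp\inventory.csv'

$data |
    Group-Object -Property ComputerName, MacAddress |
    ForEach-Object {
        # Within each duplicate group, keep the row with the oldest FirstSeenDate.
        $_.Group |
            Sort-Object -Property { [datetime]$_.FirstSeenDate } |
            Select-Object -First 1
    } |
    Export-Csv -Path 'C:\temp\deduplicated.csv' -NoTypeInformation
```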
I am thinking I could use System.IO.StreamReader to read the CSV line by line, building a unique collection based on array -contains logic.
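Something like the sketch below is what I have in mind, except I've swapped the array -contains check for a hashtable keyed on ComputerName + MacAddress, since a hashtable lookup is constant time while -contains scans the whole array on every line. The path, column order, and the assumption that fields contain no quoted or embedded commas are all placeholders:

```powershell
# Rough sketch only: assumes inventory.csv has the header shown above, fields
# contain no embedded commas, and FirstSeenDate parses with [datetime].
$reader = [System.IO.StreamReader]::new('C:\temp\inventory.csv')
$best   = @{}                      # "ComputerName|MacAddress" -> oldest row seen so far
$header = $reader.ReadLine()       # read the header line once

while (-not $reader.EndOfStream) {
    $line = $reader.ReadLine()
    $cols = $line.Split(',')

    # Columns: 0=ComputerName 1=IPAddress 2=MacAddress 3=CurrentDate 4=FirstSeenDate
    $key       = $cols[0].Trim() + '|' + $cols[2].Trim()
    $firstSeen = [datetime]$cols[4].Trim()

    # Keep the row only if this key hasn't been seen yet,
    # or if its FirstSeenDate is older than the one already kept.
    if (-not $best.ContainsKey($key) -or $firstSeen -lt $best[$key].Date) {
        $best[$key] = @{ Date = $firstSeen; Line = $line }
    }
}
$reader.Close()

# Write the header plus the surviving rows back out.
@($header) + ($best.Values | ForEach-Object { $_.Line }) |
    Set-Content -Path 'C:\temp\deduplicated.csv'
```

This keeps only one row per unique key in memory rather than all 10 million parsed objects, so I'd expect it to scale better, but I'd like a sanity check on the approach.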
Thoughts?