
I have a large file (currently CSV) with 10 million rows. I need to read through the file and remove duplicate items based on multiple columns.

Example line of data would look something like:

ComputerName, IPAddress, MacAddress, CurrentDate, FirstSeenDate

I would want to check MacAddress and ComputerName for duplicates and, if a duplicate is discovered, keep the entry with the oldest FirstSeenDate.

I have read the CSV into a variable using Import-Csv and then processed the variable using Sort-Object, etc., but it's horribly slow.

$data | Group-Object -Property ComputerName,MacAddress | ForEach-Object{$_.Group | Sort-Object -Property FirstSeenDate | Select-Object -First 1}

I am thinking I could use a StreamReader to read the CSV line by line, building a unique array based on array-contains logic.

Thoughts?

BDubs

2 Answers


I would probably use Python if performance were a major concern. Or LogParser.

However, if I had to use PowerShell, I would probably try something like this:

$CultureInfo = [CultureInfo]::InvariantCulture
$DateFormat = 'M/d/yyyy' # Use whatever date format is appropriate

# We need to convert the strings that represent dates. You can skip the ParseExact() calls if the dates are already in a string sortable format (e.g., yyyy-MM-dd).
$Data = Import-Csv $InputFile |
    Select-Object -Property ComputerName, IPAddress, MacAddress,
        @{n = 'CurrentDate';   e = { [DateTime]::ParseExact($_.CurrentDate, $DateFormat, $CultureInfo) }},
        @{n = 'FirstSeenDate'; e = { [DateTime]::ParseExact($_.FirstSeenDate, $DateFormat, $CultureInfo) }}

$Results = @{}
foreach ($Record in $Data) {
    $Key = $Record.ComputerName + ';' + $Record.MacAddress
    if (!$Results.ContainsKey($Key)) {
        $Results[$Key] = $Record
    }
    elseif ($Record.FirstSeenDate -lt $Results[$Key].FirstSeenDate) {
        $Results[$Key] = $Record
    }
}

$Results.Values | Sort-Object -Property ComputerName, MacAddress | Export-Csv $OutputFile -NoTypeInformation

That may very well be faster: Group-Object is often a bottleneck, even though it is quite powerful.

If you really want to try using a stream reader, try using the Microsoft.VisualBasic.FileIO.TextFieldParser class, which is part of the .NET Framework in spite of its slightly misleading name. You can access it by running Add-Type -AssemblyName Microsoft.VisualBasic.
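For reference, a minimal sketch of that approach. The column positions (0 = ComputerName, 2 = MacAddress, 4 = FirstSeenDate) are assumptions based on the example header, and it reuses the $InputFile, $DateFormat, and $CultureInfo variables from the script above:

```powershell
Add-Type -AssemblyName Microsoft.VisualBasic

# Parse the CSV field-by-field instead of materializing all 10M objects at once
$Parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($InputFile)
$Parser.TextFieldType = [Microsoft.VisualBasic.FileIO.FieldType]::Delimited
$Parser.SetDelimiters(',')

$null = $Parser.ReadFields()   # skip the header row

$Results = @{}
while (-not $Parser.EndOfData) {
    $Fields = $Parser.ReadFields()
    $Key = $Fields[0] + ';' + $Fields[2]   # ComputerName;MacAddress
    $FirstSeen = [DateTime]::ParseExact($Fields[4], $DateFormat, $CultureInfo)

    # Keep only the record with the oldest FirstSeenDate for each key
    if (!$Results.ContainsKey($Key) -or $FirstSeen -lt $Results[$Key].FirstSeen) {
        $Results[$Key] = [PSCustomObject]@{ Fields = $Fields; FirstSeen = $FirstSeen }
    }
}
$Parser.Close()
```

This keeps only one object per unique key in memory, so it should scale better than loading the whole file.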

Bacon Bits

You could import the data into a database (e.g. SQLite) and then query:

SELECT 
  MIN(FirstSeenDate) AS FirstSeenDate, 
  ComputerName, 
  IPAddress, 
  MacAddress
FROM importedData
GROUP BY ComputerName, IPAddress, MacAddress
Luuk
  • This won't be what OP is asking for if the computer's IP address has changed. You'll have to use a window function like ROW_NUMBER() or RANK(), a correlated subquery, or a self-join to get what he's looking for. – Bacon Bits Jun 02 '19 at 20:46
  • Not only can the IP address change; the ComputerName and [MAC address](https://stackoverflow.com/questions/22310464/how-to-spoof-mac-address-via-code) can also change – Luuk Jun 04 '19 at 06:05
  • They're not concerned about that. That's why they said "I would want to check MacAddress and ComputerName for duplicates". They want to know the first instance that a computer name is seen with a given MAC address, so the key fields are the host name and the MAC. The IP address might change, but it's not what they're looking for. – Bacon Bits Jun 04 '19 at 18:30
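The window-function approach mentioned in the comments could be sketched as follows (reusing the importedData table and column names from the answer above; ROW_NUMBER() requires SQLite 3.25 or later):

```sql
-- Keep, for each ComputerName + MacAddress pair, the single row with the
-- oldest FirstSeenDate, even if the IPAddress changed between rows.
SELECT ComputerName, IPAddress, MacAddress, CurrentDate, FirstSeenDate
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY ComputerName, MacAddress
           ORDER BY FirstSeenDate
         ) AS rn
  FROM importedData
) t
WHERE rn = 1;
```

Unlike the GROUP BY query, this returns the whole original row (including the IPAddress and CurrentDate that accompanied the oldest FirstSeenDate) rather than a mix of aggregated and grouped columns.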