PowerShell: compare 2 large CSV files to find users that don't exist in one of them

Question

I have 2 csv files with ~10,000 users each. I need to count how many users appear in csv1 and not in csv2. At the moment I have the code below. However I'm aware that this is probably extremely inefficient as it is potentially looping through up to 10,000 users 10,000 times. The code takes forever to run and I'm sure there must be a more efficient way. Any help or suggestions are appreciated I am fairly new to Powershell

foreach ($csv1User in $csv1) {
        $found = $false
        foreach ($csv2User in $csv2) {
            if ($csv1User.identifier -eq $csv2User.identifier)
            {
                $found = $true
                break
            }
        }
        if ($found -ne $true){
            $count++
        }
    }

Your question isn't clear in what your input looks like and what you actually expect for output. Anyways, you might want to stream one csv file (to save memory) and index the other (using a hash table). In case you do not want to reinvent the wheel, you might try this [`Join-Object`](https://www.powershellgallery.com/packages/Join) cmdlet (see also: [what's the best way to join two tables into one?](https://stackoverflow.com/a/45483110/1701026)): `Import-Csv .\csv1.csv | FullJoin (Import-Csv .\csv2.csv) -On identifier -Discern csv1,csv2 | Export-Csv .\Overview.csv` — iRon, Nov 12 '20 at 07:37

Mathias R. Jessen · Accepted Answer · 2020-11-10T14:25:12.030

If you replace your nested loops with 2 HashSet's, you'll have two ways of calculating the exception between the two:

Using `SymmetricExceptWith()`

The HashSet<T>.SymmetricExceptWith() function allows us to calculate the subset of terms that exist in either collection but not in both:

# Create hashset from one list
$userIDs = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)

# Pass the other list to `SymmetricExceptWith`
$userIDs.SymmetricExceptWith([string[]]$csv2.identifier)

# Now we have an efficient filter!
$relevantRecords = @($csv1;$csv2) |Where-Object { $userIDs.Contains($_.identifier) } |Sort-Object -Unique identifier

Using a sets to track duplicates

Similarly we can use hash sets to keep track of which terms that have been observed at least once, and which ones has been seen more than once:

# Create sets for tracking
$seenOnce = [System.Collections.Generic.HashSet[string]]::new()
$seenTwice = [System.Collections.Generic.HashSet[string]]::new()

# Loop through whole superset of records
foreach($record in @($csv1;$csv2)){
  # Always attempt to add to the $seenOnce set
  if(!$seenOnce.Add($record.identifier)){
    # We've already seen this identifier once, add it to $seenTwice
    [void]$seenTwice.Add($record.identifier)
  }
}

# Just like the previous example, we now have an efficient filter!
$relevantRecords = @($csv1;$csv2) |Where-Object { $seenOnce.Contains($_.identifier) -and -not $seenTwice.Contains($_.identifier) } |Sort-Object -Unique identifier

Using a hash table as a grouping construct

You could also use a dictionary type (like a [hashtable] for example) to group records from both csv files based on their identifier, and then filter on number of record values in each dictionary entry:

# Groups records on their identifier value
$groupsById = @{}
foreach($record in @($csv1;$csv2)){
  if(-not $groupsById.ContainsKey($record.identifier)){
    $groupsById[$record.identifier] = @()
  }
  $groupsById[$record.identifier] += $record
}

# Filter based on number of records with a distinct identifier
$relevantRecords = $groupsById.GetEnumerator() |Where-Object { $_.Value.Count -eq 1 } |Select-Object -Expand Value

score 0 · Answer 2 · answered Nov 10 '20 at 14:14

If you're just looking for the count then this should be much faster.

$csv2 = Import-Csv $csvfile2

Import-Csv $csvfile1 |
    Where-Object identifier -in $csv2.identifier |
        Measure-Object | Select-Object -ExpandProperty Count

Here's a small example

$csvfile1 = New-TemporaryFile
$csvfile2 = New-TemporaryFile

@'
identifier
bob
sally
john
sue
'@ | Set-Content $csvfile1 -Encoding UTF8

@'
identifier
bill
sally
john
stan
'@ | Set-Content $csvfile2 -Encoding UTF8

$csv2 = Import-Csv $csvfile2

Import-Csv $csvfile1 |
    Where-Object identifier -in $csv2.identifier |
        Measure-Object | Select-Object -ExpandProperty Count

Output is simply

PowerShell: compare 2 large CSV files to find users that don't exist in one of them

2 Answers2

Using SymmetricExceptWith()

Using a sets to track duplicates

Using a hash table as a grouping construct

Using `SymmetricExceptWith()`