1

I have two arrays that I would like to take the difference between. I had some success with COMPARE-OBJECT, but is too slow for larger arrays. In this example $ALLVALUES and $ODD are my two arrays.

I used to be able to do this efficiently using FINDSTR ex. FINDSTR /V /G:ODD.txt ALLVALUES.txt > EVEN.txt FINDSTR finished this in under 2 seconds for 110,000 elements. (even had to read and write from the disk)

I'm trying to get back to the FINDSTR performance where it would give me everything in ALLVALUES.txt that did NOT match ODD.txt (giving me the EVEN values in this case)

NOTE: This question is not about ODD or EVEN, only a practical example that can be quickly and visually verified that it is working as desired.

Here is the code that I have been playing with. Using COMPARE-OBJECT, 100,000 took like 200 seconds vs 2 seconds for FINDSTR on my computer. I'm thinking there is a much more elegant way to do this in PowerShell. Thanks for your help.

# -------  Build the MAIN array
$MIN = 1
$MAX = 100000
$PREFIX = "AA"

$ALLVALUES = while ($MIN -le $MAX) 
{
   "$PREFIX{0:D6}" -f $MIN++
}


# -------  Build the ODD values from the MAIN array
$MIN = 1
$MAX = 100000
$PREFIX = "AA"

$ODD = while ($MIN -le $MAX) 
{
   If ($MIN%2) {
      "$PREFIX{0:D6}" -f $MIN++
   }
  ELSE {
    $MIN++
   }
}

Measure-Command{$EVEN = Compare-Object -DifferenceObject $ODD -ReferenceObject $ALLVALUES -PassThru}
BEEBUG
  • 35
  • 1
  • 4

1 Answers1

6

The arrays are objects, not just simple blobs of text that findstr processes.
The fastest diff of string arrays is .NET3.5+ HashSet.SymmetricExceptWith.

$diff = [Collections.Generic.HashSet[string]]$a
$diff.SymmetricExceptWith([Collections.Generic.HashSet[string]]$b)
$diffArray = [string[]]$diff

46 ms for 100k elements on i7 CPU using your data.

The above code omits duplicate values so if those are needed in the output, I think we'll have to use a much much slower manual enumeration.

function Diff-Array($a, $b, [switch]$unique) {
    if ($unique.IsPresent) {
        $diff = [Collections.Generic.HashSet[string]]$a
        $diff.SymmetricExceptWith([Collections.Generic.HashSet[string]]$b)
        return [string[]]$diff
    }
    $occurrences = @{}
    foreach ($_ in $a) { $occurrences[$_]++ }
    foreach ($_ in $b) { $occurrences[$_]-- }
    foreach ($_ in $occurrences.GetEnumerator()) {
        $cnt = [Math]::Abs($_.value)
        while ($cnt--) { $_.key }
    }
}

Usage:

$diffArray = Diff-Array $ALLVALUES $ODD

340 ms, 8x slower than hashset but 110x faster than Compare-Object!

And lastly, we can make a faster Compare-Object for arrays of strings/numbers:

function Compare-StringArray($a, $b, [switch]$unsorted) {
    $occurrences = if ($unsorted.IsPresent) { @{} }
                   else { [Collections.Generic.SortedDictionary[string,int]]::new() }
    foreach ($_ in $a) { $occurrences[$_]++ }
    foreach ($_ in $b) { $occurrences[$_]-- }
    foreach ($_ in $occurrences.GetEnumerator()) {
        $cnt = $_.value
        if ($cnt) {
            $diff = [PSCustomObject]@{
                InputObject = $_.key
                SideIndicator = if ($cnt -lt 0) { '=>' } else { '<=' }
            }
            $cnt = [Math]::Abs($cnt)
            while ($cnt--) {
                $diff
            }
        }
    }
}

100k elements: 20-28x faster than Compare-Object, completes in 2100ms / 1460ms (unsorted)
10k elements: 2-3x faster, completes in 210ms / 162ms (unsorted)

wOxxOm
  • 65,848
  • 11
  • 132
  • 136
  • awesome solution, got processed 1.4M vs 600k strings in 30 seconds, in my case I used IntersectWith because I wanted the intersection – Kat Lim Ruiz May 31 '19 at 14:45