2

I have a powershell script which uses Compare-Object to diff/compare a list of MD5 checksum's against each-other ... how can I speed this up? its been running for hours!

$diffmd5 = 
(Compare-Object -ReferenceObject $localmd5 -DifferenceObject $remotefilehash |
 Where-Object { ($_.SideIndicator -eq '=>') } | 
Select-Object -ExpandProperty InputObject)
js2010
  • 23,033
  • 6
  • 64
  • 66
Tony
  • 8,681
  • 7
  • 36
  • 55

3 Answers3

5

Compare-Object is convenient, but indeed slow; also avoiding the pipeline altogether is important for maximizing performance.

I suggest using a [System.Collections.Generic.HashSet[T] instance, which supports high-performance lookups in a set of unordered[1] values:[2]

# Two sample arrays
$localmd5 = 'Foo1', 'Bar1', 'Baz1'
$remotefilehash = 'foo1', 'bar1', 'bar2', 'baz1', 'more'

# Create a hash set from the local hashes.
# Make lookups case-*insensitive*.
# Note: Strongly typing the input array ([string[]]) is a must.
$localHashSet = [System.Collections.Generic.HashSet[string]]::new(
  [string[]] $localmd5,
  [System.StringComparer]::OrdinalIgnoreCase
)

# Loop over all remote hashes to find those not among the local hashes.
$remotefilehash.Where({ -not $localHashSet.Contains($_) })

The above yields collection 'bar2', 'more'.

Note that if case-sensitive lookups are sufficient, which is the default (for string elements), a simple cast is sufficient to construct the hash set:

$localHashSet = [System.Collections.Generic.HashSet[string]] $localmd5

Note: Your later feedback states that $remotefilehash is a hashtable(-like) collection of key-value pairs rather than a collection of mere file-hash strings, in which the keys store the hash strings. In that case:

  • To find just the differing hash strings (note the .Keys property access to get the array of key values):

    $remotefilehash.Keys.Where({ -not $localHashSet.Contains($_) })
    
  • To find those key-value pairs whose keys are not in the hash set (note the .GetEnumerator() call to enumerate all entries (key-value pairs)):

    $remotefilehash.GetEnumerator().Where({ -not $localHashSet.Contains($_.Key) })
    

Alternatively, if the input collections are (a) of the same size and (b) have corresponding elements (that is, element 1 from one collection should be compared to element 1 from the other, and so on), using Compare-Object with -SyncWindow 0, as shown in js2010's helpful answer, with subsequent .SideIndicator filtering may be an option; to speed up the operation, the -PassThru switch should be used, which forgoes wrapping the differing objects in [pscustomobject] instances (the .SideIndicator property is then added as a NoteProperty member directly to the differing objects).


[1] There is a related type for maintaining sorted values, System.Collections.Generic.SortedSet[T], but - as of .NET 6 - no built-in type for maintaining values in input order, though you can create your own type by deriving from [System.Collections.ObjectModel.KeyedCollection[TKey, TItem]]

[2] Note that a hash set - unlike a hash table - has no values associated with its entries. A hash set is "all keys", if you will - all it supports is testing for the presence of a key == value.

mklement0
  • 382,024
  • 64
  • 607
  • 775
  • If I may ask, is this what is called a method override? You're replacing the default `contains` method with a new case insensitive method? Is it possible to do the same with a new `generic.list`? – Santiago Squarzon May 23 '21 at 21:38
  • 1
    @SantiagoSquarzon: _override_ refers to class _declarations_, when a derived class (subclass) redefines a method it would otherwise inherit from its base class. _Overload_ refers to variations of a constructor / method with a given name that differ by their set of parameters. The static `::new()` method that PowerShell exposes on types is simply a way to expose that type's _constructors_, to compensate for the fact that PowerShell has no `new` keyword (though calling constructors with the `New-Object` _cmdlet_ is an alternative). In the case at hand, a _constructor overload_ is being called. – mklement0 May 23 '21 at 22:17
2

By default, compare-object compares every element in the first array with every element in the second array (up to about 2 billion positions), so the order doesn't matter, but large lists would be very slow. -syncwindow 0 would be much faster but would require matches to be in the same exact positions:

Compare-Object $localmd5 $remotefilehash -syncwindow 0

As a simple demo of -syncwindow:

compare-object 1,2,3 1,3,2 -SyncWindow 0 # not equal

InputObject SideIndicator
----------- -------------
          3 =>
          2 <=
          2 =>
          3 <=


compare-object 1,2,3 1,3,2 -SyncWindow 1  # equal

compare-object 1,2,3 1,2,3 -SyncWindow 0  # equal
js2010
  • 23,033
  • 6
  • 64
  • 66
0

I feel this should be faster than Compare-Object

$result = [system.collections.generic.list[string]]::new()

foreach($hash in $remotefilehash)
{
    if(-not($localmd5.contains($hash)))
    {
        $result.Add($hash)
    }
}

The problem here is that .contains method is case sensitive, I believe all MD5 hashes have uppercase letters but if this was not the case you would need to call the .toupper() or .tolower() methods to normalize the arrays.

Santiago Squarzon
  • 41,465
  • 5
  • 14
  • 37
  • If case-insensitivity is important, you can use the `-contains` _operator_. – mklement0 May 23 '21 at 18:43
  • @mklement0 yes, I was thinking about that, but do you think the operator will be faster than using the method and normalizing? if the operator is more efficient then `-notin` is the best one for this I think but in my experience it is very slow compared with `.contains` – Santiago Squarzon May 23 '21 at 18:46
  • 1
    Yes, `-contains` is slower than `.Contains()`, but that is partly unavoidable, given the extra effort needed for case-_insensitive_ comparison. A direct comparison would be to `-ccontains`, but it too is slower than `.Contains()`, so for case-_sensitive_ comparison you're right: `.Contains()` is better. However, since you can't pass a case-insensitive _comparer_ to `.Contains()`, you'd have to case-normalize _both_ the collection and the string to look for, and that will typically be slower than `-contains`. Note that there's also `-notcontains`, just like there's `-notin`. – mklement0 May 23 '21 at 19:04
  • @mklement0 clearly a hashset with `[System.StringComparer]::OrdinalIgnoreCase` as in your answer is the most efficient and best option. – Santiago Squarzon May 23 '21 at 19:13
  • 1
    I think so, yes, but still good to know that `.Contains()` beats `-ccontains` (though in many cases that won't matter). – mklement0 May 23 '21 at 19:18