Powershell: How to merge two arrays based on property

Question

My first array is list of pet owners and their mobile numbers etc. Second array is pets with the information of owner name. Pet list is roughly 1k and owner list is 3k.

What is the fastest way to add the owner information into the pets information? Currently my script takes almost a minute to run which seems a bit too much.

foreach ($pet in $pets) {
    $owner = $owners | Where-Object { $_.name -eq $pet.owner }
    if ($owner) {
        $pet | Add-Member -MemberType NoteProperty -Name "Owner" -Value $owner.name    }
}

So, your code works, but you want to know of a "*faster*" way to do this? — Abraham Zinala, Apr 25 '22 at 12:22
Based on your code, I assume $pets and $owners are arrays of either PSObject or PSCustomObject. I'm also guessing they are each loaded from CSV files. Based on past experience, I'm guessing performance would be greatly increased if you loaded each from CSV into hash tables. As you can see, I'm having to make assumptions about your data in order to guess how to improve performance. To correctly answer your question, we need to replace assumptions with facts. And to prevent problems, we probably need details on how data is searched elsewhere in code. — Darin, Apr 25 '22 at 12:49
To expand upon previous comment, you can create a hash table of owner keys with each value being another hash table or PSObject or PSCustomObject. So, maybe instead of creating an array called $owners, you create a hash table called $owners where each object is stored with a key being the owner's name. Getting the owner object would be something like `$owner=$owners[$pet.owner]`. — Darin, Apr 25 '22 at 14:30

mklement0 · Accepted Answer · 2022-04-25T20:51:06.233

Doing a linear lookup in the owners array for every pet is what makes your approach fundamentally slow: for every pet, 3,000 owner objects must be searched through, resulting in 1,000 x 3,000 = 3 million lookups.

Performance additionally suffers from incidental implementation choices you made:

As powerful and elegant as the PowerShell pipeline is, its one by one streaming is usually noticeably slower than iterating over an array in an expression / language statement, such as a foreach loop.
Additionally, as of PowerShell 7.2.2, the Where-Object and ForEach-Object cmdlets are inefficiently implemented, which adds additional overhead - see GitHub issue #10982 and this answer.
- A functional limitation of Where-Object that additionally worsens performance is that there is no way to stop enumerating once a match has been found; that is, the input is invariably processed in full, and all matches are output.
- By contrast, the analogous .Where() array method does offer a way to stop processing once the first match is found (e.g.,
  (1..10).Where({ $_ -ge 5 }, 'First'). Potentially bringing the same functionality to the Where-Object cmdlet is the subject of GitHub issue #13834.

Therefore, you have two choices:

(A) Pragmatic solution: stick with the fundamentally inefficient approach, but improve implementation efficiency so that the resulting performance may be good enough:
- The (A) solution below is substantially faster than your original, Where-Object-based approach, about 33 to 38 times, depending on the PowerShell edition used; see the next section for benchmarks.
(B) Proper, scalable, much better-performing, but more complex solution: Use an aux. data structure that enables efficient lookup of owner objects by name, such as a hashtable, as suggested by Darin.
- The (B) solution below is about 7 to 13 times faster than the (A) solution, depending on the PowerShell edition used, and therefore ca. 260 to 420 (!) times faster than the Where-Object solution; see the next section for benchmarks.

Note:

In the code below, I've modified your example so that a property other than the owner name is added as a new property to each pet object (.Address, as .OwnerAddress), given that the owner name is present to begin with.
Also, the -MemberType NoteProperty, -Name and -Value parameter names in the Add-Member call are omitted for brevity (they are implied).

Solution (A): Replace the Where-Object pipeline with an (inner) foreach statement:

# Create 1000 sample pets and 3000 sample owners.
$pets = foreach ($i in 1..1000) { [pscustomobject] @{ Name = "Pet $i"; Owner = 'Owner {0}' -f (6 * $i) } }
$owners = foreach ($i in 1..3000) { [pscustomobject] @{ Name = "Owner $i"; Address = "Address $i" } }

foreach ($pet in $pets) { 
  # Perform the lookup more efficiently via an inner `foreach` loop.
  $owner = foreach ($o in $owners) { if ($o.Name -eq $pet.Owner) { $o; break } }
  if ($owner) {
    Add-Member -InputObject $pet OwnerAddress $owner.Address
  }
}

Solution (B): Create a hashtable that maps owner names to owner objects, for efficient lookup:

# Create 1000 sample pets and 3000 sample owners.
$pets = foreach ($i in 1..1000) { [pscustomobject] @{ Name = "Pet $i"; Owner = 'Owner {0}' -f (6 * $i) } }
$owners = foreach ($i in 1..3000) { [pscustomobject] @{ Name = "Owner $i"; Address = "Address $i" } }

# Create a hashtable that maps owner names to owner objects,
# for efficient lookup by name.
$ownerMap = @{}; foreach ($owner in $owners) { $ownerMap[$owner.Name] = $owner }

foreach ($pet in $pets) { 
  # Look up the pet's owner in the owner map (hashtable); returns $null if not found.
  $owner = $ownerMap[$pet.Owner]
  if ($owner) {
    Add-Member -InputObject $pet OwnerAddress $owner.Address
  }
}

Benchmarks

Below are sample timings comparing the three approaches, averaged over 10 runs.
Timing commands is never an exact science in PowerShell, and performance varies based on many factors, not least the hardware with respect to absolute times, but the results below provide a sense of relative performance, as reflected in the Factor output column: 1.00 denotes the fastest command, listed first, with the slower ones expressed as multiples of it, in descending order of speed.
The bottom sections contains the source code, allowing you to run these benchmarks yourself
- Caveat: With the given collection sizes, these benchmarks run for quite a while (up to 10 minutes and more), mostly due to how slow the Where-Object solution is.
- For best results, run the benchmarks while your machine isn't (too) busy doing other things.
Note how performance seems to have improved substantially overall in the cross-platform PowerShell (Core) edition, relative to Windows PowerShell.

Windows PowerShell 5.1 on Windows 10:

Factor Secs (10-run avg.) Command
------ ------------------ -------
1.00   0.234              # Hashtable-assisted lookups....
6.85   1.605              # Nested foreach statements...
261.95 61.353             # Pipeline with Where-Object...

PowerShell (Core) 7.2.2 on Windows 10:

Factor Secs (10-run avg.) Command
------ ------------------ -------
1.00   0.096              # Hashtable-assisted lookups.…
12.70  1.216              # Nested foreach statements…
424.40 40.624             # Pipeline with Where-Object…

Benchmark source code:

The following benchmark code uses function Time-Command from this Gist.
Unless already present, you are prompted to automatically download and define this function in your session. (I can personally assure you that doing that is safe, but you should always check the source code yourself.)

# Download and define function `Time-Command` on demand (will prompt).
# To be safe, inspect the source code at the specified URL first.
if (-not (Get-Command -ErrorAction Ignore Time-Command)) {
  $gistUrl = 'https://gist.github.com/mklement0/9e1f13978620b09ab2d15da5535d1b27/raw/Time-Command.ps1'
  if ((Read-Host "`n====`n  OK to download and define benchmark function ``Time-Command```n  from Gist ${gistUrl}?`n=====`n(y/n)?").Trim() -notin 'y', 'yes') { Write-Warning 'Aborted.'; exit 2 }
  Invoke-RestMethod $gistUrl | Invoke-Expression
  if (-not ${function:Time-Command}) { exit 2 }
}

# Define the collection sizes
$petCount = 1000
$ownerCount = 3000

# Define a sample owners array.
$owners = foreach ($i in 1..$ownerCount) { [pscustomobject] @{ Name = "Owner $i"; Address = "Address $i" } }

# Define a script block that creates a sample pets array.
# Note: We use a script block, because the array must be re-created
#       for each run, since the pet objects get modified.
$petGenerator = {
  foreach ($i in 1..$petCount) { [pscustomobject] @{ Name = "Pet $i"; Owner = 'Owner {0}' -f (6 * $i) } }
}

# Define script blocks with the commands to time.
$commands = @(
  { # Nested foreach statements
    $pets = & $petGenerator
    foreach ($pet in $pets) { 
      $owner = foreach ($o in $owners) { if ($o.Name -eq $pet.Owner) { $o; break } }
      if ($owner) {
        Add-Member -ea stop -InputObject $pet OwnerAddress $owner.Address
      }
    }
  },
  { # Pipeline with Where-Object
    $pets = & $petGenerator
    foreach ($pet in $pets) { 
      $found = $owners | Where-Object { $_.name -eq $pet.Owner }
      if ($found) {
        Add-Member -InputObject $pet OwnerAddress $owner.Address
      }
    }
  },
  { # Hashtable-assisted lookups.
    $pets = & $petGenerator
    $ownerMap = @{}; foreach ($owner in $owners) { $ownerMap[$owner.Name] = $owner }
    foreach ($pet in $pets) { 
      $owner = $ownerMap[$pet.Owner]
      if ($owner) {
        Add-Member -InputObject $pet OwnerAddress $owner.Address
      }
    }
  }
)

Write-Verbose -Verbose 'Running benchmarks...'

# Average 10 runs.
# Add -OutputToHost to print script-block output, if desired.
Time-Command -Count 10 $commands

Thanks for the detailed explanation! And thanks for pointing out the crazy performance difference of (B) solution! Related to pipeline you mention "its one by one streaming is usually noticeably slower than iterating over an array in an expression / language statement", taking note of the "usually", there are 2 exceptions that I know of. A pipeline with multiple pipe segments, each having properly designed functions, will run each section in parallel. The other is the Split-Pipeline module, which is great when each item requires lengthy processing. — Darin, Apr 26 '22 at 02:01
I'm glad to hear it was helpful, @Darin. Note that commands in a multi-segment pipeline don't run in parallel: they are started in parallel up front, but then objects being streamed are handed off one at a time from one command to the next. This is generally slower than iterating via a language statement or via an operator. I'm not familiar with `Split-Pipeline`, but in case it refers to true parallel processing a la `ForEach-Object -Parallel` in v7+: yes, parallel processing can (but doesn't always) speed up things, but that's a separate aspect. — mklement0, Apr 26 '22 at 05:15
Split-Pipeline works in PS 5.1, appears to work in v7+. It is a C# project (https://github.com/nightroman/SplitPipeline). In a test, I piped I think it was 35 machine names into Split-Pipeline that called a script block, limited to 10 parallel. The code was taking an average of around 3 minutes per machine to remote query hardware and return a PSCustomObject with the info. The test was under 14 minutes, but I didn't actually notice it when finished, so not sure the exact time. I was just excited at how it actually was returning results on the pipeline, and so quickly. — Darin, Apr 26 '22 at 11:25
@Darin, looks like a neat project that at its core does what `ForEach-Object -Parallel` does in v7+, but with some extra features, notably the ability to produce output in input order. As an aside: if the inputs are computer names, you can usually pass them directly as an array to the `-ComputerName` parameter of `Get-CimInstance` and `Invoke-Command`. — mklement0, Apr 26 '22 at 14:03
I need to compare 2 Import-CSV objects, which have two keys that might match. I was wondering how to loop through both objects, but you made it clear for me. For each key I will make a a hash for both objects, next the comparison will be easy. — Walter A, Mar 06 '23 at 10:45

Powershell: How to merge two arrays based on property

1 Answers1

Benchmarks