
I'm processing large amounts of data and after pulling the data and manipulating it, I have the results stored in memory in a variable.

I now need to separate this data into separate variables. That was easily done by piping through Where-Object, but it has slowed down now that I have much more data (1 million plus items): it takes about 5+ minutes.

$DCEntries = $DNSQueries | ? {$_.ClientIP -in $DCs.ipv4address -Or $_.ClientIP -eq '127.0.0.1'}
$NonDCEntries = $DNSQueries | ? {$_.ClientIP -notin $DCs.ipv4address -And $_.ClientIP -ne '127.0.0.1'} 

#Note: 
#$DCs is an array of 60 objects of type Microsoft.ActiveDirectory.Management.ADDomainController, with two properties:  Name, ipv4address
#$DNSQueries is a collection of pscustomobjects that has 6 properties, all strings.

I immediately realized I'm enumerating $DNSQueries (the large collection) twice, which is obviously costing me some time. So I decided to go about this a different way, enumerating it once and using a Switch statement, but this seems to have dramatically INCREASED the run time, which is not what I was going for.

$DNSQueries | ForEach-Object {
    Switch ($_) {
        {$_.ClientIP -in $DCs.ipv4address -Or $_.ClientIP -eq '127.0.0.1'} {
            # Query is from a DC
            $DCEntries += $_
        }
        default {
            # Query is not from DC
            $NonDCEntries += $_
        }
    }
}

I'm wondering if someone can explain to me why the second code takes so much more time. Further, perhaps offer a better way to accomplish what I want.

Is ForEach-Object and/or the appending to the sub-variables costing that much time?

    I presume you have defined `$DCEntries` and `$NonDCEntries` as `@( )` in your second code snippet ? – Santiago Squarzon Nov 15 '22 at 21:11
  • See: [Why should I avoid using the increase assignment operator (+=) to create a collection](https://stackoverflow.com/a/60708579/1701026). – iRon Nov 16 '22 at 08:46
  • You might be able to squeeze some performance from `$_.ClientIP -in $DCs.ipv4address` using a hashset: `$IPs = [System.Collections.Generic.HashSet[String]]$DCs.ipv4address` and in the condition: `$IPs.Contains($_.ClientIP)`. – iRon Nov 16 '22 at 08:46
  • As an aside, if you have a "*1 million plus members*", you might reconsider storing everything in memory, knowing that PowerShell objects are optimized for streaming (and therefore quite heavy). Besides, your (local) condition could probably be done in the time you import (and export) the objects (even if they come from disk), meaning: **the performance of a complete (PowerShell) solution is supposed to be better than the sum of its parts**, see also: [Fastest Way to get a uniquely index item from the property of an array](https://stackoverflow.com/a/59437162/1701026). – iRon Nov 16 '22 at 08:46
  • @iRon, I'm not sure I understand your statement. From my perspective I am doing everything in memory. I have a variable with 1m+ items. Isn't that 'in memory'? Further I need to take actions on that data based on differing criteria as needed, which I'd normally do with a simple pipe ( using where-object ). But doing that requires iterating through 1m+ items every time my criteria changes, so I'm trying to break the overall dataset down into smaller chunks. I guess I could attempt this during the data import itself, but I'm not sure I want to do that as the chunks could change. – Matthew McDonald Nov 17 '22 at 18:19
  • Sorry, what I meant is that instead of doing everything in memory, you should consider to use the [PowerShell Pipeline](https://learn.microsoft.com/powershell/module/microsoft.powershell.core/about/about_pipelines), something like: ` | ForEach-Object { Switch ... } | ` meaning [**One-at-a-time processing**](https://learn.microsoft.com/powershell/module/microsoft.powershell.core/about/about_pipelines#one-at-a-time-processing) and which saves a lot of memory and might be as fast. See also https://stackoverflow.com/a/58357033/1701026 – iRon Nov 17 '22 at 19:07
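To illustrate the streaming approach from the comment above, a rough sketch (the CSV file names and the `Import-Csv`/`Export-Csv` usage are hypothetical stand-ins for wherever the data actually comes from and goes):

```powershell
# Hypothetical sketch: classify each record as it streams through the pipeline,
# so the full 1M+ item collection is never materialized in memory at once.
Import-Csv .\DNSQueries.csv | ForEach-Object {
    switch ($_) {
        { $_.ClientIP -eq '127.0.0.1' -or $_.ClientIP -in $DCs.IPv4Address } {
            # Record came from a DC (or loopback)
            $_ | Export-Csv .\DCEntries.csv -Append -NoTypeInformation
            break
        }
        default {
            # Record came from elsewhere
            $_ | Export-Csv .\NonDCEntries.csv -Append -NoTypeInformation
        }
    }
}
```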

1 Answer


ForEach-Object is actually the slowest way to enumerate a collection, and on top of that you have a switch with a script-block condition, which adds even more overhead.

If the collection is already in memory, nothing can beat a foreach loop for linear enumeration.

As for your biggest problem: the use of += to add items to an array. Arrays are collections of a fixed size, so PowerShell has to create a new array and copy all existing items each time a new item is added; this is very inefficient. See this answer as well as this awesome documentation for more details.

In this case you can combine a List<T> with PowerShell's explicit assignment.

$NonDCEntries = [Collections.Generic.List[object]]::new()

$DCEntries = foreach($item in $DNSQueries) {
    if($item.ClientIP -eq '127.0.0.1' -or $item.ClientIP -in $DCs.IPv4Address) {
        $item
        continue
    }

    $NonDCEntries.Add($item)
}
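A variant combining this with the HashSet lookup suggested in the comments, for O(1) membership tests instead of scanning the 60-element array per record (a sketch; assumes the IPv4Address values are strings):

```powershell
# Build a set of DC IPs once, including loopback
$dcIPs = [System.Collections.Generic.HashSet[string]]::new(
    [string[]] $DCs.IPv4Address)
$null = $dcIPs.Add('127.0.0.1')

$NonDCEntries = [System.Collections.Generic.List[object]]::new()

$DCEntries = foreach ($item in $DNSQueries) {
    if ($dcIPs.Contains($item.ClientIP)) {
        # Query is from a DC: emit it, letting PowerShell collect the output
        $item
        continue
    }
    $NonDCEntries.Add($item)
}
```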

To put into perspective how badly += to an array scales, here is a performance test comparing PowerShell explicit assignment from a loop, adding to a List<T>, and adding to an array with +=.

$tests = @{
    'PowerShell Explicit Assignment' = {
        param($count)

        $result = foreach($i in 1..$count) {
            $i
        }
    }
    '.Add(..) to List<T>' = {
        param($count)

        $result = [Collections.Generic.List[int]]::new()
        foreach($i in 1..$count) {
            $result.Add($i)
        }
    }
    '+= Operator to Array' = {
        param($count)

        $result = @()
        foreach($i in 1..$count) {
            $result += $i
        }
    }
}

5000, 10000, 25000, 50000, 75000, 100000 | ForEach-Object {
    $groupresult = foreach($test in $tests.GetEnumerator()) {
        $totalms = (Measure-Command { & $test.Value -Count $_ }).TotalMilliseconds

        [pscustomobject]@{
            CollectionSize    = $_
            Test              = $test.Key
            TotalMilliseconds = [math]::Round($totalms, 2)
        }

        [GC]::Collect()
        [GC]::WaitForPendingFinalizers()
    }

    $groupresult = $groupresult | Sort-Object TotalMilliseconds
    $groupresult | Select-Object *, @{
        Name       = 'RelativeSpeed'
        Expression = {
            $relativespeed = $_.TotalMilliseconds / $groupresult[0].TotalMilliseconds
            [math]::Round($relativespeed, 2).ToString() + 'x'
        }
    }
}

Below are the test results:

CollectionSize Test                           TotalMilliseconds RelativeSpeed
-------------- ----                           ----------------- -------------
          5000 PowerShell Explicit Assignment              0.56 1x
          5000 .Add(..) to List<T>                         7.56 13.5x
          5000 += Operator to Array                     1357.74 2424.54x
         10000 PowerShell Explicit Assignment              0.77 1x
         10000 .Add(..) to List<T>                        18.20 23.64x
         10000 += Operator to Array                     5411.23 7027.57x
         25000 PowerShell Explicit Assignment              1.39 1x
         25000 .Add(..) to List<T>                        47.14 33.91x
         25000 += Operator to Array                    26168.67 18826.38x
         50000 PowerShell Explicit Assignment              3.49 1x
         50000 .Add(..) to List<T>                        97.38 27.9x
         50000 += Operator to Array                   129537.09 37116.64x
         75000 PowerShell Explicit Assignment             14.59 1x
         75000 .Add(..) to List<T>                       243.47 16.69x
         75000 += Operator to Array                   247419.68 16958.17x
        100000 PowerShell Explicit Assignment             14.85 1x
        100000 .Add(..) to List<T>                       177.13 11.93x
        100000 += Operator to Array                   473824.71 31907.39x
Santiago Squarzon
  • Won't $DCEntries and $NonDCEntries be of different type here? Also why is assigning $DCEntries as the result of a foreach faster than the +=? What is happening there? – Matthew McDonald Nov 15 '22 at 21:36
  • @MatthewMcDonald for the first question, they will be `object[]` and ``List`1``, though I can't see why that would matter; each element of the array will still retain its type, and if it did matter, the List has a `.ToArray()` method. As for why letting PowerShell capture the output from the loop is faster than `+=`, I believe I have already explained in the answer why `+=` is so slow; behind the scenes I would guess PowerShell uses a `List` to capture its output and then calls `.ToArray()` when it's done enumerating, but I haven't seen the source code for that. – Santiago Squarzon Nov 15 '22 at 21:53
  • Yeah I just ran across the following which also shared some of that. https://powershell.one/tricks/performance/arrays. I'm curious now, am I to understand that while Powershell automatically creates the array (and quickly) when the variable is assigned using multiple objects being returned, but there's no way to emulate that same functionality using standard arrays? **EDIT:** I sent this before seeing your last reply. Perhaps they are just calling .toarray() on some other object by default. Obv I could emulate that as well. So we don't know for sure how they accomplish this? – Matthew McDonald Nov 15 '22 at 21:56
  • @MatthewMcDonald if you don't know how many elements the collection will have, there is no way of doing it efficiently with an array. If you do know you could use `$array = [array]::CreateInstance([object], $length)` and then `$array[0] = X`, `$array[1] = Y` and so on.. and arguably this will be faster than `+=` but slower than letting PowerShell do what it was coded to do – Santiago Squarzon Nov 15 '22 at 22:00
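A minimal sketch of the pre-sized array approach from the comment above (only viable when the final count is known before the loop):

```powershell
# Allocate the full array once; index assignment never reallocates or copies
$length = $DNSQueries.Count
$array  = [array]::CreateInstance([object], $length)
for ($i = 0; $i -lt $length; $i++) {
    $array[$i] = $DNSQueries[$i]
}
```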
  • I get what you're saying. I am only now curious how Powershell does it by default during a piped scenario. – Matthew McDonald Nov 15 '22 at 22:02
  • @MatthewMcDonald Can't really tell for sure, likely a List and calling `.ToArray` in the end as stated before but I don't know since I didnt inspect the source code for this. If you want to dive deep it is available for you here https://github.com/PowerShell/PowerShell – Santiago Squarzon Nov 15 '22 at 22:03
  • Thanks. On a side note, your test results seem strange to me. Why are the 10000 and 100000 tests so much faster than 1000 on the add to list operation? I noticed a similar issue in a previous edit but with explicit assignment. – Matthew McDonald Nov 15 '22 at 22:09
  • 1
    OH forgot to add... Your version of the code takes a little over 1 min vs. the original 5. Thank you. – Matthew McDonald Nov 15 '22 at 22:17
  • @MatthewMcDonald I believe that is due to [JIT Compilation](https://learn.microsoft.com/en-us/powershell/scripting/dev-cross-plat/performance/script-authoring-considerations?view=powershell-7.3#jit-compilation) – Santiago Squarzon Nov 15 '22 at 22:17