
I have a large data array (an Invoke-Sqlcmd export, stored in an array variable); let's say it contains all of our 'customers'. I also have a large data list (local search results, stored in a string variable); let's say it contains all of our 'paying customers'. I want to create a final array by subtracting the people in the list from the array, leaving me with 'customers who have not paid yet'. The code below does exactly this but runs very slowly; I'm hoping to educate myself on a quicker way.

The size of the first array is around 25,000 items and the size of the list is around 20,000. Because the list is just a plain text variable, I first split it on newlines so the ForEach can iterate over it (otherwise it registers as one object). The code I'm using to do this is:

$NewArray = $FirstArray

ForEach ($Customer In $List)
{
    $NewArray = $NewArray | ? {$_.CustomerID -ne $Customer}
}
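
The newline-splitting step mentioned above can be sketched like this (`$ListText` is an assumed name for the raw string variable):

# Split the raw text into one customer ID per line, dropping any blank entries
# ($ListText is a placeholder for the plain-text search results variable)
$List = $ListText -split "`r?`n" | Where-Object { $_ -ne '' }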

Any help greatly appreciated, thanks!

  • Have a look at [`Compare-Object`](https://learn.microsoft.com/powershell/module/microsoft.powershell.utility/compare-object). While it'll perform better than a loop in PowerShell code, it's no speed demon either. – mklement0 Jul 26 '22 at 16:22
  • Using this [`Join-Object script`](https://www.powershellgallery.com/packages/Join)/[`Join-Object Module`](https://www.powershellgallery.com/packages/JoinModule) (see also: [In Powershell, what's the best way to join two tables into one?](https://stackoverflow.com/a/45483110/1701026)): `$List |OuterJoin $FirstArray -On CustomerId` – iRon Jul 26 '22 at 16:35
  • Good suggestions. Another way is to use a `Hashtable`, which is a great topic to learn in any case. Hint: `Hashtable` has a `Remove(Key)` method, which provides a fast search-and-remove operation. – zett42 Jul 26 '22 at 16:45
  • How about `$list.Where{$_ -notin $firstArray.CustomerID}`? – Abraham Zinala Jul 26 '22 at 18:21
  • @AbrahamZinala This would probably be even worse than the OP's code, as it creates a temporary array of all `CustomerID` properties from `$firstArray` for *each* element of `$list`. Then it linearly searches that array for a matching `CustomerID`, also for *each* element of `$list`. – zett42 Jul 26 '22 at 21:02
  • @zett42, so the better approach would be to use `$firstArray.CustomerID.Where{$_ -notin $list}`? `-notin` replaces the foreach statement in this case. In all honesty, I got confused about what they were asking so I threw out a suggestion :) – Abraham Zinala Jul 26 '22 at 21:05
  • @AbrahamZinala Better, but `-notin` still uses linear search. That can only be improved using a data structure optimized for fast lookup, such as a hashtable or a hashset. You could also sort the input array by `CustomerID` and then do a binary search, but that is still slower than a hashtable lookup for such a big array. – zett42 Jul 26 '22 at 21:12
  • @zett42, ahh wait, I stand corrected. Nonetheless, restructuring the list into an array isn't efficient for the OP's immediate use. Perhaps from here on out it's the way forward. – Abraham Zinala Jul 26 '22 at 21:23
  • Thanks all, I've taken all suggestions on board and will give it some testing with the response provided by MClayton below, really appreciate it, great work all. – RhysPickett Jul 27 '22 at 09:15
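
For reference, a hedged sketch of the `Compare-Object` approach suggested in the comments (assuming `$List` already holds the paid customer IDs, one per element):

# Items that appear only on the reference side ('<=') are customers not in the paid list
$unpaidIds = Compare-Object -ReferenceObject $FirstArray.CustomerID -DifferenceObject $List |
    Where-Object { $_.SideIndicator -eq '<=' } |
    ForEach-Object { $_.InputObject }

Note this yields the IDs themselves rather than the full customer objects.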

1 Answer

I've done a bit of performance testing, and by far and away the fastest approach is @zett42's suggestion of converting the customer list into a hashtable.

Test Data

First, here's a little snippet to set up some test data. It has some unrealistic properties like being sorted by id, with the first 80% being paid customers and the last 20% unpaid, but I don't think that will affect the results much.

# build some test data
$count = 25000

$allCustomers = @(
    1..$count | foreach-object {
        [pscustomobject] [ordered] @{
            "CustomerID" = $_
        }
    }
);

$paidIds = [int[]] @( 1..($count * 0.8) )

Original Approach

Here's the OP's approach as a baseline:

Measure-Command {
    $NewArray = @( $allCustomers );
    foreach( $paidId In $paidIds )
    {
        $NewArray = $NewArray | where-object { $_.CustomerID -ne $paidId }
    }
}

# TotalMilliseconds : 37732.532

Hashtable

And here's @zett42's suggestion of using a hashtable to do fast removal of customers by CustomerId when we find them in the paid list:

Measure-Command {

    # convert customers into a hashtable for fast removal by id
    $lookups = @{};
    foreach( $customer in $allCustomers )
    {
        $lookups.Add($customer.CustomerId, $customer)
    }

    # remove all the paid customer ids
    foreach( $paidId In $paidIds )
    {
        $lookups.Remove($paidId)
    }

    # get the remaining unpaid customers
    $NewArray = @( $lookups.Values )

}

# TotalMilliseconds : 53.9363

-notin

@AbrahamZinala's suggestion of `-notin` also works pretty well if you want a quick one-liner and don't need the raw speed of the hashtable approach.

Measure-Command {
    $NewArray = $allCustomers | where-object { $_.CustomerId -notin $paidIds };
}

# TotalMilliseconds : 698.2812

The hashtable approach scales better with larger datasets, so if you don't mind the little bit of extra setup that might be the one to go for...
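
As a footnote, @zett42 also mentioned hashsets in the comments. A sketch of that variant (assuming integer `CustomerID` values, as in the test data above) keeps the one-pass filter shape of the `-notin` approach but with O(1) membership tests:

# Build a HashSet of paid ids for fast lookups, then filter the customer array once
$paidSet = [System.Collections.Generic.HashSet[int]]::new([int[]] $paidIds)
$NewArray = @( $allCustomers | Where-Object { -not $paidSet.Contains($_.CustomerID) } )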

mclayton
  • Thanks for this thorough breakdown, I have improved my process phenomenally thanks to all of the suggestions in here, really appreciate this, thank you. I currently have the one-liner working, which has improved the speed dramatically, and I'm working towards improving my knowledge of hash tables. Thanks again! – RhysPickett Jul 27 '22 at 09:35