
I have a large set of data, roughly 10 million items, that I need to process efficiently and quickly, removing duplicate items based on two of the six column headers.

I have tried grouping and sorting items but it's horrendously slow.

$p1 = $test | Group-Object -Property ComputerSerialID, ComputerID
$p2 = foreach ($object in $p1) {
    $object.Group | Sort-Object -Property FirstObserved | Select-Object -First 1
}

The goal is to remove duplicates by assessing two columns while keeping the oldest record based on FirstObserved.

The data looks something like this:

LastObserved     : 2019-06-05T15:40:37
FirstObserved    : 2019-06-03T20:29:01
ComputerName     : 1
ComputerID       : 2
Virtual          : 3
ComputerSerialID : 4

LastObserved     : 2019-06-05T15:40:37
FirstObserved    : 2019-06-03T20:29:01
ComputerName     : 5
ComputerID       : 6
Virtual          : 7
ComputerSerialID : 8

LastObserved     : 2019-06-05T15:40:37
FirstObserved    : 2019-06-03T20:29:01
ComputerName     : 9
ComputerID       : 10
Virtual          : 11
ComputerSerialID : 12

BDubs
  • please change your title - you are NOT talking about a 2d array ... you are talking about a ONE-d array that you want to filter based on two of the properties in objects in that array. [*grin*] – Lee_Dailey Jun 05 '19 at 19:15
  • Honestly, with that many items, why aren't you using a real database to do this work? I suppose you could build a hashtable, and perform lookups against that, but that is going to be slow too compared to using a tool that's built for handling large data sets. – TheMadTechnician Jun 05 '19 at 19:19

4 Answers


You might want to clean up your question a bit because it's hard to read, but I'll try to answer as best I can with what I understand about what you're trying to do.

Unfortunately, with this much data there's no way to do this quickly. String comparison and sorting are brute-force operations; you can't reduce the cost of comparing one string against another below checking them character by character to see if they're the same.

(Honestly, if this were me, I'd just use Export-Csv on $object and do this operation in Excel. The time trade-off of scripting something you'll only run once just wouldn't be worth it.)

By "Items" I'm going to assume that you mean rows in your table, and that you're not trying to retrieve only the strings in the rows you're looking for. You've already got the basic idea of select-object down, you can do that for the whole table:

$outputFirstObserved = $inputData | Sort-Object -Property FirstObserved -Unique 
$outputLastObserved = $inputData | Sort-Object -Property LastObserved -Unique

Now you have ~20 million rows in memory, but I guess that beats doing it by hand. All that's left is to join the two tables. You can download the Join-Object command from the PowerShell Gallery with Install-Script -Name Join and use it as described there. If you want to do this step yourself, the easiest way would be to squish the two tables together and sort them again:

$output = $outputFirstObserved + $outputLastObserved
$return = $output | Sort-Object | Get-Unique
ncfx1099
  • Regarding Excel, it still can't handle more than [about a million rows.](https://support.office.com/en-us/article/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3) – Booga Roo Jun 05 '19 at 23:27

Does this do it? It keeps the one it finds first.

$test | sort -u ComputerSerialID, ComputerID
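
Spelled out without aliases (sort resolves to Sort-Object and -u to -Unique; the property names are assumed to match the ComputerSerialID and ComputerID columns from the sample data):

$test | Sort-Object -Property ComputerSerialID, ComputerID -Unique

Note that whichever duplicate this keeps, there's no guarantee it's the one with the oldest FirstObserved, since FirstObserved isn't part of the sort.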
js2010

I created this function to de-duplicate my multi-dimensional arrays. Basically, I concatenate the contents of each record and add the result to a hash table. If the concatenated text already exists in the hash, the record isn't added to the array that gets returned.

Function DeDupe_Array
{
    param
    (
        $Data
    )

    $Return_Array = @()
    $Check_Hash = @{}

    foreach ($Line in $Data)
    {
        # Concatenate every property value of the record into a single string
        $Concatenated = ''
        $Elements = ($Line | Get-Member -MemberType NoteProperty | ForEach-Object { $_.Name })

        foreach ($Element in $Elements)
        {
            $Concatenated += $Line.$Element
        }

        # Only keep the record if that combination hasn't been seen yet
        if ($Check_Hash.$Concatenated -ne 1)
        {
            $Check_Hash.Add($Concatenated, 1)
            $Return_Array += $Line
        }
    }

    return $Return_Array
}
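
A minimal usage sketch (the variable and file names here are hypothetical). Note that the function keys on every NoteProperty, so it removes rows that are identical across all columns rather than just the two key columns:

$test = Import-Csv -Path .\computers.csv        # hypothetical input file
$deduped = DeDupe_Array -Data $test
$deduped | Export-Csv -Path .\deduped.csv -NoTypeInformation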


Try the following script.

It should be as fast as possible because it avoids any piping in PowerShell.

$hashT = @{}
foreach ($item in $csvData) {
    # Building hash table key
    $key = '{0}###{1}' -f $item.ComputerSerialID, $item.ComputerID

    # If $key doesn't exist yet, OR if this item's FirstObserved is earlier than the one already stored in $hashT
    # (the string comparison is only valid when the date is provided in a sortable / international format)
    if ((! $hashT.ContainsKey($key)) -or ( $item.FirstObserved -lt $hashT[$key].FirstObserved )) {
        $hashT[$key] = $item
    }
}
$result = $hashT.Values 
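
A variation of the same loop (file names are hypothetical) that casts FirstObserved to [datetime], so the comparison also works when the dates aren't stored in a sortable string format, and writes the result back out:

$csvData = Import-Csv -Path .\computers.csv      # hypothetical input file

$hashT = @{}
foreach ($item in $csvData) {
    # Composite key built from the two columns that define a duplicate
    $key = '{0}###{1}' -f $item.ComputerSerialID, $item.ComputerID

    # Keep the record with the earliest FirstObserved for each key
    if ((-not $hashT.ContainsKey($key)) -or ([datetime]$item.FirstObserved -lt [datetime]$hashT[$key].FirstObserved)) {
        $hashT[$key] = $item
    }
}

$hashT.Values | Export-Csv -Path .\deduped.csv -NoTypeInformation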
swbbl