2

I have two excel-files with over 100,000 rows. I want to merge them in PowerShell. I can quickly import them with Import-Excel. The issue is that I can't find a way to merge them that doesn't take forever.

I want an inner-join. In R or Python an inner-join takes probably less than a second and is one line of code. Is there anyway to do this in PowerShell that is quick?

For example, I tried this join-object function but it took forever: join-object

Join-Object -Left $df1 -Right $df2 -LeftJoinProperty name -RightJoinProperty name -Type OnlyIfInBoth 

Why is this so fast in other languages even with over 100,000 rows but in PowerShell it's impossibly slow? I want to use PowerShell because I don't have Python or R on some computers.

mklement0
  • 382,024
  • 64
  • 607
  • 775
Frank
  • 952
  • 1
  • 9
  • 23
  • Have you tried [this one](https://stackoverflow.com/a/45483110/7571258)? It seems to be quite popular. – zett42 Sep 02 '22 at 14:46
  • 1
    How are you populating ```$df1``` and ```$df2```? If they’re Excel COM objects then enumerating them will take a *lot* longer than reading the values from the ranges into 2D arrays, converting to lists of PSCustomObjects and then joining *those*… (or, not so clean, but would work - using Excel COM to export both files as csv files and loading them with ```Import-Csv```) – mclayton Sep 02 '22 at 16:48
  • 1
    @mclayton op seem to be using `Import-Excel` so this is just an array of `pscustomobject` – Santiago Squarzon Sep 02 '22 at 17:45

2 Answers2

1

Perhaps using this will be faster:

# create a Hashtable, which works case-insensitive by default
$right = @{}
foreach ($name in $df2.Name) { $right[$name] = $true }  # the value doesn't matter

# now loop over the left array ($df1) and output only 
# the rows where the name is found in the $right Hashtable
$result = foreach ($item in $df1) {
    if ($right.ContainsKey($item.Name)) {
        $item
    }
}

If you want to compare the names case-sensitively, you can use a HashSet:

$right = [System.Collections.Generic.HashSet[String]]::new()
foreach ($name in $df2.Name) { [void]$right.Add($name) } 

# now loop over the left array ($df1) and output only 
# the rows where the name is found in the $right HashSet
$result = foreach ($item in $df1) {
    if ($right.Contains($item.Name)) {
        $item
    }
}

You can also try if Compare-Object would be faster in your case:

$result = Compare-Object -ReferenceObject $df1 -DifferenceObject $df2 -Property Name -IncludeEqual -ExcludeDifferent -PassThru

Compare-Object also has a -CaseSensitive switch if that is what you need


Based on the comment you posted, you can use the first option mentioned above and change the final loop into this:

# now loop over the left array ($df1) and output only 
# the rows where the name is found in the $right Hashtable
# an inner loop merges the properties 
$result = foreach ($item in $df1) {
    if ($right.ContainsKey($item.Name)) {
        $merge = $df2 | Where-Object {$_.Name -eq $item.Name}
        $merge.PsObject.Properties.Name | 
            Where-Object {$item.PsObject.Properties.Name -notcontains $_} | 
            ForEach-Object {
                $item | Add-Member -MemberType NoteProperty -Name $_ -Value $merge."$_"
            }
        $item
    }
}

$result
Theo
  • 57,719
  • 8
  • 24
  • 41
  • The first option worked really well but how to merge after? – Frank Sep 07 '22 at 13:42
  • @Frank you mean how to save to file? Use `$result | Export-Csv -Path 'X:\whereEver\thefile.csv` -UseCulture -NoTypeInformation` – Theo Sep 07 '22 at 13:59
  • The ```$result``` you have is a list of keys in $df1 that are in $df2 and that worked really quickly. But the final result I need is an object with all properties from $df1 and all properties of $df2 (inner-join of two tables basically) – Frank Sep 07 '22 at 14:07
  • @Frank Please see the edit to my answer – Theo Sep 09 '22 at 11:46
0

The .NET method Linq.Enumerable.Join seems to be pretty fast. As it's not immediately obvious how to use it from PowerShell (you need some casts to be able to pass ScriptBlocks to its selector parameters), I've wrapped it in a PowerShell function Join-Enumerable to make it easier to use.

Function Join-Enumerable{
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)] 
        [Collections.IEnumerable] $Left,
        
        [Parameter(Mandatory)] 
        [Collections.IEnumerable] $Right,

        [Parameter(Mandatory)] 
        $LeftKey,
        
        [Parameter(Mandatory)] 
        $RightKey,      
        
        [Parameter(Mandatory)] 
        [scriptblock] $Result,

        [Parameter()] 
        [switch] $NoEnumerate
     )

    if( $RightKey -isnot [scriptblock] ) {
        $RightKey = "$RightKey" -replace "'", "''"
        $RightKey = [scriptblock]::Create( "`$args[0].'$RightKey'" )
    }
    if( $LeftKey -isnot [scriptblock] ) {
        $LeftKey = "$LeftKey" -replace "'", "''"
        $LeftKey = [scriptblock]::Create( "`$args[0].'$LeftKey'" )
    }

    $enumerator = [Linq.Enumerable]::Join( $Left, $Right, 
        [Func[object, object]] $RightKey, 
        [Func[object, object]] $LeftKey, 
        [Func[object, object, object]] $Result )

    if( $NoEnumerate ) {
        return , $enumerator  # Outputs the enumerator itself
    }

    $enumerator  # Enumerates and outputs each result element
}

Demo:

$df1 = ConvertFrom-Csv @'
Name,Foo
foo,1
bar,2
baz,3
'@

$df2 = ConvertFrom-Csv @'
Name,Bar
bar,4
bam,5
foo,6
'@

$joinedArray = Join-Enumerable -Left $df1 -Right $df2 -LeftKey Name -RightKey Name -Result {
    [PSCustomObject] @{ 
        Name = $args[0].Name
        Foo  = $args[0].Foo
        Bar  = $args[1].Bar
    }
}

$joinedArray  # Output

Output:

Name Foo Bar
---- --- ---
foo  1   6
bar  2   4

Remarks:

  • Parameters -Left and -Right specify the two arrays (more exactly anything that implements the IEnumerable interface) to join
  • Parameters -LeftKey and -RightKey specify the keys to join on from elements of the -Left and -Right arguments. These can be either the name of a property or a scriptblock that gets the current element passed as an argument and outputs the key value.
  • Parameter -Result takes a scriptblock that specifies how to generate each output object. It takes two arguments, the 1st is an element from the -Left argument and the 2nd is an element from the -Right argument.
  • While Linq.Enumerable.Join doesn't actually return the joined objects, but an enumerator that lazily evaluates the result, PowerShell commands are expected to output the actual objects. This is what the function does by default. If you really need the enumerator, pass -NoEnumerate to the function.
  • For maximum performance, arrays of class objects should be passed for -Left and -Right and the result should be instanciated using a class instead of PSCustomObject. If you don't control the type of the input objects, using a class for the output objects still gives a noticable performance improvement and uses less memory:
    class FooBar { 
        [string] $Name
        [int] $Foo
        [int] $Bar
    }
    $joinedArray = Join-Enumerable -Left $df1 -Right $df2 -LeftKey Name -RightKey Name -Result {
        [FooBar] @{ 
            Name = $args[0].Name
            Foo  = $args[0].Foo
            Bar  = $args[1].Bar
        }
    }
    

Performance tests:

For comparing performance I used the this code. Each test script creates two arrays of 100000 elements each, with random name properties. The name properties of the 2nd array match elements from the 1st array randomly. 10 runs are done for each method and the average duration is taken.

Method Input Output Duration
Join-Enumerable class object class object 1.4 s
Join-Enumerable class object [PSCustomObject] 2.62 s
Join-Enumerable [PSCustomObject] [PSCustomObject] 4.6 s
Join-Object (iRon) class object [PSCustomObject] 36.2 s
Join-Object (RamblingCookieMonster) class object [PSCustomObject] 44.4 s
zett42
  • 25,437
  • 3
  • 35
  • 72
  • You have to manually specify the result object? What if the Excels have 100+ columns each? – Frank Sep 07 '22 at 12:59
  • 1
    @Frank Using the function presented in this answer: yes. In general though, it should be possible to generate the output object dynamically, based on the input data. I'm still looking for an elegant way to do this, that doesn't adversely affect performance compared to a manually created object. – zett42 Sep 07 '22 at 14:02
  • @Frank Here is a [proof-of-concept version of the function](https://gist.github.com/zett42/c7f162846ca5c9d959115ef807416c7d) that generates the output object dynamically and is as fast as the code that manually specifies the output object. So you can now omit the `-Result` argument. It is still somewhat limited, but it shouldn't be a big deal to extend it (e. g. resolving conflicting property names). – zett42 Sep 07 '22 at 22:11