
I have a bunch of file shares with many millions of files/folders on them. I am using gci -Recurse to get a full list of directories/files on the share, and I need to load several pieces of information from that gci output into SQL Server for additional analysis. The command I'm using to get the data is:

gci $SharePath -Recurse | select FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc

Now I could just pipe that to Write-SqlTableData using the recommended syntax to force bulk inserts, as suggested in Option 3 on Microsoft's Write-SqlTableData documentation page, like this:

$Params = @{
    ServerInstance = 'sqlservername'
    DatabaseName = 'databasename'
    SchemaName = 'dbo'
}
,(gci $SharePath -Recurse | select FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc) | Write-SqlTableData @Params -TableName 'Table1'

The result of this, however, is that the gci takes several hours to complete with no feedback, uses up many GB of memory, and slows my machine to a crawl before finally dumping all of the data to SQL. If I leave off the leading ,( and the matching ), the data moves to SQL as it is generated, but then the SQL server is pounded with millions of individual inserts.

What I'm looking for is an intermediate answer which uses the pipeline. I know I can store the gci results in a variable $gciresults and then pass 1000 rows at a time to SQL using $gciresults[0..999] and so on (sketched at the end of this question), but I am trying to utilize the pipeline so I don't use up too much memory. Ideally, there would be some cmdlet, which I'll call batching-cmdlet, that would split my incoming data into bite-sized chunks without storing it all in memory first, like this:

gci ... | select FullName, ... | batching-cmdlet -batchsize 1000 | Write-SqlTableData @Params -TableName 'Table1'

Searches for such a cmdlet were unsuccessful. Does anyone have thoughts on how I might accomplish this?
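
For reference, the index-slicing workaround mentioned above would look something like this (a rough sketch only; it works, but it buffers the entire gci result set in memory before the first insert, which is exactly what I want to avoid):

$gciresults = gci $SharePath -Recurse | select FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc
for ($i = 0; $i -lt $gciresults.Count; $i += 1000) {
    # Slice out up to 1000 rows and send each slice as a single object.
    $end = [Math]::Min($i + 999, $gciresults.Count - 1)
    ,($gciresults[$i..$end]) | Write-SqlTableData @Params -TableName 'Table1'
}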


2 Answers

As of PowerShell 7.0, there is no batching (partitioning) mechanism, unfortunately.

Therefore, you'll have to implement batching yourself for now:

# Create an aux. queue for batching the objects.
$batchSize = 1000
$batch = [System.Collections.Generic.Queue[pscustomobject]]::new($batchSize)

Get-ChildItem $SharePath -Recurse | 
  Select-Object FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc |
    ForEach-Object { 
      $batch.Enqueue($_) # add object to the batch
      if ($batch.Count -eq $batchSize) { # batch is full, write to table.
        # send batch as a whole through the pipeline
        , $batch | Write-SqlTableData @Params -TableName Table1
        $batch.Clear() # start next batch
      }
    }

# Process any remaining items.
if ($batch.Count) {
  , $batch | Write-SqlTableData @Params -TableName Table1
}
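
As an aside on the `, $batch` technique used above: the unary comma wraps the queue in a transient single-element array, which the pipeline unrolls, so the queue itself travels as a single object instead of being enumerated element by element. A quick standalone demo (independent of the solution above):

# The unary comma sends a collection through the pipeline as one object.
$queue = [System.Collections.Generic.Queue[int]]::new()
1..3 | ForEach-Object { $queue.Enqueue($_) }
($queue | Measure-Object).Count    # -> 3: the queue's elements are enumerated
(, $queue | Measure-Object).Count  # -> 1: the queue arrives as a single object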
    This kind of structure was exactly what I was looking for, but I wasn't aware of the .NET Queue Collections type. I plan to turn this into a cmdlet and add it to my permanent toolkit. Running the code this way gives me the ability to get feedback every 1000 lines to know where I am in the process, and is very memory efficient to boot. Thanks very much for taking the time to do this @mklement0! – Matthew Apr 21 '20 at 15:09

Using the framework outlined by @mklement0 in his accepted answer, I wrote the following Split-PipelineData cmdlet, which takes pipeline input and passes it downstream in user-definable batches. Note: this turns out to be very similar to the function in the post linked by @mklement0; however, I also added the ability to report progress using Write-Progress.

<#
.Synopsis
    Takes pipeline objects one at a time and sends them on in batches.
.DESCRIPTION
    Takes pipeline objects one at a time and sends them on in batches. Allows user-selectable values for
    batch size and feedback options.
#>
Function Split-PipelineData
{
    [CmdletBinding(DefaultParameterSetName='Default')]
    Param
    (
        # PipelineData
        [Alias('PipelineData')]
        [Parameter(ParameterSetName='Default',Mandatory=$true,ValueFromPipeline=$true,Position=0)]
        [Parameter(ParameterSetName='Progress',Mandatory=$true,ValueFromPipeline=$true,Position=0)]
        $InputObject,

        # Batch size for sending on to the pipeline
        [Parameter(ParameterSetName='Default',Mandatory=$false)]
        [Parameter(ParameterSetName='Progress',Mandatory=$false)]
        [int]$BatchSize=1000,

        # If set, Progress will use Write-Progress to display progress information
        [Parameter(ParameterSetName='Progress',Mandatory=$true)]
        [switch]$Progress,

        # Passthru to Write-Progress ID parameter
        [Parameter(ParameterSetName='Progress',Mandatory=$false)]
        [int]$ProgressID=0,

        # Passthru to Write-Progress ParentID parameter
        [Parameter(ParameterSetName='Progress',Mandatory=$false)]
        [int]$ProgressParentID=-1,

        # Passthru to Write-Progress Activity parameter. Default is 'Batching pipeline data'.
        [Parameter(ParameterSetName='Progress',Mandatory=$false)]
        [string]$ProgressActivity,

        # Report progress after this many records. Defaults to the same as BatchSize.
        [Parameter(ParameterSetName='Progress',Mandatory=$false)]
        [int]$ProgressBatchSize=0,

        # Total record count (if known), used to compute progress percentages
        [Parameter(ParameterSetName='Progress',Mandatory=$false)]
        [int]$TotalRecords=0
    )

    Begin
    {
        $Batch = [System.Collections.Generic.Queue[pscustomobject]]::new($BatchSize)
        [int64]$RecordCounter = 0
        If ($Progress)
        {
            $ProgressParams = @{
                Activity = If ($ProgressActivity) {$ProgressActivity} Else {'Batching pipeline data'}
                Status = ''
                ID = $ProgressID
                ParentID = $ProgressParentID
                PercentComplete = -1
            }
            If ($ProgressBatchSize -eq 0) {$ProgressBatchSize = $BatchSize}
        }
    }
    Process
    {
        $RecordCounter++

        #Add record to batch
        $Batch.Enqueue($_)

        #Report progress if necessary
        If ($Progress -and ($RecordCounter % $ProgressBatchSize -eq 0))
        {
            If ($TotalRecords)
            {
                $ProgressParams.Status = "Piping record $RecordCounter/$TotalRecords"
                $ProgressParams.PercentComplete = [int](100*($RecordCounter/$TotalRecords))
            }
            Else
            {
                $ProgressParams.Status = "Piping record $RecordCounter"
            }
            Write-Progress @ProgressParams
        }

        #Pass batch on if it has reached its threshold
        if ($Batch.Count -eq $BatchSize)
        { 
            ,($Batch)
            $Batch.Clear() # start next batch
        }
    }
    End
    {
        #Report final progress if necessary
        If ($Progress)
        {
            If ($TotalRecords)
            {
                $ProgressParams.Status = "Piping record $RecordCounter/$TotalRecords"
                $ProgressParams.PercentComplete = 100
            }
            Else
            {
                $ProgressParams.Status = "Piping record $RecordCounter"
            }
            Write-Progress @ProgressParams
        }

        #Pass any remaining records on and clear the variable
        If ($Batch.Count)
        {
            ,($Batch)
        }
        $Batch.Clear()
        Remove-Variable Batch

        #Clear progress bars if necessary
        If ($Progress)
        {
            $ProgressParams.Activity = 'Completed'
            If ($ProgressParams.ContainsKey('Status')) {$ProgressParams.Remove('Status')}
            Write-Progress @ProgressParams -Completed
        }
    }
}
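
A usage sketch, tying this back to the original question (it assumes $SharePath and the $Params splatting table from the question are already defined):

# Stream the share listing to SQL in batches of 1000, with a progress bar.
Get-ChildItem $SharePath -Recurse |
  Select-Object FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc |
  Split-PipelineData -BatchSize 1000 -Progress |
  Write-SqlTableData @Params -TableName 'Table1'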
  • Yes, in essence this is quite similar to the `Select-Chunk` function from my linked answer, but the progress-bar support is a nice addition. Two quibbles: I suggest renaming `$PipelineData` to `$InputObject` for consistency with other cmdlets; the verb `Split` is a bit confusing, since what you're doing is _batching_ or _chunking_; while there is no suitable approved verb, you can use the noun part, such as in `Select-Batch` (don't think you need the word "pipeline" in the name). – mklement0 Apr 21 '20 at 16:54
  • I split the difference and changed the parameter, but added a parameter alias for 'PipelineData' for my own purposes. I vacillated on the verb a bunch of times before I settled on Split because we're not really selecting anything, we're splitting a stream. I agree `Chunk-Stream` might be a better description, but I know I personally won't think `Select` when I'm splitting/chunking. Agree to disagree on that one? – Matthew Apr 21 '20 at 19:51
  • I see your point, and any name chosen here will be a compromise; we don't need to agree on a name, but I'll leave you with this thought: you can think of `Select-Chunk` as "selecting chunks of input objects from the entire set of input objects". `Split-Chunk` reverses this perspective. – mklement0 Apr 21 '20 at 20:03