
This article shows how to use Invoke-Async in PowerShell: https://sqljana.wordpress.com/2018/03/16/powershell-sql-server-run-in-parallel-collect-sql-results-with-print-output-from-across-your-sql-farm-fast/

I wish to run the Copy-Item cmdlet in PowerShell in parallel, because the alternative is to use FileSystemObject via Excel and copy one file at a time, out of a total of millions of files.

I have cobbled together the following:

<#
.SYNOPSIS
<Brief description>
For examples type:
Get-Help .\<filename>.ps1 -examples
.DESCRIPTION
Copies files from one path to another
.PARAMETER FileList
e.g. C:\path\to\list\of\files\to\copy.txt
.PARAMETER NumCopyThreads
default is 8 (but can be 100 if you want to stress the machine to maximum!)
.EXAMPLE
.\CopyFilesToBackup -filelist C:\path\to\list\of\files\to\copy.txt
.NOTES
#>

[CmdletBinding()] 
Param( 
    [String] $FileList = "C:\temp\copytest.csv", 
    [int] $NumCopyThreads = 8
) 

$filesToCopy = New-Object "System.Collections.Generic.List[fileToCopy]"
$csv = Import-Csv $FileList

foreach($item in $csv)
{
    $file = New-Object fileToCopy
    $file.SrcFileName = $item.SrcFileName
    $file.DestFileName = $item.DestFileName
    $filesToCopy.add($file)
}

$sb = [scriptblock] {
    param($file)
    Copy-Item -Path $file.SrcFileName -Destination $file.DestFileName
}
$results = Invoke-Async -Set $filesToCopy -SetParam file -ScriptBlock $sb -Verbose -Measure:$true -ThreadCount $NumCopyThreads
$results | Format-Table

Class fileToCopy {
    [String]$SrcFileName = ""
    [String]$DestFileName = ""
}

The CSV input for which looks like this:

SrcFileName,DestFileName
C:\Temp\dummy-data\101438\101438-0154723869.zip,\\backupserver\Project Archives\101438\0154723869.zip
C:\Temp\dummy-data\101438\101438-0165498273.xlsx,\\backupserver\Project Archives\101438\0165498273.xlsx

What am I missing to get this working? When I run .\CopyFiles.ps1 -FileList C:\Temp\test.csv, nothing happens. The files exist in the source path, but the file objects aren't being pulled from the -Set collection. (Unless I have misunderstood how the collection is used?)

No, I can't use robocopy to do this because there are millions of files which resolve to different paths depending upon their original location.

AlexFielder
  • Put the class declaration of "fileToCopy" before you use it, i.e. before the line `$filesToCopy = New-Object "System.Collections.Generic.List[fileToCopy]"` – f6a4 Aug 27 '19 at 13:44
  • @f6a4: Unlike functions, classes are parsed before execution, and are therefore "hoisted". That is, they needn't be defined _before_ they're used; try `[foo].Name; class Foo {}` – mklement0 Aug 27 '19 at 13:57
  • Thanks @mklement0, I knew what I had was correct; the code up to the Invoke-Async line works just fine as it is. If what @f6a4 said were true, it wouldn't work at all. And moving the class declaration does nothing to affect the overall "shrug" generated by Invoke-Async – AlexFielder Aug 27 '19 at 14:02
  • Does `.\CopyFiles.ps1 -FileList C:\temp\test.csv -Verbose` output anything interesting? – Mathias R. Jessen Aug 27 '19 at 16:06
  • all it does now is output a bunch of errors. One thing I found is that when I succeeded in passing the Class I made to the Invoke-Async module, it complains that it can't find the type. I've tried using a PSCustomObject but to no avail. – AlexFielder Aug 27 '19 at 19:23
  • I just re-read your last reply @MathiasR.Jessen and here's the output: https://1drv.ms/u/s!AiGmKryqFliPrYFvpCeZAIUuNYLYZA?e=umh3KN – AlexFielder Aug 27 '19 at 20:26

1 Answer


I have no explanation for your symptom based on the code in your question (see bottom section), but I suggest basing your solution on the (now) standard Start-ThreadJob cmdlet (which comes with PowerShell Core; in Windows PowerShell, install it with, for instance, Install-Module ThreadJob -Scope CurrentUser[1]).

Such a solution is more efficient than the third-party Invoke-Async function, which, as of this writing, is flawed in that it waits for jobs to finish in a tight loop, creating unnecessary processing overhead.

Start-ThreadJob jobs are a lightweight, thread-based alternative to the process-based Start-Job background jobs, yet they integrate with the standard job-management cmdlets, such as Wait-Job and Receive-Job.
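
For instance, here's a minimal sketch of that integration (assuming the ThreadJob module is available):

$job = Start-ThreadJob { 'hello from a thread' }   # Create a thread job.
Wait-Job -Job $job        # Blocks until the thread finishes (no tight-loop polling).
Receive-Job -Job $job     # Outputs: hello from a thread
Remove-Job -Job $job      # Clean up the job object.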

Here's a self-contained example based on your code that demonstrates its use:

Note: Whether you use Start-ThreadJob or Invoke-Async, you won't be able to explicitly reference custom classes such as [fileToCopy] in the script block that runs in separate threads (runspaces; see bottom section), so the solution below simply uses [pscustomobject] instances with the properties of interest, for simplicity and brevity.

# Create sample CSV file with 10 rows.
$FileList = Join-Path ([IO.Path]::GetTempPath()) "tmp.$PID.csv"
@'
Foo,SrcFileName,DestFileName,Bar
1,c:\tmp\a,\\server\share\a,baz
2,c:\tmp\b,\\server\share\b,baz
3,c:\tmp\c,\\server\share\c,baz
4,c:\tmp\d,\\server\share\d,baz
5,c:\tmp\e,\\server\share\e,baz
6,c:\tmp\f,\\server\share\f,baz
7,c:\tmp\g,\\server\share\g,baz
8,c:\tmp\h,\\server\share\h,baz
9,c:\tmp\i,\\server\share\i,baz
10,c:\tmp\j,\\server\share\j,baz
'@ | Set-Content $FileList

# How many threads at most to run concurrently.
$NumCopyThreads = 8

Write-Host 'Creating jobs...'
$dtStart = [datetime]::UtcNow

# Import the CSV data and transform it to [pscustomobject] instances
# with only .SrcFileName and .DestFileName properties - they take
# the place of your original [fileToCopy] instances.
$jobs = Import-Csv $FileList | Select-Object SrcFileName, DestFileName | 
  ForEach-Object {
    # Start the thread job for the file pair at hand.
    Start-ThreadJob -ThrottleLimit $NumCopyThreads -ArgumentList $_ { 
      param($f) 
      $simulatedRuntimeMs = 2000 # How long each job (thread) should run for.
      # Delay output for a random period.
      $randomSleepPeriodMs = Get-Random -Minimum 100 -Maximum $simulatedRuntimeMs
      Start-Sleep -Milliseconds $randomSleepPeriodMs
      # Produce output.
      "Copied $($f.SrcFileName) to $($f.DestFileName)"
      # Wait for the remainder of the simulated runtime.
      Start-Sleep -Milliseconds ($simulatedRuntimeMs - $randomSleepPeriodMs)
    }
  }

Write-Host "Waiting for $($jobs.Count) jobs to complete..."

# Synchronously wait for all jobs (threads) to finish and output their results
# *as they become available*, then remove the jobs.
# NOTE: Output will typically NOT be in input order.
Receive-Job -Job $jobs -Wait -AutoRemoveJob
Write-Host "Total time lapsed: $([datetime]::UtcNow - $dtStart)"

# Clean up the temp. file
Remove-Item $FileList

The above yields something like:

Creating jobs...
Waiting for 10 jobs to complete...
Copied c:\tmp\b to \\server\share\b
Copied c:\tmp\g to \\server\share\g
Copied c:\tmp\d to \\server\share\d
Copied c:\tmp\f to \\server\share\f
Copied c:\tmp\e to \\server\share\e
Copied c:\tmp\h to \\server\share\h
Copied c:\tmp\c to \\server\share\c
Copied c:\tmp\a to \\server\share\a
Copied c:\tmp\j to \\server\share\j
Copied c:\tmp\i to \\server\share\i
Total time lapsed: 00:00:05.1961541

Note that the output received does not reflect the input order, and that the overall runtime is roughly 2 times the per-thread runtime of 2 seconds (plus overhead), because 2 "batches" have to be run: the input count is 10, whereas only 8 threads were made available.

If you upped the thread count to 10 or more (50 is the default), the overall runtime would drop to 2 seconds plus overhead, because all jobs then run concurrently.

Caveat: The above numbers stem from running in PowerShell Core on Microsoft Windows 10 Pro (64-bit; Version 1903), using version 2.0.1 of the ThreadJob module.
Inexplicably, the same code is much slower in Windows PowerShell, v5.1.18362.145.


However, for better performance and lower memory consumption, it is preferable in your case to use batching (chunking), i.e., to process multiple file pairs per thread.

The following solution demonstrates this approach; tweak $chunkSize to find a batch size that works for you.

# Create sample CSV file with 10 rows.
$FileList = Join-Path ([IO.Path]::GetTempPath()) "tmp.$PID.csv"
@'
Foo,SrcFileName,DestFileName,Bar
1,c:\tmp\a,\\server\share\a,baz
2,c:\tmp\b,\\server\share\b,baz
3,c:\tmp\c,\\server\share\c,baz
4,c:\tmp\d,\\server\share\d,baz
5,c:\tmp\e,\\server\share\e,baz
6,c:\tmp\f,\\server\share\f,baz
7,c:\tmp\g,\\server\share\g,baz
8,c:\tmp\h,\\server\share\h,baz
9,c:\tmp\i,\\server\share\i,baz
10,c:\tmp\j,\\server\share\j,baz
'@ | Set-Content $FileList

# How many threads at most to run concurrently.
$NumCopyThreads = 8

# How many files to process per thread
$chunkSize = 3

# The script block to run in each thread, which now receives a
# $chunkSize-sized *array* of file pairs.
$jobScriptBlock = { 
  param([pscustomobject[]] $filePairs)
  $simulatedRuntimeMs = 2000 # How long each job (thread) should run for.
  # Delay output for a random period.
  $randomSleepPeriodMs = Get-Random -Minimum 100 -Maximum $simulatedRuntimeMs
  Start-Sleep -Milliseconds $randomSleepPeriodMs
  # Produce output for each pair.  
  foreach ($filePair in $filePairs) {
    "Copied $($filePair.SrcFileName) to $($filePair.DestFileName)"
  }
  # Wait for the remainder of the simulated runtime.
  Start-Sleep -Milliseconds ($simulatedRuntimeMs - $randomSleepPeriodMs)
}

Write-Host 'Creating jobs...'
$dtStart = [datetime]::UtcNow

$jobs = & {
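
  # Note: The enclosing & { ... } block collects the job objects created
  # below (by both the pipeline and the remainder handling) into the
  # single $jobs array.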

  # Process the input objects in chunks.
  $i = 0
  $chunk = [pscustomobject[]]::new($chunkSize)
  Import-Csv $FileList | Select-Object SrcFileName, DestFileName | ForEach-Object {
    $chunk[$i % $chunkSize] = $_
    if (++$i % $chunkSize -ne 0) { return }
    # Note the need to wrap $chunk in a single-element helper array (, $chunk)
    # to ensure that it is passed *as a whole* to the script block.
    Start-ThreadJob -ThrottleLimit $NumCopyThreads -ArgumentList (, $chunk) -ScriptBlock $jobScriptBlock
    $chunk = [pscustomobject[]]::new($chunkSize) # we must create a new array
  }

  # Process any remaining objects.
  # Note: $chunk -ne $null returns those elements in $chunk, if any, that are non-null
  if ($remainingChunk = $chunk -ne $null) { 
    Start-ThreadJob -ThrottleLimit $NumCopyThreads -ArgumentList (, $remainingChunk) -ScriptBlock $jobScriptBlock
  }

}

Write-Host "Waiting for $($jobs.Count) jobs to complete..."

# Synchronously wait for all jobs (threads) to finish and output their results
# *as they become available*, then remove the jobs.
# NOTE: Output will typically NOT be in input order.
Receive-Job -Job $jobs -Wait -AutoRemoveJob
Write-Host "Total time lapsed: $([datetime]::UtcNow - $dtStart)"

# Clean up the temp. file
Remove-Item $FileList

While the output is effectively the same, note how only 4 jobs were created this time, each of which processed (up to) $chunkSize (3) file pairs.
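
To adapt either solution to perform the actual copying, replace the simulated work in the script block with real Copy-Item calls; here's a minimal sketch (the directory-creation step is an assumption, for the case where the destination folders don't yet exist):

$jobScriptBlock = {
  param([pscustomobject[]] $filePairs)
  foreach ($filePair in $filePairs) {
    # Assumption: ensure the destination directory exists before copying.
    $destDir = Split-Path -Path $filePair.DestFileName -Parent
    if (-not (Test-Path -LiteralPath $destDir)) {
      $null = New-Item -ItemType Directory -Path $destDir
    }
    Copy-Item -LiteralPath $filePair.SrcFileName -Destination $filePair.DestFileName
    "Copied $($filePair.SrcFileName) to $($filePair.DestFileName)"
  }
}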


As for what you tried:

The screen shot you show suggests that the problem is that your custom class, [fileToCopy], isn't visible to the script block run by Invoke-Async.

Since Invoke-Async invokes the script block via the PowerShell SDK in separate runspaces that know nothing about the caller's state, it is to be expected that these runspaces don't know your class (this equally applies to Start-ThreadJob).

However, it is unclear why that is a problem in your code, because your script block doesn't make an explicit reference to your class: your script-block parameter $file is not type-constrained (it is implicitly [object]-typed).

Therefore, simply accessing the properties of your custom-class instance inside the script block should work, and indeed does in my tests on Windows PowerShell v5.1.18362.145 on Microsoft Windows 10 Pro (64-bit; Version 1903).

However, if your real script-block code were to explicitly reference custom class [fileToCopy] - such as by declaring the parameter as param([fileToCopy] $file) - you would see the symptom.
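
To illustrate with a hypothetical self-contained example (using Start-ThreadJob; the same applies to Invoke-Async): the type-constrained variant below fails with an "unable to find type" error, whereas the untyped one works:

class fileToCopy { [string] $SrcFileName; [string] $DestFileName }
$f = [fileToCopy] @{ SrcFileName = 'c:\tmp\a'; DestFileName = '\\server\share\a' }

# FAILS: the job's runspace doesn't know the [fileToCopy] type.
Start-ThreadJob { param([fileToCopy] $file) $file.SrcFileName } -ArgumentList $f |
  Receive-Job -Wait -AutoRemoveJob

# WORKS: an untyped parameter merely accesses the instance's properties.
Start-ThreadJob { param($file) $file.SrcFileName } -ArgumentList $f |
  Receive-Job -Wait -AutoRemoveJob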


[1] In Windows PowerShell v3 and v4, which do not come with the PowerShellGet module, Install-Module isn't available by default. However, the module can be installed on demand, as described in Installing PowerShellGet.

mklement0
  • I hadn't seen this until just now, and it's late here @mklement0 so I shall read and digest this in the morning. Thank you very much for your input. – AlexFielder Aug 28 '19 at 22:31
  • @AlexFielder: My pleasure; it was a learning experience for me too - I hope I got everything right; happy to amend the answer if not. – mklement0 Aug 28 '19 at 22:32
  • I've spent far too long looking at this (much of this evening as well, in fact), and whilst I've been merrily testing a variation of the answer here: https://stackoverflow.com/a/41797153/572634 I like yours a LOT more for its visual simplicity/readability. No doubt I'll have some questions tomorrow morning, but I shall arrive in the office with this ass-kicker of a function. You've made my day/week. :-) – AlexFielder Aug 28 '19 at 22:45
  • PS. Do you think it's possible to push the output to another module such as "Import-Excel" as described here: https://dfinke.github.io/powershell/2019/07/31/Creating-beautiful-Powershell-Reports-in-Excel.html? – AlexFielder Aug 28 '19 at 22:50
  • @AlexFielder: I would think so, yes, but note that extra work is needed if the output should be in the same order as the inputs. Another thing to keep an eye on: creating 1 thread per file may perform poorly; perhaps _batching_ helps with that (multiple files per thread). – mklement0 Aug 28 '19 at 22:52
  • That's exactly the issue I just ran into. Am currently _"Waiting for 14979 jobs to complete"_ whilst PowerShell consumes >4GB of memory – AlexFielder Aug 29 '19 at 08:57
  • The section of this article titled "Powershell Jobs": https://blogs.technet.microsoft.com/uktechnet/2016/06/20/parallel-processing-with-powershell/ looks like it will help alleviate bottlenecks with Job Numbers; my guess is I would have to set a minimum # of files per job/thread of say 1000 or so? – AlexFielder Aug 29 '19 at 10:02
  • @AlexFielder: You'll have to implement your own batching (chunking) of the inputs - please see my update; the added solution allows you to experiment with the batch size. – mklement0 Aug 29 '19 at 22:03