
Following on from this thread, "Copy-item using invoke-async in Powershell", I have the following:

@mklement0's method (copied from here and amended) works, but because it creates a thread per file it is exceptionally slow; on my test system, working with ~14,000 files, it consumed more than 4 GB of memory:

# This works but is INCREDIBLY SLOW because it creates a thread per file
# Create sample CSV file with 10 rows.
$FileList = Join-Path ([IO.Path]::GetTempPath()) "tmp.$PID.csv"
@'
Foo,SrcFileName,DestFileName,Bar
1,c:\tmp\a,\\server\share\a,baz
2,c:\tmp\b,\\server\share\b,baz
3,c:\tmp\c,\\server\share\c,baz
4,c:\tmp\d,\\server\share\d,baz
5,c:\tmp\e,\\server\share\e,baz
6,c:\tmp\f,\\server\share\f,baz
7,c:\tmp\g,\\server\share\g,baz
8,c:\tmp\h,\\server\share\h,baz
9,c:\tmp\i,\\server\share\i,baz
10,c:\tmp\j,\\server\share\j,baz
'@ | Set-Content $FileList

# How many threads at most to run concurrently.
$NumCopyThreads = 8

Write-Host 'Creating jobs...'
$dtStart = [datetime]::UtcNow

# Import the CSV data and transform it to [pscustomobject] instances
# with only .SrcFileName and .DestFileName properties - they take
# the place of your original [fileToCopy] instances.
$jobs = Import-Csv $FileList | Select-Object SrcFileName, DestFileName | 
  ForEach-Object {
    # Start the thread job for the file pair at hand.
    Start-ThreadJob -ThrottleLimit $NumCopyThreads -ArgumentList $_ { 
        param($f) 
        [System.IO.Fileinfo]$DestinationFilePath = $f.DestFileName
        [String]$DestinationDir = $DestinationFilePath.DirectoryName
        if (-not (Test-path([Management.Automation.WildcardPattern]::Escape($DestinationDir)))) {
            new-item -Path $DestinationDir -ItemType Directory #-Verbose
        }
        copy-item -path $f.srcFileName -Destination $f.destFilename
        "Copied $($f.SrcFileName) to $($f.DestFileName)"
    }
  }

Write-Host "Waiting for $($jobs.Count) jobs to complete..."

# Synchronously wait for all jobs (threads) to finish and output their results
# *as they become available*, then remove the jobs.
# NOTE: Output will typically NOT be in input order.
Receive-Job -Job $jobs -Wait -AutoRemoveJob
Write-Host "Total time lapsed: $([datetime]::UtcNow - $dtStart)"

# Clean up the temp. file
Remove-Item $FileList

This article (the PowerShell Jobs section in particular) gave me the idea of splitting the complete list into batches of 1,000 files. When I run it against my test case I get 15 threads (as I have ~14,500 files), but each thread only processes the first file in its "chunk" and then stops:

<#
.SYNOPSIS
<Brief description>
For examples type:
Get-Help .\<filename>.ps1 -examples
.DESCRIPTION
Copies files from one path to another
.PARAMETER FileList
e.g. C:\path\to\list\of\files\to\copy.txt
.PARAMETER NumCopyThreads
default is 8 (but can be 100 if you want to stress the machine to maximum!)
.PARAMETER LogName
default is output.csv located in the same path as the Filelist
.EXAMPLE
to run using defaults just call this file:
.\CopyFilesToBackup
to run using anything else use this syntax:
.\CopyFilesToBackup -FileList C:\path\to\list\of\files\to\copy.txt -NumCopyThreads 20 -LogName C:\temp\backup.log
.\CopyFilesToBackup -FileList .\copytest.csv -NumCopyThreads 30 -Verbose
.NOTES
#>

[CmdletBinding()] 
Param( 
    [String] $FileList = "C:\temp\copytest.csv", 
    [int] $NumCopyThreads = 8,
    [String] $LogName
) 

$filesPerBatch = 1000

$files = Import-Csv $FileList | Select-Object SrcFileName, DestFileName

$i = 0
$j = $filesPerBatch - 1
$batch = 1

Write-Host 'Creating jobs...'
$dtStart = [datetime]::UtcNow

$jobs = while ($i -lt $files.Count) {
    $fileBatch = $files[$i..$j]

    $jobName = "Batch$batch"
    Start-ThreadJob -Name $jobName -ThrottleLimit $NumCopyThreads -ArgumentList ($fileBatch) -ScriptBlock {
        param($filesInBatch)
        foreach ($f in $filesInBatch) {
            [System.IO.Fileinfo]$DestinationFilePath = $f.DestFileName
            [String]$DestinationDir = $DestinationFilePath.DirectoryName
            if (-not (Test-path([Management.Automation.WildcardPattern]::Escape($DestinationDir)))) {
                new-item -Path $DestinationDir -ItemType Directory -Verbose
            }
            copy-item -path $f.srcFileName -Destination $f.DestFileName -Verbose
        }
    } 

    $batch += 1
    $i = $j + 1
    $j += $filesPerBatch

    if ($i -gt $files.Count) {$i = $files.Count}
    if ($j -gt $files.Count) {$j = $files.Count}
}

Write-Host "Waiting for $($jobs.Count) jobs to complete..."

Receive-Job -Job $jobs -Wait -AutoRemoveJob
Write-Host "Total time lapsed: $([datetime]::UtcNow - $dtStart)"

I feel like I'm missing something obvious but I don't know what.

Can anyone help?

AlexFielder

2 Answers


Change:

Start-ThreadJob -Name $jobName -ThrottleLimit $NumCopyThreads -ArgumentList ($fileBatch) -ScriptBlock {

to

Start-ThreadJob -Name $jobName -ThrottleLimit $NumCopyThreads -ArgumentList (,$fileBatch) -ScriptBlock {

Note the comma before $fileBatch in argument list.

This fixes it because -ArgumentList expects an array and binds each element to the script block's parameters in order. You're trying to pass the entire array to the first parameter, which means you have to wrap your array inside another array.

Apparently (this is news to me), PowerShell will happily treat a single object as a one-item collection in the foreach loop, which is why only the first item in each batch is processed.
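
The difference is easy to reproduce without any threads. The sketch below is an illustration only: it uses Invoke-Command with a local script block, on the assumption that it unpacks -ArgumentList the same way, and the names `$batch`, `$countItems`, `$withoutComma` and `$withComma` are made up for this example.

```powershell
# Three items standing in for a batch of file records.
$batch = 'a', 'b', 'c'

# A script block with a single parameter, like the one given to Start-ThreadJob.
$countItems = { param($items) @($items).Count }

# Without the comma, each element becomes its own argument:
# $items receives only 'a'; 'b' and 'c' end up in $args.
$withoutComma = Invoke-Command -ScriptBlock $countItems -ArgumentList $batch

# With the unary comma, the batch is wrapped in a one-element array,
# so the whole batch binds to $items.
$withComma = Invoke-Command -ScriptBlock $countItems -ArgumentList (,$batch)

"Without comma: $withoutComma"   # Without comma: 1
"With comma:    $withComma"      # With comma:    3
```

The wrapper is not needed when the batch is one of several arguments (for example `-ArgumentList $fileBatch, $LogName`), because the batch is then already a single element of the outer array.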

Doug Richardson
  • That's mad. Thank you for finding this. I need to make a t-shirt that says "Yes, I forgot a comma". – AlexFielder Aug 29 '19 at 18:28
  • In my comp sci 101 course in university my instructor told me he spent hours on a problem that turned out to be an errant comma. Made me question why I was there ;) – Doug Richardson Aug 29 '19 at 18:38
  • I'm curious how you even began to track that down in this case? I've been using VSCode recently and the debugging feature is pretty handy, but for this stuff it gets to the Thread section and turns into vapourware; I realise that by their very nature Threads are hard to debug but ¯\_(ツ)_/¯ - I've been toying with getting a license of PowerShell Pro Tools but am not sure it's worth it: https://poshtools.com/powershell-pro-tools/ Any advice you can give will be great! – AlexFielder Aug 29 '19 at 20:06
  • I noticed that the code running in the job didn't print anything when I tried to log, so I used the `Add-Content` function to append data to log files called Batch1, Batch2, etc (i.e., the names of the jobs). My logs showed the loop was only running once and that the length of the input array of batch files was also 1. At that point, I looked at the input parameter which was fine before `-ArgumentList` so I focused on `-ArgumentList`. I really don't know anything... I'm just a process of elimination monkey ;) – Doug Richardson Aug 29 '19 at 20:11
  • Can you share a sample of how to use the Add-Content cmdlet? (I'll have to figure it out eventually anyway and you'll save me some time.) – AlexFielder Aug 29 '19 at 21:17
  • @AlexFielder `Add-Content "output-file1.txt" "Line 1"`. Every time you call it, it adds a line to the output file. I used a different file for each thread/job by using the job name as the output filename. – Doug Richardson Aug 29 '19 at 21:39
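
The per-job logging described in these comments might look roughly like the sketch below. The log location, the `$jobName` value and the sample items are assumptions for illustration; inside a real job the same Add-Content calls would sit in the script block.

```powershell
# Assumed for illustration: the job name doubles as the log file name.
$jobName = 'Batch1'
$logPath = Join-Path ([IO.Path]::GetTempPath()) "$jobName.$PID.log"

# Start from a clean file so repeated runs do not accumulate lines.
Remove-Item -Path $logPath -ErrorAction SilentlyContinue

# Stand-in for the batch handed to the job.
$filesInBatch = 'a', 'b', 'c'

# Record how many items the job actually received, then one line per item.
Add-Content -Path $logPath -Value "Received $(@($filesInBatch).Count) item(s)"
foreach ($f in $filesInBatch) {
    Add-Content -Path $logPath -Value "Processing $f"
}

# Read the log back and clean up.
$logLines = Get-Content -Path $logPath
Remove-Item -Path $logPath
$logLines
```

A log showing "Received 1 item(s)" for every batch is the kind of evidence that points at how the arguments were passed rather than at the loop itself.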

It has taken a week of trial and error to arrive at this point, and on the whole I'm pretty happy with the results. The script below takes care of three steps in processing the files I'm working with:

  1. Creates folders
  2. Copies files to new folders
  3. Verifies files have copied without error

It does this in less than a third of the time that steps 1 and 2 alone took in Excel (using FileSystemObject to copy files).

<#
.SYNOPSIS
<Brief description>
For examples type:
Get-Help .\<filename>.ps1 -examples
.DESCRIPTION
Copies files from one path to another
.PARAMETER FileList
e.g. C:\path\to\list\of\files\to\copy.txt
.PARAMETER NumCopyThreads
default is 75 (but can be 100 if you want to stress the machine to maximum!)
.PARAMETER FilesPerBatch
default is 1000; this can be tweaked if performance becomes an issue, because the threading will HAMMER any network you run it on.
.PARAMETER LogName
Desired log file output. Must include full or relative (.\blah) path. If blank, location of FileList is used.
.PARAMETER DryRun
Boolean value denoting whether we're testing this thing or not. (Default is $false)
.PARAMETER DryRunNum
The number of files to Dry Run. (Default is 100)
.EXAMPLE
to run using defaults just call this file:
.\CopyFilesToBackup
to run using anything else use this syntax:
.\CopyFilesToBackup -FileList C:\path\to\list\of\files\to\copy.txt -NumCopyThreads 20 -LogName C:\temp\backup.log
.\CopyFilesToBackup -FileList .\copytest.csv -NumCopyThreads 30 -Verbose
.NOTES
#>

[CmdletBinding()] 
Param( 
    [String] $FileList = "C:\temp\copytest.csv", 
    [int] $NumCopyThreads =75,
    [String] $JobName,
    [int] $FilesPerBatch = 1000,
    [String] $LogName,
    [Boolean] $DryRun = $false, #$true,
    [int] $DryRunNum = 100
) 



Write-Host 'Creating log file if it does not exist...'

function CreateFile([string]$filepath) {
    # Create the file only if it is missing, then report whether it exists.
    $escapedPath = [Management.Automation.WildcardPattern]::Escape($filepath)
    if (-not (Test-Path $escapedPath)) {
        # Out-Null keeps the new FileInfo object out of the function's return value.
        New-Item -Path $filepath -ItemType File | Out-Null
    }
    return (Test-Path $escapedPath)
}

$dtStart = [datetime]::UtcNow

if ($LogName -eq "") {
    # Default to <FileList base name>.log in the same folder as the file list.
    [System.IO.FileInfo]$CsvPath = $FileList
    $LogName = Join-Path $CsvPath.DirectoryName ($CsvPath.BaseName + ".log")
}
if (-not (CreateFile $LogName)) {
    Write-Host "Unable to create log, exiting now!"
    # 'return' ends this script; 'break' outside a loop can also stop the caller.
    return
}

Add-Content -Path $LogName -Value "[INFO],[Src Filename],[Src Hash],[Dest Filename],[Dest Hash]"

Write-Host 'Loading CSV data into memory...'

$files = Import-Csv $FileList | Select-Object SrcFileName, DestFileName

Write-Host 'CSV Data loaded...'

Write-Host 'Collecting unique Directory Names...'

$allFolders = New-Object "System.Collections.Generic.List[PSCustomObject]"

ForEach ($f in $files) {
    [System.IO.Fileinfo]$DestinationFilePath = $f.DestFileName
    [String]$DestinationDir = $DestinationFilePath.DirectoryName
    $allFolders.add($DestinationDir)
}

# Get-Unique only collapses adjacent duplicates, so sort while de-duplicating.
$folders = $allFolders | Sort-Object -Unique

Write-Host 'Creating Directories...'
foreach($DestinationDir in $folders) {
    if (-not (Test-path([Management.Automation.WildcardPattern]::Escape($DestinationDir)))) {
        new-item -Path $DestinationDir -ItemType Directory | Out-Null #-Verbose
    }
}
Write-Host 'Finished Creating Directories...'
$scriptBlock = {
    param(
        [object[]]$filesInBatch,
        [String]$LogFileName)
    function ProcessFileAndHashToLog {
        param( [String]$LogFileName, [object[]]$FileColl)
        # One named mutex per job serialises writes to the shared log file.
        $mutex = New-Object -TypeName 'Threading.Mutex' -ArgumentList $false, 'MyInterProcMutex'
        try {
            foreach ($f in $FileColl) {
                # Destination folders were already created before the jobs started.
                Copy-Item -Path $f.SrcFileName -Destination $f.DestFileName

                # Hash source and destination so the log shows whether the copy is intact.
                $srcHash = (Get-FileHash -Path $f.SrcFileName -Algorithm SHA1).Hash # could also use MD5 here but it needs testing
                if (Test-Path ([Management.Automation.WildcardPattern]::Escape($f.DestFileName))) {
                    $destHash = (Get-FileHash -Path $f.DestFileName -Algorithm SHA1).Hash
                } else {
                    $destHash = $f.DestFileName + " not found at location."
                }
                # Rebuilt for every file so a failed hash never reuses the previous line.
                $info = $f.SrcFileName + "," + $srcHash + "," + $f.DestFileName + "," + $destHash
                $DateTime = Get-Date -Format "yyyy-MM-dd HH:mm:ss:fff"

                $mutex.WaitOne() | Out-Null
                try {
                    # $DryRun is only visible when the block runs in the caller's scope (dry run).
                    if ($DryRun) { Write-Host "Writing to log file: $LogFileName..." }
                    Add-Content -Path $LogFileName -Value "$DateTime,$info"
                } finally {
                    $mutex.ReleaseMutex()
                }
            }
        } finally {
            $mutex.Dispose()
        }
    }
    ProcessFileAndHashToLog -LogFileName $LogFileName -FileColl $filesInBatch
}

$i = 0
$j = $filesPerBatch - 1
$batch = 1
Write-Host 'Creating jobs...'
if (-not ($DryRun)) {
    $jobs = while ($i -lt $files.Count) {
        $fileBatch = $files[$i..$j]
        # Name each job after its batch number; $JobName is an optional prefix and may be empty.
        Start-ThreadJob -Name "${JobName}Batch$batch" -ArgumentList $fileBatch, $LogName -ScriptBlock $scriptBlock #-ThrottleLimit $NumCopyThreads
        $batch += 1
        $i = $j + 1
        $j += $filesPerBatch
        if ($i -gt $files.Count) {$i = $files.Count}
        if ($j -gt $files.Count) {$j = $files.Count}
    }
    Write-Host "Waiting for $($jobs.Count) jobs to complete..."
    Receive-Job -Job $jobs -Wait -AutoRemoveJob
} else {
    Write-Host 'Going in Dry...'
    # Indexes are zero-based, so stop one short to get exactly $DryRunNum files.
    $DummyFileBatch = $files[$i..($i + $DryRunNum - 1)]
    & $scriptBlock -filesInBatch $DummyFileBatch -LogFileName $LogName
    Write-Host 'That wasn''t so bad was it..?'
}

Write-Host "Total time lapsed: $([datetime]::UtcNow - $dtStart)"

(I'll happily accept suggestions that improve the above solution.)
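
One possible tidy-up, offered as a sketch rather than a tested replacement: the `$i`/`$j` arithmetic could be moved into a small helper that clamps the last index to the array bounds. `Split-IntoBatches` is a hypothetical name that does not appear in the script above, and the numbers stand in for the file records.

```powershell
# Hypothetical helper: split a list into batches of at most $Size items.
function Split-IntoBatches {
    param([object[]]$Items, [int]$Size = 1000)
    for ($start = 0; $start -lt $Items.Count; $start += $Size) {
        # Clamp the end index so the last batch never reaches past the array.
        $end = [Math]::Min($start + $Size, $Items.Count) - 1
        $batch = $Items[$start..$end]
        # The unary comma emits the batch as one array instead of unrolling it.
        , $batch
    }
}

# 2,500 stand-in items split into batches of 1,000.
$batches = @(Split-IntoBatches -Items (1..2500) -Size 1000)
"Batches: $($batches.Count)"            # Batches: 3
"Last batch: $($batches[2].Count)"      # Last batch: 500
```

Each element of `$batches` could then be passed to Start-ThreadJob in a foreach loop, which removes the need to adjust `$i` and `$j` by hand.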

AlexFielder