
I'm using a small laptop to copy video files on location to multiple memory sticks (~8GB). The copy has to be done without supervision once it's started and has to be fast.

I've identified a serious bottleneck: when making several copies (e.g. 4 sticks from 2 cameras, i.e. 8 transfers of ~8 GB each), the repeated reads use a lot of bandwidth, especially since the cameras are on USB 2.0 interfaces (two ports) and have limited throughput.

If I had Unix I could use something like tar -cf - . | tee >(tar -xf - -C /stick1) | tar -xf - -C /stick2, etc., which means I'd only have to pull one copy (2*8 GB) from each camera, once, over the USB 2.0 interface.

The memory sticks are generally on a hub on the single USB 3.0 interface, which is driven on a different channel, so they write sufficiently fast.

For reasons, I'm stuck using the current Win10 PowerShell.

I'm currently writing the whole command to a string (concatenating the various sources and the various targets) and then using Invoke-Process to execute the copy while I'm entertaining and buying the rounds in the pub after the shoot (hence the necessity to be AFK).
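
A rough sketch of that pattern, for concreteness (the paths and array names are assumptions, the choice of robocopy as the copier is purely illustrative, and since Invoke-Process is not a built-in cmdlet the built string is run with Invoke-Expression here):

$cameraPaths = 'F:\DCIM', 'G:\DCIM'                                     # camera mounts on USB 2.0 (assumed)
$stickPaths  = 'J:\footage', 'K:\footage', 'L:\footage', 'M:\footage'   # sticks on the USB 3.0 hub (assumed)

foreach ($camera in $cameraPaths) {
    foreach ($stick in $stickPaths) {
        # One command string per camera/stick pair; every pair re-reads the camera,
        # which is exactly the USB 2.0 read bottleneck described above.
        $cmd = "robocopy $camera $stick /E"
        Invoke-Expression $cmd
    }
}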

I can tar cf - | tar xf a single file, but can't seem to get the tee functioning correctly.

I can also successfully use the microSD slot to do a single camera's card, which is not as physically nice but is fast for one camera's recording; I still have the bandwidth issue on the remaining camera(s), though. We may end up with 4-5 source cameras at the same time, which means read-once, write-many is still going to be an issue.

Edit: I've just advanced to playing with Get-Content -raw | tee \stick1\f1 | tee \stick2\f1 | Out-Null. Haven't done timings or file verification yet...
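
Spelled out, the pattern being tried looks roughly like the sketch below (paths and variable names are placeholders; as the next edit and the comments note, the Windows PowerShell 5.1 pipeline and Tee-Object treat the data as text, so this does not produce byte-for-byte copies of binary files):

# Sketch of the attempted tee pipeline -- tee is an alias for Tee-Object.
Get-Content -Raw -Path $sourceFile |
    Tee-Object -FilePath '\stick1\f1' |
    Tee-Object -FilePath '\stick2\f1' |
    Out-Null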

Edit2: It seems like Get-Content -raw works properly, but the behaviour of PowerShell pipelines violates two of the fundamental commandments of programming: a program shall do one thing and do it well, and thou shalt not mess with the data stream. For some unknown reason the default (and only) PowerShell pipeline behaviour always modifies the data stream it is supposed to transfer from one process to the next. There doesn't seem to be a -raw option for the pipeline, nor a $session or $global setting I can use to remedy the mutilation.

How do PowerShell people transfer raw binary from one stream out, into the next process?

    I don't think the problem is the pipeline, but the encoding of `Tee-Object`. Which PowerShell version are you using (check the variable `$PSVersionTable`)? – stackprotector Dec 04 '21 at 15:23
  • As of PowerShell 7.2, output from external programs is invariably decoded _as text_ before further processing, which means that _raw byte output_ can neither be passed on via `|` nor captured with `>` - and performance invariably suffers. The workaround is to call your external program via `cmd /c` (Windows) / `sh -c` (Unix-like platforms) and use _their_ `|` and `>` operators. See [this answer](https://stackoverflow.com/a/59118502/45375). – mklement0 Dec 05 '21 at 00:01
  • If you install the **Windows Subsystem for Linux** on your Windows 10 machine "A starter set of commands is shown here, but you can generate a wrapper for any Linux command simply by adding it to the list. If you add this code to your PowerShell profile, these commands will be available to you in every PowerShell session just like native commands!" – NeoTheNerd Dec 05 '21 at 09:26
  • @StackProtector, It was definitely the pipeline. I could do (PS 5.1, where $ws is an ArrayList of target drives): $t = [System.Collections.ArrayList]@(); foreach ($stick in $ws) { $t.Add(-join($stick, "\", $i)) | Out-Null }; Write-Output "copying $i"; Get-Content -Raw -Encoding Byte -Path $s | Set-Content -Encoding Byte -Path $t. The Byte encoding converts the binary to decimal numbers, e.g. a space (chr 32) becomes the text "32" followed by CR+LF, with each character on its own line. Wonderfully robust for transferring files through any environment, but hideous for speed. There was no -Raw, and worse, anything over ~1 MB just locked up. – mist42nz Dec 08 '21 at 02:27
  • @NeoTheNerd indeed I could but that means going back to old habits rather than keeping up with new developments. The other factor was that this machine is an older laptop with SSD and space is at a premium so I really want to keep installations and extra libraries to a minimum, especially when just adding a single feature/cmd. – mist42nz Dec 08 '21 at 02:36
  • @mklement0 PowerShell 5.1 at the moment, which mangles the streams as well. If you had submitted this as an answer I would likely have awarded you the bounty, as it resulted in a very effective solution to the problem, if not a direct address of the technique. $c = -join('Robocopy ', $directory, ' ', $stick, ' /S /MIN:1000000 /J /MT:8'); cmd /c "$c" (or similar; see the sketch after these comments) is effective at accomplishing the copy at the best speed, but is stuck in sequential repeated reads, which causes the bottleneck: I'm reading 2x8 GB off a USB 2.0 interface (2 cameras) and writing to 3-7 USB 3.0 sticks, so 4-12 extra reads on USB 2.0. – mist42nz Dec 08 '21 at 02:51
  • @mist42nz Cannot reproduce. On my machine (Win 10, PS 5.1) `get-content -Raw -Encoding Byte -Path $s | Set-Content -Encoding Byte -Path $t` creates an exact copy of the file, does not matter if plain ASCII or binary, no bytes are modified. – stackprotector Dec 08 '21 at 06:53
  • @stackprotector You need to actually look at what's happening to reproduce it. Drop the pipe and do a Get-Content -Raw etc. on its own. You will see it does NOT produce "raw" output; it produces ASCII strings, one per line, giving the base-10 number of each byte. So chr(32), a space, comes out as "32" followed by CR+LF. That is a huge translation overhead, but the pipe can pass it along as the lines of text that it is. Set-Content can write multiple files but has to decode the ASCII (more overhead), and they crash on large files. So if you take a binary data file and transfer that as data, to the pipe – mist42nz Dec 11 '21 at 20:56
  • @stackprotector So if you take a binary data file and transfer that as data to the pipe - e.g. in my case the playlist file, which is binary data - it goes in at 436 bytes and comes out looking different: the file size is now 548, and there are extra characters riddled through the file. This is due to the pipe (aka "shove one thing into a pipe, what do you get out the other side - the same thing, because if you pipe something from A to B you should get out the far end what you shoved in this end") in Microsoft-land mutilating the non-text characters and adding extra "\r" and "\n" where it wants. – mist42nz Dec 11 '21 at 20:58
  • @mist42nz If you just execute `get-content -Raw -Encoding Byte -Path $s`, then it is equivalent to `get-content -Raw -Encoding Byte -Path $s | Out-Default` which is formatting and printing your bytes. That's why you see CR&LF characters in the output. But those characters are not there if you don't print the bytes. Try `get-content -Raw -Encoding Byte -Path $s | Write-Host`, no CR&LF. – stackprotector Dec 12 '21 at 20:37
  • @stackprotector Write-Host isn't useful for this use case; I have to be able to pipe it to other things. Write-Host basically does a reverse encoding, such as Set-Content uses, versus, as you say, the default Out-Default. The problem is that if I pipe the Get-Content output to anything else, I get the default mangled form. I'm not outputting to the host screen, nor to a single host file (e.g. for Out-File). I have to be able to split the streams or otherwise use the data stream... and that triggers the other stuff. So while the data *in* the pipe is intact, and can be retrieved by a few end-of-pipeline functions – mist42nz Dec 14 '21 at 05:23
  • @stackprotector the process of retrieving/exiting the pipe corrupts it unavoidably. Initially I thought it was a Tee issue, but it's not; it's the exit of the pipe, passing through Out-Default. – mist42nz Dec 14 '21 at 05:26
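
For reference, a sketch of the cmd /c + Robocopy workaround described in the comments above ($directory and $ws, the list of target sticks, are names carried over from the comments; the Robocopy switches are the ones quoted there). It sidesteps the PowerShell pipeline entirely, but still re-reads the source once per stick:

foreach ($stick in $ws) {
    # Build the external command as a single string and let cmd run it,
    # so no PowerShell pipeline ever touches the file data.
    $c = -join('Robocopy ', $directory, ' ', $stick, ' /S /MIN:1000000 /J /MT:8')
    cmd /c $c
}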

1 Answer


Maybe not quite what you want (if you insist on using built-in PowerShell commands), but if you care about speed, use streams and asynchronous read/write. PowerShell is a great tool because it can use any .NET class seamlessly.

The script below can easily be extended to write to more than 2 destinations (a sketch of one possible generalization to N targets follows the code) and can potentially handle arbitrary streams. You might want to add some error handling via try/catch there too. You may also try playing with buffered streams and various buffer sizes to optimize the code.

-- 2021-12-09 update: the code has been modified slightly to reflect suggestions from the comments.

# $InputPath, $Output1Path, $Output2Path are parameters
[Threading.CancellationTokenSource] $cancellationTokenSource = [Threading.CancellationTokenSource]::new()
[Threading.CancellationToken] $cancellationToken = $cancellationTokenSource.Token

[int] $bufferSize = 64*1024

$fileStreamIn = [IO.FileStream]::new($inputPath,[IO.FileMode]::Open,[IO.FileAccess]::Read,[IO.FileShare]::None,$bufferSize,[IO.FileOptions]::SequentialScan)
$fileStreamOut1 = [IO.FileStream]::new($output1Path,[IO.FileMode]::CreateNew,[IO.FileAccess]::Write,[IO.FileShare]::None,$bufferSize)
$fileStreamOut2 = [IO.FileStream]::new($output2Path,[IO.FileMode]::CreateNew,[IO.FileAccess]::Write,[IO.FileShare]::None,$bufferSize)

try{
    [Byte[]] $bufferToWriteFrom = [byte[]]::new($bufferSize)
    [Byte[]] $bufferToReadTo = [byte[]]::new($bufferSize)
    $Time = [System.Diagnostics.Stopwatch]::StartNew()

    # Prime the loop with a synchronous first read.
    $bytesRead = $fileStreamIn.Read($bufferToReadTo,0,$bufferSize)

    while ($bytesRead -gt 0){
        # Swap buffers: the chunk just read becomes the write buffer, while the
        # other buffer is refilled from the source at the same time.
        $bufferToWriteFrom,$bufferToReadTo = $bufferToReadTo,$bufferToWriteFrom
        $writeTask1 = $fileStreamOut1.WriteAsync($bufferToWriteFrom,0,$bytesRead,$cancellationToken)
        $writeTask2 = $fileStreamOut2.WriteAsync($bufferToWriteFrom,0,$bytesRead,$cancellationToken)
        $readTask = $fileStreamIn.ReadAsync($bufferToReadTo,0,$bufferSize,$cancellationToken)
        # Wait for both writes and the next read to complete before the next pass.
        $writeTask1.Wait()
        $writeTask2.Wait()
        $bytesRead = $readTask.GetAwaiter().GetResult()
    }
    $time.Elapsed.TotalSeconds
}
catch {
    throw $_
}
finally{
    $fileStreamIn.Close()
    $fileStreamOut1.Close()
    $fileStreamOut2.Close()
}
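
The same pattern generalized to an arbitrary number of destinations might look like the sketch below (not part of the original answer; $inputPath and an $outputPaths array are assumed parameters, and cancellation is omitted for brevity):

[int] $bufferSize = 64*1024

# Reader gets the SequentialScan hint; one writer stream per destination path.
$fileStreamIn = [IO.FileStream]::new($inputPath, [IO.FileMode]::Open, [IO.FileAccess]::Read, [IO.FileShare]::None, $bufferSize, [IO.FileOptions]::SequentialScan)
$outputStreams = foreach ($path in $outputPaths) {
    [IO.FileStream]::new($path, [IO.FileMode]::CreateNew, [IO.FileAccess]::Write, [IO.FileShare]::None, $bufferSize)
}

try {
    [byte[]] $bufferToWriteFrom = [byte[]]::new($bufferSize)
    [byte[]] $bufferToReadTo = [byte[]]::new($bufferSize)

    $bytesRead = $fileStreamIn.Read($bufferToReadTo, 0, $bufferSize)
    while ($bytesRead -gt 0) {
        $bufferToWriteFrom, $bufferToReadTo = $bufferToReadTo, $bufferToWriteFrom
        # One WriteAsync per destination plus the next ReadAsync, all overlapping.
        $writeTasks = foreach ($stream in $outputStreams) {
            $stream.WriteAsync($bufferToWriteFrom, 0, $bytesRead)
        }
        $readTask = $fileStreamIn.ReadAsync($bufferToReadTo, 0, $bufferSize)
        [Threading.Tasks.Task]::WaitAll([Threading.Tasks.Task[]] $writeTasks)
        $bytesRead = $readTask.GetAwaiter().GetResult()
    }
}
finally {
    $fileStreamIn.Close()
    foreach ($stream in $outputStreams) { $stream.Close() }
}
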
  • This works with some juggling. I ended up with $fileStreamOut = New-Object ... as PS 5.1 was happier with that (see the sketch after these comments). Buffer size is a big issue, I think; the first runs were extremely slow. I also think I have a big slowdown in the Wait cycle, since I loop through several targets. Why the swap of buffers at the end? – mist42nz Dec 08 '21 at 02:53
  • Two buffers are used because one buffer is being written to the destinations while the other buffer is being filled with new data from the input *at the same time*. When the two writer tasks and the one reader task are done, I swap them, so that the recently updated buffer will be used by the writers, and the recently written buffer can be discarded and reused to fetch the next chunk of data. – Igor N. Dec 08 '21 at 03:56
  • The "wait cycles" are just synchronizing the tasks: you are basically waiting until all writers are done writing and the reader is done fetching the next set of data. One important point is that FileStream's internal buffer size is 4096 by default; it can be adjusted, and I would recommend trying [different values there](https://docs.microsoft.com/en-us/dotnet/api/system.io.filestream.-ctor?view=net-6.0#System_IO_FileStream__ctor_System_String_System_IO_FileMode_System_IO_FileAccess_System_IO_FileShare_System_Int32_System_IO_FileOptions_) (the optimal value is usually specific to the hardware used). – Igor N. Dec 08 '21 at 04:09
  • The buffer size (10,000) that I used is just the number of bytes between task synchronizations. It should probably be a multiple of FileStream's internal buffer size. For large files, the default 4096 is probably not the best value, and increasing it to 100k or so could speed up the transfer by some factor. – Igor N. Dec 08 '21 at 04:18
  • Passing the `[IO.FileOptions]::SequentialScan` hint for `$fileStream` can improve performance. I'd size the buffers to some multiple of the filesystem cluster size; 64 KB would be a good starting point. I've experimented in the past with getting the best performance out of `FileStream` and I think after about 4 MB there were diminishing returns on buffer size, though that was with a RAM drive. Also remember that PowerShell code is _slow_ compared to the .NET calls, so the less you run the better; 8 GB / 4 KB buffer = 2,097,152 loop iterations vs. 4 MB buffer = 2,048 loop iterations. – Lance U. Matthews Dec 08 '21 at 07:34
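
For completeness, the New-Object construction mentioned in the first comment above might look roughly like this (a sketch; the arguments are the same ones the answer passes to [IO.FileStream]::new):

# Alternative constructor syntax that also works on Windows PowerShell 5.1.
$fileStreamOut1 = New-Object System.IO.FileStream -ArgumentList @(
    $output1Path, [IO.FileMode]::CreateNew, [IO.FileAccess]::Write, [IO.FileShare]::None, $bufferSize
)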