Your own Python answer is likely the simplest and best-performing solution.
If you happen to have WSL installed, you can try the following to merge all file*.txt files into combined.txt while stripping the UTF-8 BOM from each (syntax is for calling from PowerShell):
bash.exe -c 'for f in file*.txt; do tail -c +4 \"\$f\"; done > combined.txt'
tail -c +4 strips the first 3 bytes from each file and passes the remaining bytes through. Note that the entire output from a for loop can be captured by applying a redirection > to the for statement as a whole.
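For comparison, the same byte-level technique can be sketched in cross-platform Python, along the lines of the OP's own answer (the explicit BOM check is an addition of this sketch, so files that happen to lack a BOM are passed through unchanged):

```python
import glob

BOM = b"\xef\xbb\xbf"  # the 3-byte UTF-8 BOM that tail -c +4 skips

with open("combined.txt", "wb") as out:
    for name in sorted(glob.glob("file*.txt")):
        with open(name, "rb") as f:
            data = f.read()
        # Strip the leading BOM only if one is actually present.
        out.write(data[len(BOM):] if data.startswith(BOM) else data)
```

Unlike tail -c +4, which blindly drops the first 3 bytes, the startswith check makes the strip conditional; note that this sketch reads each file into memory as a whole, so very large files would call for chunked reads.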
Note: Neither escaping " nor $ with \ should be necessary here, but is as of this writing: calling bash.exe with -c unexpectedly subjects the command string to up-front string interpolation, necessitating escaping; note that the following, seemingly equivalent call via wsl.exe -e does not exhibit this problem (which is why $f isn't escaped as \$f here):
wsl.exe -e bash -c 'for f in file*.txt; do tail -c +4 \"$f\"; done > combined.txt'
Independently, up to at least PowerShell 7.2.2, PowerShell's argument passing to external programs is fundamentally broken with respect to empty arguments and arguments with embedded " chars., necessitating manual \-escaping - see this answer.
As for native PowerShell solutions:
The size of your files likely necessitates a memory-efficient streaming solution.
However, given the object-based nature of the PowerShell pipeline, with no raw byte support, this is likely to be prohibitively slow, especially with byte-by-byte processing, where each byte must be converted to and from a .NET [byte] object in memory.
For the record, here's the PowerShell solution, though it is likely too slow for large files:
# !! Unfortunately, the syntax for requesting byte-by-byte processing
# !! has changed between Windows PowerShell and PowerShell (Core) 7+.
$byteStreamParam =
  if ($IsCoreCLR) { @{ AsByteStream = $true } }
  else            { @{ Encoding = 'Byte' } }
Get-ChildItem -Filter file*.txt |
  ForEach-Object {
    $_ | Get-Content @byteStreamParam | Select-Object -Skip 3
  } |
  Set-Content @byteStreamParam -LiteralPath combined.txt
However, you can significantly improve performance by using Get-Content's -ReadCount parameter to read the files in chunks (arrays of bytes). The larger the chunk size - memory permitting - the more runtime performance will improve:
$byteStreamParam =
  if ($IsCoreCLR) { @{ AsByteStream = $true } }
  else            { @{ Encoding = 'Byte' } }
# How many bytes to read at a time.
$chunkSize = 256mb
Get-ChildItem -Filter file*.txt |
  ForEach-Object {
    $first = $true
    $_ | Get-Content @byteStreamParam -ReadCount $chunkSize | ForEach-Object {
      # Strip the 3-byte BOM from the first chunk of each file only.
      if ($first) { $_[3..($_.Count-1)]; $first = $false }
      else        { $_ }
    }
  } |
  Set-Content @byteStreamParam -LiteralPath combined.txt
Text-based PowerShell solutions:
Text-based solutions, while slower, have the advantage of enabling transcoding, i.e. transforming files from one character encoding to another, using the -Encoding parameter of Get-Content and Set-Content.
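For comparison, the same transcoding idea can be sketched in Python: the utf-8-sig codec transparently drops a leading BOM on reading, and writing with plain utf-8 yields BOM-less output (this is a sketch, not part of the PowerShell solutions; newline="" disables newline translation, preserving each file's original CRLF vs. LF format):

```python
import glob

# Concatenate all file*.txt as text: decode with utf-8-sig so any
# leading BOM is dropped, then re-encode as BOM-less UTF-8.
# newline="" preserves CRLF vs. LF newlines as-is.
with open("combined.txt", "w", encoding="utf-8", newline="") as out:
    for name in sorted(glob.glob("file*.txt")):
        with open(name, encoding="utf-8-sig", newline="") as f:
            out.write(f.read())
```

Swapping encoding="utf-8" on the output file for another codec (e.g. "utf-16") performs the transcoding that the -Encoding parameters enable in PowerShell.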
In the simplest case - if the individual files fit into memory as a whole (which may not work for you) - you can use Get-Content's -Raw switch[1] to read a file's content into memory as a single, multi-line string, which is fast.
While a line-by-line streaming solution (omitting -Raw) is possible, it comes with caveats:
it will be slow, because PowerShell decorates each line read with metadata, which is both time-consuming and memory-intensive.
information about the input file's newline format (Windows-format CRLF vs. Unix-format LF) is invariably lost, including whether a trailing newline was present.
Note that Get-Content needs no -Encoding argument below, because both PowerShell editions directly recognize UTF-8 files with a BOM. In Windows PowerShell, unfortunately, file-writing cmdlets do not support writing UTF-8 files without a BOM - -Encoding utf8 invariably creates files with a BOM - so assistance from .NET APIs is needed:
# Determine the output file, as a *full path*, because
# .NET's working dir. usually differs from PowerShell's.
$outFile = Join-Path ($PWD | Convert-Path) combined.txt
# Create the output file, initially empty.
$null = New-Item -Path $outFile
Get-ChildItem -Filter file*.txt |
  ForEach-Object {
    # BOM-less UTF-8 is [IO.File]::AppendAllText()'s default encoding.
    [IO.File]::AppendAllText($outFile, ($_ | Get-Content -Raw))
  }
In PowerShell (Core) 7+, not only does -Encoding utf8 now produce BOM-less UTF-8 files (you can request with-BOM files with -Encoding utf8BOM), BOM-less UTF-8 is now the consistent default, so the solution simplifies to:
Get-ChildItem -Filter file*.txt |
Get-Content -Raw |
Set-Content -LiteralPath combined.txt # BOM-less UTF-8 implied.
[1] Despite its name, this switch does not result in reading raw bytes. Instead, it bypasses the default line-by-line reading in favor of reading the entire file content at once, into a single string.