2

On my Windows 10 PC, there are three files of 10 GB each that I want to merge via `cat file_name_prefix* >> some_file.zip`. However, the output file had grown to 38 GB before I aborted the operation with Ctrl+C. Is this expected behavior? If not, where am I making a mistake?

asymmetryFan
  • Have you looked at what `cat` does? It is an alias for `Get-Content`; the output is far more than just the lines of text in the file. Plus, it isn't meant for binary files at all. – Lee_Dailey Mar 14 '21 at 23:53
  • Huh, that's odd. I'd like to see what someone on here has to say about this, so I'm following this post now. Also, `cat` is just an alias for the `Get-Content` cmdlet. – Abraham Zinala Mar 14 '21 at 23:55

3 Answers

3

`cat` is an alias of `Get-Content`, which assumes text files by default - the output size is probably due to this conversion. You can try adding the `-Raw` switch for binary files - this might work? (Not sure.)

It's definitely possible to "cat" binary files together in a CMD shell using the `copy` command, like below.

copy /b part1.bin+part2.bin+part3.bin some_file.zip

(The three `part*.bin` files are the parts to be combined into `some_file.zip`.)
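
If you're running this from PowerShell rather than a cmd.exe prompt, the same thing should work when prefixed with `cmd /c`, since `copy` is a command internal to `cmd.exe` (a short sketch reusing the part names from above):

cmd /c copy /b part1.bin+part2.bin+part3.bin some_file.zip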

MisterSmith
  • `-Raw` by itself does _not_ help, but there is `-Encoding Byte` (Windows PowerShell) and `-AsByteStream` (PowerShell (Core) 7+) for byte handling - see [this answer](https://stackoverflow.com/a/1783725/45375) for an example. – mklement0 Mar 15 '21 at 15:36
  • Also worth noting that in order to run your command _from PowerShell_, `cmd /c` must be prepended (`copy` is a command that is _internal_ to `cmd.exe`). – mklement0 Mar 15 '21 at 15:38
3

PowerShell's `cat`, a.k.a. `Get-Content`, reads text file content into an array of strings by default. It also checks the file for a BOM in order to handle encodings properly if you don't specify a charset. That means it won't work with binary files.

To combine binary files in PowerShell 6+ you need to use the `-AsByteStream` parameter:

Get-Content -AsByteStream file_name_prefix* |
    Set-Content -AsByteStream some_file.zip
# or
Get-Content -AsByteStream file1, file2, file3 |
    Set-Content -AsByteStream some_file.zip

Older PowerShell (Windows PowerShell 5.1 and earlier) doesn't have that parameter, and `-Raw` is not a substitute: it exists only on `Get-Content`, not on `Set-Content`, and it still treats the data as text, so it's unsuitable for binary content. There you need `-Encoding Byte` on both cmdlets instead:

Get-Content -Encoding Byte file_name_prefix* | Set-Content -Encoding Byte some_file.zip

However, this will be very slow, because every byte is passed through the pipeline as a separate object. For speed you'll need to use other solutions, like calling .NET or Win32 APIs directly from PowerShell.
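
For illustration, here's a minimal sketch of one such faster alternative that avoids the per-byte pipeline by using .NET's System.IO streams (rather than raw Win32 calls); the file names are the ones from the question, the rest is an assumption:

# Stream each part into the destination in large buffered chunks instead of
# pushing individual bytes through the PowerShell pipeline.
# Note: .NET resolves relative paths against the process working directory,
# so build full paths explicitly.
$out = [System.IO.File]::Create("$PWD\some_file.zip")
try {
    foreach ($part in Get-Item file_name_prefix* | Sort-Object Name) {
        $in = [System.IO.File]::OpenRead($part.FullName)
        try {
            $in.CopyTo($out)   # buffered copy of the whole file
        }
        finally {
            $in.Dispose()
        }
    }
}
finally {
    $out.Dispose()
}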


phuclv
1

It is probably going into a loop: with the glob wildcard, the result file itself can match the pattern, so the command keeps concatenating all matching files, including the growing result, into the result file.

You can avoid this by adding an extension to the glob, temporarily saving the result under another extension, and then renaming it to the correct one. (As suggested in: https://stackoverflow.com/a/53079166/12657997)

E.g. when you have 3 files:

  • a.txt with a inside
  • b.txt with b inside
  • c.txt with c inside

cat *.txt > res.csv ; mv res.csv res.txt

cat .\res.txt
a
b
c

Edit

This cat command (as shown above), in combination with the output redirection >, will also increase the size of the result file, as @mklement0 points out.

According to the documentation (https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.management/get-content?view=powershell-7.1):

-Encoding

Specifies the type of encoding for the target file. The default value is utf8NoBOM.

However, the output redirection applies its own encoding on top of that (UTF-16LE "Unicode" by default in Windows PowerShell), as explained in this post: https://stackoverflow.com/a/40098904/12657997

To illustrate this, I've converted a.txt, b.txt and c.txt to zip files (so they are now in a binary format).

cat -Encoding Byte *.zip > res.csv ; mv res.csv res2.txt
cat -Raw *.zip > res.csv ; mv res.csv res3.txt

ls .
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       15/03/2021     21:29            109 a.zip
-a----       15/03/2021     21:29            109 b.zip
-a----       15/03/2021     21:29            109 c.zip
-a----       15/03/2021     21:39           2282 res2.txt
-a----       15/03/2021     21:41            668 res3.txt

We can see that the output roughly doubles in size for res3.txt: the three inputs total 3 × 109 = 327 bytes, and for every byte read the UTF-16 output writes two (plus a BOM and added newlines), giving 668 bytes.

The -Encoding Byte output (res2.txt), in combination with the output redirection, is even worse: each byte ends up formatted as its decimal text representation on its own line before being re-encoded, giving 2282 bytes for 327 bytes of input.
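
For the text-file case, a small sketch of how to keep the redirection's default encoding out of the picture: pipe to Out-File with an explicit -Encoding instead of using >, reusing the rename trick from the example above (same a/b/c file names):

cat *.txt | Out-File -Encoding utf8 res.csv ; mv res.csv res.txt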

Woody
    That's definitely _one_ pitfall (+1), but even with that out of the picture, another problem is the fact that `Get-Content` by default interprets its input as _text_ and that `>` / `>>` (effective aliases of the `Out-File` cmdlet) also applies a _character encoding_ on output. In Windows PowerShell, `>` / `>>` use UTF-16LE (!; "Unicode") by default, which has the potential to double the size of the original input. – mklement0 Mar 15 '21 at 14:49
    @mklement0 you are right. I'll update the answer to reflect that it's for text files (how it is shown in the answer), not for binary files. – Woody Mar 15 '21 at 20:33