
I am trying to redirect input in PowerShell by:

Get-Content input.txt | my-program args

The problem is that the piped UTF-8 text is preceded by a BOM (0xEFBBBF), and my program cannot handle that correctly.

A minimal working example:

// File: Hex.java
import java.io.IOException;

public class Hex {
    public static void main(String[] dummy) {
        int ch;
        try {
            while ((ch = System.in.read()) != -1) {
                System.out.print(String.format("%02X ", ch));
            }
        } catch (IOException e) {
            // Ignored: treat a read failure like end of input.
        }
    }
}

Then in PowerShell:

javac Hex.java
Set-Content textfile "ABC" -Encoding Ascii
# Now the content of textfile is 0x41 42 43 0D 0A
Get-Content textfile | java Hex

Or simply

javac Hex.java
Write-Output "ABC" | java Hex

In either case, the output is EF BB BF 41 42 43 0D 0A.
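For what it's worth, reading the stream as text instead of bytes would not help either: Java's UTF-8 decoder does not strip a leading BOM, it just surfaces as the character U+FEFF. A small standalone illustration (the class name `BomDecode` is made up for this example, not part of the pipeline above):

```java
import java.nio.charset.StandardCharsets;

public class BomDecode {
    public static void main(String[] args) {
        // The bytes the program receives: UTF-8 BOM followed by "ABC".
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41, 0x42, 0x43 };
        String s = new String(withBom, StandardCharsets.UTF_8);
        // Java keeps the BOM as the first character, U+FEFF.
        System.out.printf("length=%d first=U+%04X%n", s.length(), (int) s.charAt(0));
        // prints: length=4 first=U+FEFF
    }
}
```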

How can I pipe the text into the program without 0xEFBBBF?

user
  • what happens if you cast it as string explicitly? `[string](Get-Content input.txt) | my-program args` – Guenther Schmitz Feb 08 '20 at 07:52
  • @GuentherSchmitz The BOM is still there, but all newlines except the last one are converted to spaces. By the way, although my input.txt uses LF, the content passed via the pipeline uses CR LF in both cases. – user Feb 08 '20 at 08:02
  • There is a big difference between `(Get-Content input.txt) |` which passes a single string (with newlines) to the pipeline and `Get-Content input.txt |` which passes multiple strings to the pipeline (where each string represents a line). Note that if you pass this to a variable (`String[]`) it might be separated by a space or a newline depending on *how* you display it. Also note that for the latter syntax your `my-program` needs to be able to `process` each individual item in the pipeline. Given the details in your question I doubt whether your program is actually doing that. – iRon Feb 08 '20 at 08:39
  • @iRon I do not quite understand that "`my-program` needs to be able to `process` each item in the pipeline". In my case the program is a Java program reading stdin, and the parentheses do not change the text read. – user Feb 08 '20 at 08:49
  • With all respect, I think you are too convinced that the issue is due to the BOM and not something else. Normally `Get-Content` doesn't pass any BOM information. If you are not in a normal environment, you should supply details like OS, PowerShell version, etc. To confirm that you are really retrieving BOM information with `Get-Content` from your file, please show us the first few lines of your file: `Get-Content .\Bom.txt | Select -First 3 | ForEach-Object { "$([Byte[]]$_.ToCharArray())" }`. Please add these details **to the question**, see also: [mcve]. – iRon Feb 08 '20 at 12:58
  • A comment in powershell starts with '#'. Nice example otherwise. I don't get a bom in osx. – js2010 Feb 08 '20 at 15:34
  • Thanks for adding the details, it makes the issue more clear. Unfortunately I am not able to adequately respond right now, but believe your issue is related to this issue/answer: https://stackoverflow.com/a/40098904/1701026 – iRon Feb 08 '20 at 15:35
  • I can't reproduce your issue. In windows 10 with ps 5.1 I get `41 42 43 0D 0A`. It doesn't matter what the encoding of the file is. What os and powershell version and java version are you? PS 6 & 7 do the same. – js2010 Feb 08 '20 at 15:42
  • @js2010 It is windows 10 version 2004, ps 5.1 and java 13. I will test on some other platforms later. – user Feb 08 '20 at 15:56
  • @iRon I read it before I asked here. That is about encoding of files and I don't know how that method can be used for pipelines. – user Feb 08 '20 at 15:59
  • And you're in the powershell console, not the ise or vscode or windows terminal? – js2010 Feb 08 '20 at 16:18
  • @js2010 The problem exists in both powershell console and vscode for me. Java doesn't seem wrong, a c program can also read the bom. This does not happen in powershell 6+. Going to test on other versions of windows 10. – user Feb 09 '20 at 03:14
  • @user When PowerShell outputs to an external program (most .exe files), `$OutputEncoding` is used to determine the encoding. You could try `$OutputEncoding = [System.Text.UTF8Encoding]::new($false)` before performing your commands. – AdminOfThings Feb 09 '20 at 13:23
  • `Set-Content 'D:\textfile.txt' "ABC" -Encoding Ascii; Get-Content 'D:\textfile.txt' -Encoding Byte | ForEach-Object { '{0:X2}' -f $_ }` returns `41` `42` `43` `0D` `0A`. No BOM whatsoever. As said in my answer, check the OutputEncoding you have set in PowerShell and change that to use UTF-8 without BOM if needed. – Theo Feb 09 '20 at 13:54
  • P.S. Did you by any chance 'hack' the codepage with `chcp 65001` at some point? In that case, I recommend turning that back to `chcp 5129` for **English - New Zealand**. See [here](https://webcheatsheet.com/html/character_sets_list.php) – Theo Feb 09 '20 at 14:01
  • That version of Windows 10 is very new. Does this show the BOM? `get-content textfile | format-hex`. It doesn't for me in osx, even if the file has a bom. I'm in ps 7 rc1 though. – js2010 Feb 09 '20 at 16:47
  • @js2010, your command doesn't apply: `Get-Content` _never_ sends a BOM through the pipeline, and `Format-Hex` is a PowerShell command, not an external program such as `java`. A BOM may appear for _unrelated_ reasons, irrespective of where the data came from: It can appear as a side effect of setting `$OutputEncoding` to an encoding _with a BOM_, which causes PowerShell to encode the string sent to _external programs_ with that BOM; AdminOfThings' comment shows the solution that _should_ work in a normal PS environment (there's something unusual going on on _one_ of user's machines). – mklement0 Feb 10 '20 at 14:46
  • I even tried win10 2004. No bom. Maybe you made a special config to $outputencoding. Check your $profile. – js2010 Feb 11 '20 at 21:25
  • @mklement0 You're right. The bom didn't show up in format-hex, even with [Text.Encoding]::Utf8 in windows 10 2004. My money is on him setting that encoding in his $profile. The $profile is different in ps core. – js2010 Feb 12 '20 at 19:26

3 Answers


Note:
The following contains general information that in a normally functioning PowerShell environment would explain the OP's symptom. That the solution doesn't work in the OP's case is owed to machine-specific causes that are unknown at this point.
This answer is about sending BOM-less UTF-8 to an external program; if you're looking to make your PowerShell console windows use UTF-8 in all respects, see this answer.

To ensure that your Java program receives its input UTF-8-encoded without a BOM, you must set $OutputEncoding to a System.Text.UTF8Encoding instance that does not emit a BOM:

# Assigns UTF-8 encoding *without a BOM*.
# PowerShell uses this encoding to encode data piped to external programs.
# $OutputEncoding defaults to ASCII(!) in Windows PowerShell, and more sensibly
# to BOM-*less* UTF-8 in PowerShell [Core] v6+
$OutputEncoding = [Text.UTF8Encoding]::new($false)

Caveats:

  • Do NOT use the seemingly equivalent New-Object Text.Utf8Encoding $false, because, due to the bug described in GitHub issue #5763, it won't work if you assign to $OutputEncoding in a non-global scope, such as in a script. In PowerShell v4 and below, use (New-Object Text.Utf8Encoding $false).psobject.BaseObject as a workaround.

  • Windows 10 version 1903 and up allow you to set BOM-less UTF-8 as the system-wide default encoding (although note that the feature is still classified as beta as of version 20H2) - see this answer; [fixed in PowerShell 7.1] in PowerShell [Core] up to v7.0, with this feature turned on, the above technique is not effective, due to a presumptive .NET Core bug that causes a UTF-8 BOM always to be emitted, irrespective of what encoding you set $OutputEncoding to (the bug is possibly connected to GitHub issue #28929); the only solution is to turn the feature off, as shown in imgx64's answer.

If, by contrast, you use [Text.Encoding]::Utf8, you'll get a System.Text.Encoding.UTF8 instance with a BOM - which is what I suspect happened in your case.


Note that this problem is unrelated to the source encoding of any file read by Get-Content, because what is sent through the PowerShell pipeline is never a stream of raw bytes, but .NET objects, which in the case of Get-Content means that .NET strings are sent (System.String, internally a sequence of UTF-16 code units).

Because you're piping to an external program (a Java application, in your case), PowerShell character-encodes the (stringified-on-demand) objects sent to it based on preference variable $OutputEncoding, and the resulting encoding is what the external program receives.

Perhaps surprisingly, even though BOMs are typically only used in files, PowerShell respects the BOM setting of the encoding assigned to $OutputEncoding also in the pipeline, prepending it to the first line sent (only).

See the bottom section of this answer for more information about how PowerShell handles pipeline input for and output from external programs, including how it is [Console]::OutputEncoding that matters when PowerShell interprets data received from external programs.


To illustrate the difference using your sample program (note how using a PowerShell string literal as input is sufficient; no need to read from a file):

# Note the EF BB BF sequence representing the UTF-8 BOM.
# Enclosure in & { ... } ensures that a local, temporary copy of $OutputEncoding
# is used.
PS> & { $OutputEncoding = [Text.Encoding]::Utf8; 'hö' | java Hex }
EF BB BF 68 C3 B6 0D 0A

# Note the absence of EF BB BF, due to using a BOM-less
# UTF-8 encoding.
PS> & { $OutputEncoding = [Text.Utf8Encoding]::new($false); 'hö' | java Hex }
68 C3 B6 0D 0A

In Windows PowerShell, where $OutputEncoding defaults to ASCII(!), you'd see the following with the default in place:

# The default of ASCII(!) results in *lossy* encoding in Windows PowerShell.
PS> 'hö' | java Hex 
68 3F 0D 0A

Note that 3F represents the literal ? character, which is what the non-ASCII ö character was transliterated to, given that it has no representation in ASCII; in other words: information was lost.
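This lossy replacement can be reproduced on the Java side as well, because Java's US-ASCII encoder likewise substitutes ? for unmappable characters. A small sketch (the class name AsciiLoss is made up for illustration; it reproduces only the character-level replacement, not the trailing CR LF that PowerShell appends):

```java
import java.nio.charset.StandardCharsets;

public class AsciiLoss {
    public static void main(String[] args) {
        // "hö", written with an escape so the source file's own encoding doesn't matter.
        byte[] bytes = "h\u00F6".getBytes(StandardCharsets.US_ASCII);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        // The unmappable ö becomes the replacement byte 3F ('?').
        System.out.println(sb.toString().trim()); // prints: 68 3F
    }
}
```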

PowerShell [Core] v6+ now sensibly defaults to BOM-less UTF-8, so the default behavior there is as expected.
While BOM-less UTF-8 is PowerShell [Core]'s consistent default, also for cmdlets that read from and write to files, on Windows [Console]::OutputEncoding still reflects the active OEM code page by default as of v7.0, so to correctly capture output from UTF-8-emitting external programs, it must be set to [Text.UTF8Encoding]::new($false) as well - see GitHub issue #7233.

mklement0
  • I cannot reproduce in a virtual machine (Windows 10 build 1909) either. Probably something is wrong on my end. Thank you anyway. Just for reference, it is Windows 10 build 19559, Powershell 5.1.19559.1000. The problem happens in PS console, vscode console and ISE, but not for Powershell v6+. – user Feb 10 '20 at 04:43
  • Thanks, @user. Intriguing symptom; do tell us if you ever find the cause. – mklement0 Feb 10 '20 at 05:01

You could try setting the OutputEncoding to UTF-8 without BOM:

# Keep the current output encoding in a variable
$oldEncoding = [console]::OutputEncoding

# Set the output encoding to use UTF8 without BOM
[console]::OutputEncoding = New-Object System.Text.UTF8Encoding $false

Get-Content input.txt | my-program args

# Reset the output encoding to the previous
[console]::OutputEncoding = $oldEncoding

If the above has no effect and your program does understand UTF-8, but only expects it to be without the 3-byte BOM, then you can try removing the BOM from the content and piping the result to your program:

(Get-Content 'input.txt' -Raw -Encoding UTF8) -replace '^\xef\xbb\xbf' |  my-program args
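Alternatively, the receiving program itself can be made tolerant of a BOM, as discussed in the comments. A sketch of a BOM-skipping variant of the asker's Hex program (HexNoBom and skipBom are hypothetical names; note that a single short read from a pipe could in principle return fewer than 3 bytes, which this sketch handles by pushing back whatever was read):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class HexNoBom {
    // Return a stream positioned just past a leading UTF-8 BOM, if one is present;
    // otherwise the bytes read for the check are pushed back unconsumed.
    static InputStream skipBom(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = in.read(head, 0, 3);
        boolean bom = n == 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            in.unread(head, 0, n);
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = skipBom(System.in);
        int ch;
        while ((ch = in.read()) != -1) {
            System.out.print(String.format("%02X ", ch));
        }
    }
}
```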

If ever you have 'hacked' the codepage with chcp 65001, I recommend turning that back to chcp 5129 for English - New Zealand. See here.

Theo
  • @user Please try the alternative I have edited in. Since we do not know your `my-program` and have no idea what the parameter `args` does, it is difficult to understand what is going on there. – Theo Feb 08 '20 at 11:59
  • Doesn't work either. I do not think my program is the problem because I wrote a small program that merely prints the hexadecimal value of each byte from stdin, and found that it prints EF BB BF first. – user Feb 08 '20 at 12:22
  • @user Then I suggest you edit that program and have it detect if a BOM is present and if so skip those bytes. You can then also include other encoding byte order marks: `\x00\x00\xfe\xff UTF-32 big-endian /// \xff\xfe\x00\x00 UTF-32, little-endian /// \xfe\xff UTF-16, big-endian /// \xff\xfe UTF-16, little-endian /// \xef\xbb\xbf UTF-8`. This will make for a much more useful utility, I would say, because as it turns out now it can only handle UTF-8 without a BOM. – Theo Feb 08 '20 at 12:36
  • `+1` for the information about changing encoding from within PowerShell. It apparently doesn't fix this issue, but it might be the best answer to this one: [How to store directional arrow character in a variable](https://stackoverflow.com/q/60093096/1701026) – iRon Feb 08 '20 at 13:57
  • @iRon Thanks. I didn't see that question before, but I do believe the answer for that one is using the outputencoding. (cannot test because I don't have tshark.exe) – Theo Feb 08 '20 at 15:28
  • `[Console]::OutputEncoding` only matters with respect to how PowerShell interprets output _from_ external programs, whereas what matters here is how PowerShell encodes data sent _to_ external programs, which is controlled by the `$OutputEncoding` preference variable. Note that if `input.txt` has a UTF-8 BOM, what `Get-Content` reads into .NET string (before further processing) - sensibly - does _not_ have this BOM. – mklement0 Feb 10 '20 at 02:29
  • @mklement0 Thanks for explaining. Pretty confusing naming there, so I tend to not keep them apart.. – Theo Feb 10 '20 at 11:15
  • Fully agreed re the confusing naming; also, arguably it should be a _single_ preference variable that controls the encoding for both sending to and receiving from external programs. – mklement0 Feb 10 '20 at 12:51

Although mklement0's answer worked for me on one PC, it didn't work on another PC.

The reason was that I had the Beta: Use Unicode UTF-8 for worldwide language support checkbox selected in Language → Administrative language settings → Change system locale.

I unchecked it and now $OutputEncoding = [Text.UTF8Encoding]::new($false) works as expected.

It's odd that enabling it forces BOM, but I guess it's beta for a reason.

Uncheck Beta: Use Unicode UTF-8 for worldwide language support

imgx64