I have a native program written in Python that expects its input on stdin. As a simple example,

#!python3
import sys
# Open for writing; the default mode 'r' would make f.write() fail.
with open('foo.txt', 'w', encoding='utf8') as f:
    f.write(sys.stdin.read())

I want to be able to pass a (PowerShell) string to this program as standard input. Python expects its standard input in the encoding specified in $env:PYTHONIOENCODING, which I will typically set to UTF8 (so that I don't get any encoding errors).
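To make the dependency on the declared encoding concrete, here is a small sketch that decodes the same bytes the way a text-mode stdin would under two different encodings. The helper name `read_text` is made up for illustration, and the byte string stands in for what a UTF-8 producer would write (ignoring the `\r\n` line ending used on Windows).

```python
import io


def read_text(raw: bytes, encoding: str) -> str:
    # Hypothetical helper: decode bytes the way a text-mode stream
    # opened with the given encoding would.
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding)
    return stream.read()


# Bytes a UTF-8 producer writes for print('\N{EURO SIGN}')
euro_bytes = "\N{EURO SIGN}\n".encode("utf-8")

as_utf8 = read_text(euro_bytes, "utf-8")
as_cp1252 = read_text(euro_bytes, "cp1252")

print(ascii(as_utf8))    # '\u20ac\n'
print(ascii(as_cp1252))  # '\xe2\u201a\xac\n'
```

The bytes are identical in both calls; only the encoding the reader assumes differs, so producer and consumer have to agree on it.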

But no matter what I do, characters get corrupted. I've searched the net and found suggestions to change [Console]::InputEncoding/[Console]::OutputEncoding, or to use chcp, but nothing seems to work.

Here's my basic test:

PS >[Console]::OutputEncoding.EncodingName
Unicode (UTF-8)
PS >[Console]::InputEncoding.EncodingName
Unicode (UTF-8)
PS >$env:PYTHONIOENCODING
utf-8
PS >python -c "print('\N{Euro sign}')" | python -c "import sys; print(sys.stdin.read())"
´╗┐?

PS >chcp 1252
Active code page: 1252
PS >python -c "print('\N{Euro sign}')" | python -c "import sys; print(sys.stdin.read())"
?

PS >chcp 65001
Active code page: 65001
PS >python -c "print('\N{Euro sign}')" | python -c "import sys; print(sys.stdin.read())"
 ?

How can I fix this problem?

I can't even explain what's going on here. Basically, I want the test (python -c "print('\N{Euro sign}')" | python -c "import sys; print(sys.stdin.read())") to print out a Euro sign, and I want to understand why the steps needed to get that to work are necessary :-) (Then I can carry that knowledge over to my real scenario, which is writing pipelines of Python programs that don't break when they encounter Unicode characters.)

Paul Moore
  • Have you tried setting `$OutputEncoding`? – Mike Zboray Sep 03 '14 at 13:38
  • Even worse: ```>$OutputEncoding = [Text.Encoding]::UTF8 >$env:PYTHONIOENCODING="utf-8" >python -c "print('\N{Euro sign}')" | python -c "import sys; print(sys.stdin.read())" ∩╗┐╬ô├⌐┬╝``` (Sorry about the formatting, I can't get newlines in a comment...) – Paul Moore Sep 03 '14 at 15:22
  • Ah, but if I *also* set [Console]::OutputEncoding to UTF8, this seems to work! Can you explain why? I'm not clear why I need to set the value twice... – Paul Moore Sep 03 '14 at 15:26
  • Also, something seems to add a space at the start (presumably a BOM). How do I avoid that? – Paul Moore Sep 03 '14 at 15:30
  • Ok, that makes sense I guess. `[Console]::OutputEncoding` is definitely different than $OutputEncoding. This [blog post](http://blogs.msdn.com/b/powershell/archive/2006/12/11/outputencoding-to-the-rescue.aspx) was where I got the idea. – Mike Zboray Sep 03 '14 at 15:30
  • `Encoding.UTF8` includes a BOM. I think you would have to do `new-object System.Text.UTF8Encoding $false` to get a non-BOM encoding. – Mike Zboray Sep 03 '14 at 15:32
  • Nice link, thanks. Now if I can find someone who can explain *why* this all works... :-) – Paul Moore Sep 03 '14 at 15:33
  • @PaulMoore: (1) don't put the answer into your question. You could [post your own answer](http://stackoverflow.com/help/self-answer). (2) I see **two** steps here: a) pipe Unicode from/to python via standard streams b) print it to the console. I'm not sure that `a` step is broken: what is `print(ascii(sys.stdin.read()))`? Step `b`: [the mojibake may be explained by incompatible encodings (python - utf-8, console sth. else). To print to Windows console, you could use `win-unicode-console` packages](http://stackoverflow.com/a/31949236/4279) (though I don't know whether it applies to PowerShell). – jfs Aug 19 '15 at 11:28

1 Answer

Thanks to mike z, the following works:

$OutputEncoding = [Console]::OutputEncoding = (new-object System.Text.UTF8Encoding $false)
$env:PYTHONIOENCODING = "utf-8"
python -c "print('\N{Euro sign}')" | python -c "import sys; print(sys.stdin.read())"

The new-object is needed to get a UTF-8 encoding without a BOM. The $OutputEncoding variable and [Console]::OutputEncoding both appear to need to be set.
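What the BOM does to the data can be seen from the Python side. This is only a sketch of the byte-level effect, not of anything PowerShell does internally. The legacy code pages cp850 and cp437 are assumptions, chosen because they reproduce the leading junk characters in the outputs shown in the question and comments.

```python
import codecs

euro = "\N{EURO SIGN}"
without_bom = euro.encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

print(without_bom)  # b'\xe2\x82\xac'
print(with_bom)     # b'\xef\xbb\xbf\xe2\x82\xac'

# The plain 'utf-8' codec keeps the BOM as a leading U+FEFF character;
# 'utf-8-sig' strips it if present.
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
print(ascii(kept))      # '\ufeff\u20ac'
print(ascii(stripped))  # '\u20ac'

# The three BOM bytes rendered through legacy OEM code pages (assumed here).
bom_cp850 = codecs.BOM_UTF8.decode("cp850")  # the characters ´╗┐
bom_cp437 = codecs.BOM_UTF8.decode("cp437")  # the characters ∩╗┐
print(ascii(bom_cp850))  # '\xb4\u2557\u2510'
print(ascii(bom_cp437))  # '\u2229\u2557\u2510'
```

If the sending side cannot be changed, reading with the `utf-8-sig` codec on the receiving side is another way to drop the BOM.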

I still don't fully understand the difference between the two encoding values, and why you would ever have them set differently (which appears to be the default).
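One reading that is consistent with the garbled output quoted in the comments is that PowerShell decodes a native program's output using [Console]::OutputEncoding and then encodes what it pipes into the next native program using $OutputEncoding. The Python sketch below simulates that two-step conversion. The function `simulate_pipeline`, its parameters, and the choice of cp437 as the console code page are assumptions made for illustration, not PowerShell APIs.

```python
import codecs


def simulate_pipeline(text, console_output_encoding, output_encoding, add_bom):
    # Hypothetical model of: python | python inside PowerShell.
    produced = text.encode("utf-8")                      # first Python writes UTF-8
    shell_string = produced.decode(console_output_encoding)  # shell decodes the output
    sent = shell_string.encode(output_encoding)          # shell re-encodes for the pipe
    if add_bom:
        sent = codecs.BOM_UTF8 + sent                    # UTF-8 encoding with BOM
    return sent.decode("utf-8")                          # second Python reads UTF-8


euro = "\N{EURO SIGN}"

# Only $OutputEncoding set to UTF-8 (with BOM); console side still cp437.
received = simulate_pipeline(euro, "cp437", "utf-8", add_bom=True)
# The second Python prints UTF-8 bytes, which a cp437 console then displays.
shown = received.encode("utf-8").decode("cp437")
print(ascii(shown))  # the characters ∩╗┐╬ô├⌐┬╝

# Both sides set to BOM-less UTF-8: the text survives the round trip.
fixed = simulate_pipeline(euro, "utf-8", "utf-8", add_bom=False)
print(ascii(fixed))  # '\u20ac'
```

Under these assumptions the first case reproduces the `∩╗┐╬ô├⌐┬╝` output from the comments character for character, which suggests the two settings govern opposite directions of the pipe: one for reading a program's output, the other for writing a program's input.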

Paul Moore