
This is driving me somewhat nutty at the moment. It is clear from my last few days of research that Unicode is a complex topic. But here is behavior that I do not know how to address.

If I read a file with non-ASCII characters from disk and write it back to a file, everything works as planned. However, when I read the same file from sys.stdin, it does not work and the non-ASCII characters are not encoded properly. The sample code is here:

# -*- coding: utf-8 -*-
import sys

# Read the file from disk and copy it: this works as expected.
with open("testinput.txt", "r") as ifile:
    lines = ifile.read()

with open("testout1.txt", "w") as ofile:
    for line in lines:
        ofile.write(line)

# Copy from sys.stdin instead: the non-ASCII characters get mangled.
with open("testout2.txt", "w") as ofile:
    for line in sys.stdin:
        ofile.write(line)

The input file testinput.txt is this:

を
Sōten_Kōro

When I run the script from the command line as `cat testinput.txt | python test.py`, I get the following output respectively:

testout1.txt:

を
Sōten_Kōro

testout2.txt:

???
S??ten_K??ro

Any ideas on how to address this would be of great help. Thanks. Paul.

  • Are you sure that `cat` supports UTF8? Check [Characters encodings supported by more, cat and less](https://unix.stackexchange.com/questions/78776/characters-encodings-supported-by-more-cat-and-less). `?` is used as an error character when trying to read ASCII data using a codepage that has no corresponding character. The data was mangled when `cat` itself read the data and sent it to stdout – Panagiotis Kanavos Jan 15 '19 at 15:07
  • The reason your ASCII files worked was they happened to be in a codepage compatible with your `LANG`. If they were in an incompatible codepage you'd get question marks too. – Panagiotis Kanavos Jan 15 '19 at 15:09
  • Please add the appropriate tag for your OS to your question. I think it matters. – martineau Jan 15 '19 at 15:25
  • @PanagiotisKanavos but this should not matter here. `cat` will copy character by character. It would only fail if there were an invalid character. [About your first comment; the second comment is correct] – Giacomo Catenazzi Jan 15 '19 at 16:38
  • @GiacomoCatenazzi on the contrary, it will try to read and output the characters using the configured codepage. If a character value isn't valid in that codepage, it will be replaced by `?`. In any case, you can simply try `cat` and check the output – Panagiotis Kanavos Jan 15 '19 at 16:39
  • @PanagiotisKanavos: ok, so maybe it depends on the OS. On my computer `LANG=C cat a | cat` gives the correct answer. Some time ago the `C` locale was still not UTF-8, just 7-bit ASCII – Giacomo Catenazzi Jan 15 '19 at 16:47
  • @GiacomoCatenazzi C still doesn't have a UTF8 type. There are UTF16 and UTF32 types but no specific UTF8 type which is why such problems occur in the first place. Programs are supposed to keep using the same `char` strings they used for ASCII. – Panagiotis Kanavos Jan 15 '19 at 16:49
  • @PanagiotisKanavos: So I do not understand. On my system the special characters should be outside 7-bit ASCII, but it does not substitute them with `?`. So when will `cat` substitute characters? I think I misinterpreted your comment – Giacomo Catenazzi Jan 15 '19 at 16:53
  • The tag says Windows but `cat` is a *nix utility. Which is it? – Mark Ransom Jan 15 '19 at 18:10
  • `cat` on windows – Paul Jan 15 '19 at 18:23
  • `cat` must be a Windows 10 thing, I don't have it on Windows 7. – Mark Ransom Jan 16 '19 at 19:10

2 Answers


The reason is that you took a shortcut, which should never be taken.

You should always define an encoding. So when you read a file, you should specify that you are reading UTF-8, or whatever the encoding is. Or make it explicit that you are reading binary data.

In your case, the Python interpreter will use UTF-8 as the default encoding when reading from files, because this is the default in Linux and macOS.

But when you read from standard input, the default is defined by the locale encoding, or by the `PYTHONIOENCODING` environment variable.

I refer to *How to change the stdin encoding on Python* for how to solve this. This answer just explains the cause.
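To illustrate making the decoding explicit, here is a minimal sketch. In Python 3 you would wrap `sys.stdin.buffer`; an in-memory byte stream stands in for it here so the snippet is self-contained:

```python
import io

# An in-memory UTF-8 byte stream, standing in for sys.stdin.buffer.
raw = io.BytesIO("Sōten_Kōro\n".encode("utf-8"))

# Wrap the byte stream with an explicit decoder instead of relying
# on whatever the locale or console codepage happens to be.
reader = io.TextIOWrapper(raw, encoding="utf-8")

for line in reader:
    print(line, end="")  # decoded text, independent of LANG
```

With the decoder stated explicitly, the same bytes decode identically no matter what environment the script runs in.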

Giacomo Catenazzi

Thanks for the pointers. I have landed on the following implementation based on @GiacomoCatenazzi's answer and reference:

# -*- coding: utf-8 -*-
import sys
import codecs

# Reading from a file on disk still works as before.
with open("testinput.txt", "r") as ifile:
    lines = ifile.read()

with open("testout1.txt", "w") as ofile:
    for line in lines:
        ofile.write(line)

# Wrap sys.stdin so the incoming bytes are explicitly decoded as
# UTF-8, independent of the locale or console codepage.
UTF8Reader = codecs.getreader('utf-8')
sys.stdin = UTF8Reader(sys.stdin)

with open("testout2.txt", "w") as ofile:
    for line in sys.stdin:
        # The output file is byte-oriented, so the decoded unicode
        # line is encoded back to UTF-8 bytes before writing.
        ofile.write(line.encode('utf-8'))

I am, however, not sure why it is necessary to encode again after using codecs.getreader?
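The decode-on-input / encode-on-output symmetry can be seen in isolation (a minimal sketch in Python 3 syntax; in Python 2 the same applies to `str` and `unicode`):

```python
# Bytes arriving from a pipe must be decoded to text, and text
# leaving for a byte-oriented file must be encoded again - two
# halves of one symmetric process.
raw = "Sōten_Kōro".encode("utf-8")  # bytes, as read from stdin
text = raw.decode("utf-8")           # decode on the way in
out = text.encode("utf-8")           # encode on the way out
assert out == raw
```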

Paul

  • Byte streams need decoding when read in, and encoding when written out - it's a symmetric process. One of the reasons for the dramatic changes in Unicode behavior in Python 3 is to cut down the chances for confusion or incorrect operation. – Mark Ransom Jan 15 '19 at 17:47
  • @Paul, for Python 3 in Windows, note that the default encoding for a non-console file (generally a pipe or disk file) is the system ANSI codepage, such as codepage 1252 in Western Europe. We have to override this with the `encoding` option when opening a file, say if we want UTF-8 instead. That's not so easy for standard I/O, but you can easily override the default for standard I/O by setting the `PYTHONIOENCODING` environment variable. – Eryk Sun Jan 16 '19 at 05:55
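Following the last comment, a minimal sketch of overriding the standard-stream encoding through the environment rather than in code (`PYTHONIOENCODING` is a real Python environment variable; the filenames match the question):

```shell
# Force Python's standard streams to UTF-8 before piping.
# Windows (cmd.exe):
#   set PYTHONIOENCODING=utf-8
# POSIX shells:
export PYTHONIOENCODING=utf-8
cat testinput.txt | python test.py
```

This avoids touching the script at all, at the cost of having to set the variable in every environment where the script runs.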