
I have many files which are encoded with UTF-8 or GBK. My system encoding is UTF-8 (LANG=zh_CN.UTF-8), so I can read files encoded with UTF-8 easily, but I also need to read the files encoded with GBK. I'm following the approach from Python 3: How to specify stdin encoding:

import sys 
import io
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    print(line)

My question is: how can I safely read all the files (both GBK and UTF-8) from sys.stdin? Or can you suggest a better solution?

To slightly expand on this question, I want to handle files like this:

cat *.in | python3 handler.py

The glob `*.in` matches many files, each encoded with either UTF-8 or GBK.

If I use the following code in handler.py

for line in sys.stdin:
    ...some code

it will throw an error as soon as it tries to process a GBK file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte

On the other hand, if I use code like this:

input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    ...some code

it will throw an error on any UTF-8 file:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 25: illegal multibyte sequence

I want to find a safe way to handle both types of files (UTF-8 and GBK) within my script.

  • You need to open each file first as a stream of bytes, snoop inside it to work out the encoding for yourself, then close it and reopen it with the appropriate encoding. – BoarGules Jan 15 '18 at 08:36
  • Using `cat *.in` for files with different encoding is problematic, because it produces a single stream with inconsistent encoding, which is a nightmare to deal with. You should redesign your script to accept a list of file names, then the codec guessing can be done on a per-file basis, without the need to detect the points where the encoding changes. – lenz Jan 15 '18 at 08:58
  • If you have a static set of files, consider re-encoding the GBK files with UTF-8. – lenz Jan 15 '18 at 09:00
  • @YaozongLi your clarification was perfect! Well done. – Nathan Vērzemnieks Jan 15 '18 at 17:15
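For reference, here is a minimal sketch of the per-file approach suggested in the comments above. It assumes the file names are passed to handler.py as arguments (python3 handler.py *.in) rather than piped through cat; the read_text helper and the UTF-8-then-GBK fallback are illustrative choices, not the only way to guess the codec.

# handler.py -- sketch of per-file decoding; run as: python3 handler.py *.in
import sys

def read_text(path):
    """Read one whole file, trying UTF-8 first and falling back to GBK."""
    with open(path, 'rb') as f:   # binary mode: decode the bytes ourselves
        raw = f.read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('gbk')

for path in sys.argv[1:]:
    for line in read_text(path).splitlines():
        ...  # some code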

1 Answer


You can read the input as raw bytes, and then examine each line to decide which encoding to decode it with.

See also Reading binary data from stdin

Assuming you can read entire lines at a time (i.e. the encoding for an entire line can be expected to be consistent), I'd try to decode as utf-8, then fall back to gbk.

import sys

# Read raw bytes from stdin instead of letting Python decode them for you.
input_stream = sys.stdin.buffer

for raw_line in input_stream:
    try:
        line = raw_line.decode('utf-8')
    except UnicodeDecodeError:
        line = raw_line.decode('gbk')
    # ...
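As a quick, hypothetical sanity check (the sample string is made up, not taken from the question), the same text encoded both ways comes out correctly with that fallback. Note that the fallback is heuristic: a GBK line whose bytes also happen to form valid UTF-8 would be decoded as UTF-8, which is rare for Chinese text but not impossible.

# Hypothetical check: decode the same text from both encodings.
for raw in ('中文'.encode('utf-8'), '中文'.encode('gbk')):
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        text = raw.decode('gbk')
    print(text)  # prints 中文 twice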