
I have many files which are encoded with UTF-8 or GBK. My system encoding is UTF-8 (LANG=zh_CN.UTF-8), so I can read files encoded with UTF-8 easily, but I also need to read the files encoded with GBK. I'm following the approach from Python 3: How to specify stdin encoding:

import sys 
import io
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    print(line)

My question is: how can I safely read all the files (both GBK and UTF-8) from sys.stdin? Or can you suggest a better solution?

To slightly expand on this question, I want to handle files like this:

cat *.in | python3 handler.py

The glob `*.in` matches many files, each encoded with either UTF-8 or GBK.

If I use the following code in handler.py

for line in sys.stdin:
    ...some code

it will throw an error as soon as it tries to process a GBK file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte

On the other hand, if I use code like this:

input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    ...some code

it will throw an error on any UTF-8 file:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 25: illegal multibyte sequence

I want to find a safe way to handle both types of files (UTF-8 and GBK) within my script.

  • You need to open each file first as a stream of bytes, snoop inside it to work out the encoding for yourself, then close it and reopen it with the appropriate encoding. – BoarGules Jan 15 '18 at 08:36
  • Using `cat *.in` for files with different encoding is problematic, because it produces a single stream with inconsistent encoding, which is a nightmare to deal with. You should redesign your script to accept a list of file names, then the codec guessing can be done on a per-file basis, without the need to detect the points where the encoding changes. – lenz Jan 15 '18 at 08:58
  • If you have a static set of files, consider re-encoding the GBK files with UTF-8. – lenz Jan 15 '18 at 09:00
  • @YaozongLi your clarification was perfect! Well done. – Nathan Vērzemnieks Jan 15 '18 at 17:15
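For reference, here is a minimal sketch of the per-file approach suggested in the comments above. It assumes the file names are passed to handler.py as arguments (python3 handler.py *.in) rather than piped through cat; the read_text helper and the UTF-8-then-GBK fallback are illustrative choices, not the only way to guess the codec.

# handler.py -- sketch of per-file decoding; run as: python3 handler.py *.in
import sys

def read_text(path):
    """Read one whole file, trying UTF-8 first and falling back to GBK."""
    with open(path, 'rb') as f:   # binary mode: decode the bytes ourselves
        raw = f.read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('gbk')

for path in sys.argv[1:]:
    for line in read_text(path).splitlines():
        ...  # some code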

1 Answer


You can read the input as raw bytes, and then examine each line to decide which encoding to decode it with.

See also Reading binary data from stdin

Assuming you can read entire lines at a time (i.e. the encoding for an entire line can be expected to be consistent), I'd try to decode as utf-8, then fall back to gbk.

import sys

# Read raw bytes from stdin instead of letting Python decode them for you.
input_stream = sys.stdin.buffer

for raw_line in input_stream:
    try:
        line = raw_line.decode('utf-8')
    except UnicodeDecodeError:
        line = raw_line.decode('gbk')
    # ...
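As a quick, hypothetical sanity check (the sample string is made up, not taken from the question), the same text encoded both ways comes out correctly with that fallback. Note that the fallback is heuristic: a GBK line whose bytes also happen to form valid UTF-8 would be decoded as UTF-8, which is rare for Chinese text but not impossible.

# Hypothetical check: decode the same text from both encodings.
for raw in ('中文'.encode('utf-8'), '中文'.encode('gbk')):
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        text = raw.decode('gbk')
    print(text)  # prints 中文 twice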