I have many files encoded in either UTF-8 or GBK. My system encoding is UTF-8 (LANG=zh_CN.UTF-8), so I can read the UTF-8 files easily, but I also need to read the GBK files. I'm following the approach from "Python 3: How to specify stdin encoding":
import sys
import io
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    print(line)
My question is: how can I safely read all of the files (both GBK and UTF-8) from sys.stdin? Or is there a better solution?
To slightly expand on this question, I want to handle files like this:
cat *.in | python3 handler.py
Here *.in matches many files, each encoded in either UTF-8 or GBK.
If I use the following code in handler.py:
for line in sys.stdin:
    ...  # some code
it will throw an error as soon as it tries to process a GBK file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte
On the other hand, if I use code like this:
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    ...  # some code
it will throw an error on any UTF-8 file:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 25: illegal multibyte sequence
I want to find a safe way to handle both types of files (UTF-8 and GBK) within my script.
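To make the goal concrete, here is a minimal sketch of one possible approach, not a definitive solution: read raw bytes and decode each line separately, trying strict UTF-8 first and falling back to GBK. The helper names decode_line and decode_lines are hypothetical. It assumes that no GBK line happens to also be valid UTF-8 (possible in principle for short byte sequences), and it uses errors='replace' on the fallback so that bytes valid in neither encoding do not raise.

```python
import io


def decode_line(raw: bytes) -> str:
    """Decode one line of bytes, trying strict UTF-8 first, then GBK."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume GBK; undecodable bytes become U+FFFD.
        return raw.decode('gbk', errors='replace')


def decode_lines(binary_stream):
    """Yield decoded text lines from any iterable of byte lines."""
    for raw in binary_stream:
        yield decode_line(raw)


# Demo: simulate `cat` output that mixes a UTF-8 line and a GBK line.
sample = io.BytesIO('中文\n'.encode('utf-8') + '中文\n'.encode('gbk'))
lines = list(decode_lines(sample))
```

In handler.py the same generator would be fed sys.stdin.buffer (after import sys), for example: for line in decode_lines(sys.stdin.buffer). UTF-8 is tried first because it is the stricter of the two encodings; GBK will often decode UTF-8 bytes without an error and silently produce wrong characters. Splitting on the newline byte is safe for both encodings, since 0x0A never appears inside a multibyte sequence in either one.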