I am reading some mainly HEX input into a Python3 script. However, the system is set to use UTF-8, and when piping from the Bash shell into the script, I keep getting the following UnicodeDecodeError:
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)
I'm using sys.stdin.read() in Python3 to read the piped input, as suggested by other SO answers, like this:
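import sys
...
isPipe = 0
# stdin is not a terminal, so something is being piped in
if not sys.stdin.isatty():
    isPipe = 1
    try:
        # read() decodes the piped bytes using the locale encoding (UTF-8 here)
        inpipe = sys.stdin.read().strip()
    except UnicodeDecodeError as e:
        err_unicode(e)  # error handler from the rest of the script (not shown)
...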
It works when piping this way:
# echo "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
<output all ok!>
However, using the raw format doesn't:
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1"
▒▒▒
▒▒
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)
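The difference between the two invocations can be reproduced in Python alone. This is only a sketch, assuming that Bash's builtin echo without -e passes the backslash escapes through as literal ASCII text, while echo -en expands them into raw bytes; the variable names are made up for illustration.

```python
# Assumed output of `echo "\xed\xff..."`: the literal characters
# backslash, x, e, d, ... plus a trailing newline. All plain ASCII.
literal = b"\\xed\\xff\\xff\\x0b\\x04\\x00\\xa0\\xe1\n"
print(len(literal), literal.decode("utf-8").strip())

# Assumed output of `echo -en "\xed\xff..."`: eight raw bytes, no newline.
raw = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as e:
    decoded_ok = False
    print(len(raw), "bytes, decode failed:", e.reason)
```

In the first case the script receives text that merely describes the bytes, so it decodes fine; in the second it receives the bytes themselves.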
I also tried other promising SO answers:
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "open(1,'w').write(open(0).read())"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "from io import open; open(1,'w').write(open(0).read())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
From what I have learned so far, when your terminal encounters the start of a UTF-8 sequence, it expects it to be followed by 1-3 other bytes, like this:
UTF-8 is a variable width character encoding capable of encoding all valid code points in Unicode using one to four 8-bit bytes. So the leading byte (the first byte of a multi-byte UTF-8 character, in the range 0xC2 - 0xF4) is expected to be followed by 1-3 continuation bytes, in the range 0x80 - 0xBF.
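To make the byte ranges above concrete, here is a simplified sketch that classifies the first two bytes of the failing input. The function names are hypothetical, and the check deliberately ignores the extra restrictions that real decoders apply to some leading bytes (overlong forms, surrogates).

```python
def expected_continuations(lead: int) -> int:
    """Return how many continuation bytes a UTF-8 leading byte calls for.

    Returns 0 for ASCII and -1 for bytes that cannot start a sequence.
    """
    if lead <= 0x7F:
        return 0   # single-byte character (ASCII)
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    return -1      # 0x80-0xC1 and 0xF5-0xFF never start a sequence


def is_continuation(byte: int) -> bool:
    """True if the byte lies in the continuation range 0x80-0xBF."""
    return 0x80 <= byte <= 0xBF


data = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"
print(expected_continuations(data[0]))  # 0xed calls for 2 continuation bytes
print(is_continuation(data[1]))         # 0xff is outside 0x80-0xBF
```

So 0xed announces a three-byte character, but the next byte, 0xff, is not a continuation byte, which matches the "invalid continuation byte" in the error message.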
However, I cannot always be sure where my input stream comes from, and it may very well be raw data and not the ASCII HEX'ed version as above. So I need to deal with this raw input somehow.
I've looked at a few alternatives, like:
- to use codecs.decode
- to use open("myfile.jpg", "rb", buffering=0) with raw I/O
- using bytes.decode(encoding="utf-8", errors="ignore") from bytes
- or just using open(...)
But I don't know if or how they could read a piped input stream, like I want.
How can I make my script also handle a raw byte stream?
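For illustration, a minimal sketch of one direction the alternatives above point to: reading the pipe as bytes via sys.stdin.buffer (the binary stream underneath sys.stdin) and only then deciding how to decode. The helper names read_piped_bytes and to_text are hypothetical, and falling back to a hex string is just one arbitrary choice, not a recommendation. An in-memory stream stands in for the pipe so the example runs on its own.

```python
import io
import sys


def read_piped_bytes(stream=None) -> bytes:
    """Read everything from a binary stream (default: stdin's byte buffer)."""
    if stream is None:
        stream = sys.stdin.buffer  # no decoding happens at this layer
    return stream.read()


def to_text(data: bytes) -> str:
    """Decode as UTF-8 if possible, else fall back to a hex string."""
    try:
        return data.decode("utf-8").strip()
    except UnicodeDecodeError:
        return data.hex()  # arbitrary fallback chosen for this sketch


# io.BytesIO stands in for the raw bytes from `echo -en ... | some.py`
fake_pipe = io.BytesIO(b"\xed\xff\xff\x0b\x04\x00\xa0\xe1")
print(to_text(read_piped_bytes(fake_pipe)))  # prints edffff0b0400a0e1
```

In the script itself, the call would presumably be read_piped_bytes() with no argument, inside the existing isatty() check.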
PS. Yes, I have read loads of similar SO issues, but none of them adequately deals with this UTF-8 input error. The best one is this one.
This is not a duplicate.