How to prevent "UnicodeDecodeError" when reading piped input from sys.stdin?

Question

I am reading mostly hex input into a Python 3 script. However, the system is set to use UTF-8, and when piping from the Bash shell into the script, I keep getting the following UnicodeDecodeError:

UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

Following other SO answers, I'm using sys.stdin.read() in Python 3 to read the piped input, like this:

import sys
...
isPipe = 0
if not sys.stdin.isatty() :
    isPipe = 1
    try:
        inpipe = sys.stdin.read().strip()
    except UnicodeDecodeError as e:
        err_unicode(e)
...

It works when piping this way:

# echo "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
<output all ok!>

However, using the raw format doesn't:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1"

    ▒▒▒
   ▒▒

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | some.py
UnicodeDecodeError: ('utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte)

I also tried other promising SO answers:

# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "open(1,'w').write(open(0).read())"
# echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "from io import open; open(1,'w').write(open(0).read())"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

From what I have learned so far, when a UTF-8 decoder encounters a leading byte, it expects that byte to be followed by 1-3 other bytes, like this:

UTF-8 is a variable width character encoding capable of encoding all valid code points in Unicode using one to four 8-bit bytes. So a leading byte (the first byte of a multi-byte sequence, in the range 0xC2 - 0xF4) must be followed by 1-3 continuation bytes, in the range 0x80 - 0xBF.
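To make the rule above concrete, here is a minimal sketch that decodes the same eight bytes used in the failing echo -en example. 0xED announces a three-byte sequence, but the next byte, 0xFF, is not a continuation byte, so decoding stops at position 0. The variable names are only illustrative.

```python
# The raw bytes from the failing `echo -en` example.
raw = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"

failure = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    failure = exc
    # exc.start is the offset of the offending byte, exc.reason the cause.
    print(f"byte 0x{raw[exc.start]:02x} at position {exc.start}: {exc.reason}")

# Without -e, echo sends the escape sequences as literal text
# (backslash, 'x', two hex digits). That is plain ASCII, so it decodes fine.
escaped = rb"\xed\xff\xff\x0b\x04\x00\xa0\xe1"
print(escaped.decode("utf-8"))
```

This is why the first echo variant works and the second does not: the first never puts a byte above 0x7F into the pipe.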

However, I cannot always be sure where my input stream comes from, and it may very well be raw data rather than the ASCII hex-escaped version above. So I need to deal with this raw input somehow.

I've looked at a few alternatives, like:

to use codecs.decode
to use open("myfile.jpg", "rb", buffering=0) with raw i/o
using bytes.decode(encoding="utf-8", errors="ignore") from bytes
or just using open(...)

But I don't know if or how they could read a piped input stream, like I want.

How can I make my script also handle a raw byte stream?

PS. Yes, I have read loads of similar SO questions, but none of them adequately deals with this UTF-8 input error. The best one is this one.

This is not a duplicate.

It doesn’t matter if (some of) your input happens to be hexadecimal numbers. But by “raw” you mean arbitrary *binary* input, right? — Davis Herring, Oct 27 '18 at 21:57
@DavisHerring Yes, *binary*. However, I don't agree that my question is a duplicate just because there *may* be an embedded answer remotely related to mine, within it. The question (you linked) as formulated, is completely different from mine, and it's very unlikely anyone would search for those words when encountering my problem or error. — not2qubit, Oct 28 '18 at 05:42
It's hardly "remotely related": that question concerns reading _and_ writing binary data, but the first three sentences of the one answer answer this question entirely. And I found it by searching for terms related to this question, although I agree that its title is a bit lacking for a "canonical `buffer` question". — Davis Herring, Oct 28 '18 at 14:55

not2qubit · Accepted Answer · 2018-10-29T23:06:49.080

I finally managed to work around this issue by not using sys.stdin!

Instead I used with open(0, 'rb'), where:

0 is the file descriptor for stdin.
'rb' opens it in binary mode for reading.

This seems to circumvent the problem of Python trying to decode the piped bytes using the locale's encoding. I got the idea after seeing that the following worked, and returned the correct (non-printable) characters:

echo -en "\xed\xff\xff\x0b\x04\x00\xa0\xe1" | python3 -c "with open(0, 'rb') as f: x=f.read(); import sys; sys.stdout.buffer.write(x);"

▒▒▒
   ▒▒

So to correctly read any pipe data, I used:

if not sys.stdin.isatty():
    try:
        with open(0, 'rb') as f:
            inpipe = f.read()
    except Exception as e:
        err_unknown(e)
    # This can't happen in binary mode:
    #except UnicodeDecodeError as e:
    #    err_unicode(e)
...

That will read your pipe data into a Python bytes object.
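The difference between the two modes can be checked without a shell. The following is a minimal sketch that feeds the same bytes through an os.pipe() and reads them once in text mode and once in binary mode. The helper pipe_with is hypothetical and stands in for the echo -en side of the pipeline; the encoding is set explicitly to UTF-8 so the result does not depend on the locale.

```python
import os

PAYLOAD = b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"

def pipe_with(payload):
    """Hypothetical helper: return the read end of a pipe holding payload."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, payload)
    os.close(write_fd)  # signal end-of-file to the reader
    return read_fd

# Text mode: the bytes are decoded while reading, which fails here.
try:
    with open(pipe_with(PAYLOAD), "r", encoding="utf-8") as f:
        f.read()
    text_mode_ok = True
except UnicodeDecodeError:
    text_mode_ok = False

# Binary mode: no decoding takes place, so every byte arrives unchanged.
with open(pipe_with(PAYLOAD), "rb") as f:
    data = f.read()

print(text_mode_ok, data)
```

In the real script the file descriptor would be 0 (stdin) instead of the read end of a test pipe.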

The next problem was to determine whether the pipe data came from a character string (like echo "BADDATA0") or from a binary stream. The latter can be emulated by echo -ne "\xBA\xDD\xAT\xA0", as shown in the question. In my case I just used a regex to look for any bytes other than ASCII hex digits and spaces.

import re

if inpipe:
    # Match any run of bytes that are not ASCII hex digits or spaces
    rx = re.compile(b'[^0-9a-fA-F ]+')
    r = rx.findall(inpipe.strip())
    if not r:
        print("is probably a HEX ASCII string")
    else:
        print("is something else, possibly binary")

Surely this could be done better and smarter. (Feel free to comment!)
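One possible alternative to the regex, offered only as a sketch and not as the method used above, is to let bytes.fromhex do the validation. It is stricter than the regex: it also rejects an odd number of hex digits. The function name classify_pipe_data and the "hex"/"binary" labels are made up for this example.

```python
def classify_pipe_data(data: bytes):
    """Return ("hex", decoded_bytes) if data is ASCII hex text,
    otherwise ("binary", data) with the input unchanged."""
    try:
        text = data.decode("ascii").strip()
        if text:
            # fromhex skips spaces between byte pairs and raises
            # ValueError on anything that is not a hex digit pair.
            return "hex", bytes.fromhex(text)
    except ValueError:
        # UnicodeDecodeError is a subclass of ValueError, so non-ASCII
        # input and malformed hex both end up here.
        pass
    return "binary", data

print(classify_pipe_data(b"ed ff ff 0b\n"))
print(classify_pipe_data(b"\xed\xff\xff\x0b"))
```

Note that the guess is inherently ambiguous: a binary stream that happens to consist only of hex digit characters would be classified as hex text.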

Addendum: (from here)

mode is an optional string that specifies the mode in which the file is opened. It defaults to r which means open for reading in text mode. In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The default mode is 'r' (open for reading text, synonym of 'rt'). For binary read-write access, the mode w+b opens and truncates the file to 0 bytes. r+b opens the file without truncation.

... Python distinguishes between binary and text I/O. Files opened in binary mode (including b in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when t is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

If closefd is False and a file descriptor rather than a filename was given, the underlying file descriptor will be kept open when the file is closed. If a filename is given, closefd must be True (the default) otherwise an error will be raised.
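The effect of closefd can be observed with a throwaway pipe instead of the real stdin. This is a minimal sketch; make_pipe and fd_is_open are hypothetical helpers written for the demonstration.

```python
import os

def fd_is_open(fd):
    """Return True if fd still refers to an open file descriptor."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False

def make_pipe(payload):
    """Hypothetical helper: return the read end of a pipe holding payload."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, payload)
    os.close(write_fd)
    return read_fd

# Default (closefd=True): leaving the with block closes the descriptor.
fd_a = make_pipe(b"\xed\xff")
with open(fd_a, "rb") as f:
    data_a = f.read()
closed_by_default = not fd_is_open(fd_a)

# closefd=False: the file object is closed, the descriptor stays open.
fd_b = make_pipe(b"\xed\xff")
with open(fd_b, "rb", closefd=False) as f:
    data_b = f.read()
kept_open = fd_is_open(fd_b)
os.close(fd_b)  # clean up explicitly

print(closed_by_default, kept_open)
```

With descriptor 0, the first variant would leave the process without a usable stdin after the with block, which only matters if something reads from stdin again later.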

You should pass closefd=False into open so that the with statement doesn't close stdin when it finishes. Also opening and reading a file in binary can't raise a UnicodeDecodeError. That gets thrown when a bytes is decoded into a string, which occurs when you read a file as text (use open without 'b' and read the file) or when you use the bytes.decode function. — daz, Oct 29 '18 at 20:56
@daz Yes, I see now that my first `open()` trials were not using the `b` flag. Also, using `closefd=False` doesn't seem to make any difference. So why do you think that's important? Then again I haven't tried interrupting the flow from the input. — not2qubit, Oct 29 '18 at 23:02
It probably won't matter for whatever script you are using. But without closefd, stdin is closed by the with statement so you wouldn't be able to use it afterwards. The standard streams are always expected to be opened. — daz, Oct 30 '18 at 00:49

score 3 · Answer 2 · answered Oct 28 '18 at 06:05

Here is a hacky way to read stdin in binary like a file:

import sys

with open(sys.stdin.fileno(), mode='rb', closefd=False) as stdin_binary:
    raw_input = stdin_binary.read()
try:
    # text is the string formed by decoding raw_input as UTF-8
    text = raw_input.decode('utf-8')
except UnicodeDecodeError:
    # raw_input is not valid UTF-8, do something else with it
    pass  # placeholder so the except block is syntactically valid
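What "something else" means depends on the application. As one illustration only, the sketch below falls back to Latin-1, which maps every byte value to a code point and therefore never fails and can be reversed exactly. The function name decode_with_fallback is hypothetical.

```python
def decode_with_fallback(raw: bytes):
    """Return (text, encoding_used). Tries UTF-8 first, then Latin-1."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 assigns a character to each of the 256 byte values,
        # so this cannot raise and round-trips back to the same bytes.
        return raw.decode("latin-1"), "latin-1"

text, used = decode_with_fallback(b"\xed\xff\xff\x0b\x04\x00\xa0\xe1")
print(used, repr(text))
```

If the data is truly binary, keeping it as bytes is usually simpler than forcing it into a str at all.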

score 2 · Answer 3 · edited Jul 08 '20 at 17:16

Use sys.stdin.buffer.raw instead of sys.stdin
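For context, sys.stdin is normally a text wrapper around a buffered binary stream, exposed as sys.stdin.buffer, which in turn sits on top of the unbuffered sys.stdin.buffer.raw. The sketch below reads through the buffer layer rather than raw, which is a swapped-in choice for the example and not what this answer recommends. A fake stdin built with io.TextIOWrapper is used so the example runs without a pipe; read_binary is a hypothetical helper.

```python
import io
import sys

def read_binary(text_stream=None):
    """Return all remaining bytes from the binary layer of a text stream."""
    if text_stream is None:
        text_stream = sys.stdin
    # .buffer bypasses the decoding step, so no UnicodeDecodeError can occur.
    return text_stream.buffer.read()

# Stand-in for a piped stdin carrying bytes that are not valid UTF-8.
fake_stdin = io.TextIOWrapper(
    io.BytesIO(b"\xed\xff\xff\x0b\x04\x00\xa0\xe1"), encoding="utf-8"
)
data = read_binary(fake_stdin)
print(data)
```

In a script, calling read_binary() with no argument would read from the real sys.stdin, assuming it has not been replaced by an object without a buffer attribute.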

edited Jul 08 '20 at 17:16 by Panagiotis Simakis · answered Jul 08 '20 at 14:52 by tian lan

Adding some information to your answer, for example why it is better to use `sys.stdin.buffer.raw` over `sys.stdin` would make this a much better answer. [From review](https://stackoverflow.com/review/low-quality-posts/26624093) – Pranav Hosangadi Jul 08 '20 at 20:32
