My Python 3 application reads from stdin, which is fed by an external device. The character stream can sometimes contain accented characters; the immediate problem is the byte 0xE9, a Latin-1 accented e (é). The application looks roughly like this:
while True:
    for raw_line in sys.stdin:
        self.__process_line(raw_line)
When the input line containing 0xE9 is encountered, this error occurs:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 7: invalid continuation byte
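For what it is worth, the failure is easy to reproduce by hand. A quick check, assuming a typical UTF-8 locale, of what sys.stdin expects and why a lone 0xE9 byte trips the decoder:

import sys

# sys.stdin is a text stream whose encoding comes from the locale,
# typically UTF-8 on Linux.
print(sys.stdin.encoding)

# 0xE9 is 'é' in Latin-1, but on its own it is not a valid UTF-8 sequence,
# so decoding a line that contains it raises the same error as above:
b"caf\xe9 latte".decode("utf-8")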
So presumably the incoming data is not actually UTF-8, even though sys.stdin decodes it as UTF-8 by default. If I change this line of code to:
self.__process_line(raw_line.encode('latin1'))
as discussed in other posts on SO related to this error, then this piece of code is happy, but the regexes that process the line fail:
__start_order_re = re.compile(r"\AStart Order\s+\d+\Z")
m = self.__start_order_re.match(line)
if m is not None:
    Receipt.__logger("Start of order found")
    self.__reset_state()
TypeError: cannot use a string pattern on a bytes-like object
If I change the regexes to be byte strings like:
__start_order_re = re.compile(rb"\AStart Order\s+\d+\Z")
then they succeed, but the errors cascade: every string that interacts with the line fails in the same way and needs its own conversion. This seems wrong. It seems like I should not need to sprinkle b"..." literals and explicit encode/decode calls all over the code.
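My guess is that the decoding belongs at the single point where bytes enter the program rather than in every consumer. A minimal sketch of what I have in mind, assuming the device really does send Latin-1 (process_line stands in for self.__process_line):

import io
import sys

# Re-wrap the underlying byte stream with an explicit encoding, so that
# iterating over it yields ordinary str lines and the str-based regexes
# downstream keep working unchanged.
stdin_latin1 = io.TextIOWrapper(sys.stdin.buffer, encoding="latin-1")

for raw_line in stdin_latin1:
    process_line(raw_line)  # stand-in for self.__process_line

But I am not sure whether this is the right place to do it, or whether latin-1 is even the right assumption for the device.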
What am I doing wrong here? Where should the character-set issue be handled so that the code works without explicitly managing bytes data all over the place?