My Python 3 application reads from stdin, which is fed by an external device. The character stream can sometimes contain accented characters; the immediate problem is the byte 0xE9, a Latin-1 accented e (é). The application looks roughly like this:
while True:
    for raw_line in sys.stdin:
        self.__process_line(raw_line)
When the input line containing 0xE9 is encountered, this error occurs:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 7: invalid continuation byte
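For what it is worth, the failure is easy to reproduce by hand. A quick check, assuming a typical UTF-8 locale, of what sys.stdin expects and why a lone 0xE9 byte trips the decoder:

import sys

# sys.stdin is a text stream whose encoding comes from the locale,
# typically UTF-8 on Linux.
print(sys.stdin.encoding)

# 0xE9 is 'é' in Latin-1, but on its own it is not a valid UTF-8 sequence,
# so decoding a line that contains it raises the same error as above:
b"caf\xe9 latte".decode("utf-8")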
So presumably the incoming data is not actually UTF-8, even though sys.stdin decodes it as UTF-8 by default. If I change this line of code to:
self.__process_line(raw_line.encode('latin1'))
as discussed in other posts on SO related to this error, then this piece of code is happy, but the regexes that process the line fail:
__start_order_re = re.compile(r"\AStart Order\s+\d+\Z")
m = self.__start_order_re.match(line)
if m is not None:
    Receipt.__logger("Start of order found")
    self.__reset_state()
TypeError: cannot use a string pattern on a bytes-like object
If I change the regexes to be byte strings like:
__start_order_re = re.compile(rb"\AStart Order\s+\d+\Z")
then they succeed, but the errors cascade: every string that interacts with the line fails in the same way and needs its own conversion. This seems wrong. It seems like I should not need to sprinkle b"..." literals and explicit encode/decode calls all over the code.
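My guess is that the decoding belongs at the single point where bytes enter the program rather than in every consumer. A minimal sketch of what I have in mind, assuming the device really does send Latin-1 (process_line stands in for self.__process_line):

import io
import sys

# Re-wrap the underlying byte stream with an explicit encoding, so that
# iterating over it yields ordinary str lines and the str-based regexes
# downstream keep working unchanged.
stdin_latin1 = io.TextIOWrapper(sys.stdin.buffer, encoding="latin-1")

for raw_line in stdin_latin1:
    process_line(raw_line)  # stand-in for self.__process_line

But I am not sure whether this is the right place to do it, or whether latin-1 is even the right assumption for the device.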
What am I doing wrong here? Where should the character-set issue be handled so that the code works without explicitly managing bytes data all over the place?