Python 3 - Decoding bytes that contain a mix of hex and unicode

Question

I'm porting a codebase for the Lasersaur laser cutter from Python2 to Python3, and I'm having a bit of trouble decoding the serial data coming in from the onboard Arduino. The data comes in as a byte stream that mixes hex and unicode data, like so:

bytes: b'AC\xfb\xff\xff\xbfx\x85\x80\x80\xc0y\x80\x80\x80\xc0z'
data:  A C 251 255 255 x 133 128 128 y 128 128 128 z

Python2 was able to steamroll over the mixed-type data, and read in the serial data as a string of characters, after which ord() was used to determine if the character represented data or a status character. You can see how this is implemented in the original Python2 code starting at line 367 here.

ord(data): 65 67 251 255 255 120 133 128 128 121 128 128 128 122

Python3 is more stringent about encodings, and raises the following error when I try bytes.decode('utf-8'): it gets to the first non-ASCII byte, b'\xfb', and chokes because that byte is not valid UTF-8. Messing around with a few different codecs doesn't give any better results.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 2: invalid start byte

I found this StackOverflow thread which addresses what I need pretty much exactly, but it seems that the error handling in Python 3 no longer works the same, and throws me this error when I try the solution there: TypeError: 'UnicodeDecodeError' object is not subscriptable.

I could modify the onboard code on the Arduino to get a saner serial encoding, but the main reason I'm porting to Python3 is that I can't get the right (read: old) Python2 libraries to execute the code. I don't want to inadvertently end up in a state where it's impossible to communicate with the onboard Arduino.

What I'd like to do is mimic the original functionality as closely as possible, and get out either a string of characters that I can call ord() on, or a mix of characters and numbers in a list. I'm a bit lost on how to do this.

Thierry Lathuille · Accepted Answer · 2019-05-31T19:23:04.600

You don't have 'mixed' data, you have a bytes object. When you print it, Python displays every byte whose value corresponds to a printable ASCII character as that character, and every other byte as a \xNN escape, to help us identify text within it.
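
As a small sketch of that display rule (the values are taken from the start of your data, and the variable names are only for illustration), building a bytes object from plain integers gives the same printed form:

```python
# Build a bytes object from plain integers; no text is involved.
raw = bytes([65, 67, 251, 255])

# Printable ASCII values are shown as characters, the rest as \xNN escapes.
print(raw)
# b'AC\xfb\xff'

# b'\xbfx' is two bytes: 0xbf followed by 0x78, which is displayed as 'x'.
pair = b'\xbfx'
print(len(pair), list(pair))
# 2 [191, 120]
```

This is also why the x, y and z in your data look glued to the preceding escape: they are separate bytes that happen to be printable.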

You can access any individual byte as an integer by indexing:

data = b'AC\xfb\xff\xff\xbfx\x85\x80\x80\xc0y\x80\x80\x80\xc0z'

print(data[0])
# 65

The value is returned as an integer (here 65, which corresponds to 'A' in ASCII, hence its representation in the bytes literal).
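
One detail worth keeping in mind, shown here as a short sketch on a shortened copy of your data: indexing gives an int, while slicing gives another bytes object.

```python
data = b'AC\xfb\xff\xff\xbfx'

first = data[0]          # indexing gives an int
first_slice = data[0:1]  # slicing gives a bytes object of length 1

print(first, first_slice)
# 65 b'A'

# chr() turns the integer back into a one-character string if needed.
print(chr(first))
# A
```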

So, a simple way to convert the bytes to a list of integers would be:

data_as_int = [b for b in data]

or even simpler:

data_as_int = list(data)

which gives us:

print(data_as_int)
# [65, 67, 251, 255, 255, 191, 120, 133, 128, 128, 192, 121, 128, 128, 128, 192, 122]
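
With integers in hand, the check your Python2 code did with ord() can be done directly on each value. The following is only a sketch: the function name classify and the threshold of 128 are assumptions based on your sample data, not taken from the Lasersaur source, so adjust them to whatever the original code actually tests.

```python
def classify(data, threshold=128):
    """Label each byte as a status character or a data value.

    threshold=128 is an assumption drawn from the sample data only.
    """
    result = []
    for b in data:  # iterating over bytes yields ints
        if b < threshold:
            result.append(('status', chr(b)))
        else:
            result.append(('data', b))
    return result

sample = b'AC\xfb\xff\xff\xbfx'
print(classify(sample))
# [('status', 'A'), ('status', 'C'), ('data', 251), ('data', 255), ('data', 255), ('data', 191), ('status', 'x')]
```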

About your idea of converting the bytes to a string in order to use ord afterwards: you can do it, but you have to use an encoding like latin1, in which each byte corresponds to exactly one character. That is not the case with utf8, where a character can span several bytes. So, you could have done something like:

data_as_int = [ord(c) for c in data.decode('latin1')]

but this is less direct than the above solution.
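
As a quick check that both routes agree on your data (the variable names are only illustrative):

```python
data = b'AC\xfb\xff\xff\xbfx\x85\x80\x80\xc0y\x80\x80\x80\xc0z'

via_list = list(data)
via_latin1 = [ord(c) for c in data.decode('latin1')]

print(via_list == via_latin1)
# True

# latin1 maps every byte value 0-255 to one character, so it round-trips.
print(data.decode('latin1').encode('latin1') == data)
# True
```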

Thank you! The print representation of the bytes object was confusing me a bit (why did the leading AC resolve as characters but not xyz later on?), and I assumed that indexing into it wouldn’t work nicely. Your explanation clears that up for me, and I think I should be able to work with it from here. Appreciate it! — Scott, May 31 '19 at 19:37
