Converting string to bytes gives UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4: invalid start byte

Question

I have a Python module that I need to port from Python 2 to Python 3. It receives a std::string from a C++ module as part of a struct. That was readable in Python 2, where the default string type was bytes. Under Python 3, however, the string gets interpreted as UTF-8 whenever I try to do anything with it.

Basically, the message deserializer is expecting a bytes-like object, but is getting a normal, albeit unreadable, string instead.

For instance, this doesn't work:

msg_raw_data = bytes(msg.raw_data, encoding='latin-1')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4: invalid start byte

Unfortunately, I cannot change the way the string comes into the module, but I don't need to read that string as an actual valid string - I just need to extract a bytes object from it without discarding any values. Is there a way to do that?

This is just a character decoding issue, try a different encoding such as `windows-1252`: https://stackoverflow.com/a/48067785/1399491 — Alex W, Jul 26 '21 at 12:30
I have tried a few different encodings that I've found in various SO questions, including `windows-1252`, `ascii`, `latin-1`, `string_escape`, `unicode_escape`, `raw_unicode_escape`, but none of them have worked so far. — fwiffo, Jul 26 '21 at 12:42
Have you tried using something like [chardet](https://pypi.org/project/chardet/) ? — Alex W, Jul 26 '21 at 14:54
No, but the problem is that unlike the person in the question you've linked, I do not have the luxury of choosing encoding when opening a file - what I get is a string object directly, though the service that is sending it is highly likely sending a `bytes` object. That's why I don't need to try and decode that as a string, instead I just need a way to extract the underlying bytes without discarding them (so can't use errors='ignore' parameter). — fwiffo, Jul 26 '21 at 17:10
This error description does not make sense. Getting a `UnicodeDecodeError` from an attempt to *encode* to bytes is **2.x**-specific behaviour. (It happens because a request to encode bytes-to-bytes is *implicitly* decoding the bytes to unicode first, so that the unicode can then be explicitly encoded.) 3.x does not do these things, because it correctly handles text as Unicode and `bytes` literals as a separate thing that is not substitutable for text without being explicitly decoded. At any rate, there is no way anyone could have diagnosed this properly without a [mre]. Voting to close. — Karl Knechtel, Jul 04 '22 at 11:17
In any event, one of the first tenets of [debugging](https://ericlippert.com/2014/03/05/how-to-debug-small-programs/) is to check your assumptions. In this case, starting with *what version of Python is actually in use*, and then `type(msg.raw_data)`. — Karl Knechtel, Jul 04 '22 at 11:19
The actual Python version was 3.6 if I recall correctly, but as stated in the original question, I was trying to convert an existing python 2 module into python 3, which had a binding with C++ code that passed a python2 valid string which became invalid in python 3 - but it was still typed as string due to how the bindings were set up. This was a last resort at trying to resolve this on python side, if there was a way to reinterpret the invalid string object into the bytes object without evaluating it as a string first - which didn't work. — fwiffo, Aug 15 '22 at 17:57

score 0 · Accepted Answer · answered Jul 27 '21 at 11:53

For lack of a better option, I had to ask the C++ team to change their Python bindings so that they return a bytes wrapper instead of std::string.
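
To illustrate why changing the return type helps, here is a minimal pure-Python sketch. It assumes the binding layer decodes the std::string as UTF-8 each time the attribute is read (pybind11, for example, does this by default for std::string), which would explain why the error appears before `bytes(...)` ever runs. The classes `StrBoundMsg` and `BytesBoundMsg` and the payload are hypothetical stand-ins, not the real bindings.

```python
class StrBoundMsg:
    """Hypothetical stand-in for a binding that exposes std::string as str."""

    def __init__(self, payload: bytes):
        self._payload = payload

    @property
    def raw_data(self) -> str:
        # Assumed behaviour: the binding decodes as UTF-8 on every access.
        return self._payload.decode("utf-8")


class BytesBoundMsg:
    """Hypothetical stand-in for the changed binding that returns bytes."""

    def __init__(self, payload: bytes):
        self._payload = payload

    @property
    def raw_data(self) -> bytes:
        # No decoding step, so arbitrary byte values pass through untouched.
        return self._payload


# Made-up binary data; 0xa0 sits at index 4 and is not a valid UTF-8 start byte.
payload = b"\x01\x02\x03\x04\xa0\xff"

try:
    # Merely reading the attribute raises, so no choice of encoding
    # passed to bytes(...) afterwards can make a difference.
    StrBoundMsg(payload).raw_data
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# With a bytes-returning binding the deserializer gets the data unchanged.
msg_raw_data = BytesBoundMsg(payload).raw_data

print(decode_error)
print(msg_raw_data == payload)
```

If the real bindings behave like the first class, nothing on the Python side can recover the bytes, because no str object is ever created. That matches the experience of trying several encodings without success.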

fwiffo
