
I have a Python module that I need to port from Python 2 to Python 3. The problem is that it receives an std::string from a C++ module as part of a struct. That was readable in Python 2, where the default string type was bytes. Under Python 3, however, the string gets decoded as UTF-8 whenever I try to do anything with it.

Basically, the message deserializer is expecting a bytes-like object, but is getting a normal, albeit unreadable, string instead.

For instance, this doesn't work:

msg_raw_data = bytes(msg.raw_data, encoding='latin-1')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4: invalid start byte

Unfortunately, I cannot change the way the string comes into the module, but I don't need to read that string as an actual valid string - I just need to extract a bytes object from it without discarding any values. Is there a way to do that?

Karl Knechtel
fwiffo
  • This is just a character decoding issue, try a different encoding such as `windows-1252`: https://stackoverflow.com/a/48067785/1399491 – Alex W Jul 26 '21 at 12:30
  • I have tried a few different encodings that I've found in various SO questions, including `windows-1252`, `ascii`, `latin-1`, `string_escape`, `unicode_escape`, `raw_unicode_escape`, but none of them have worked so far. – fwiffo Jul 26 '21 at 12:42
  • Have you tried using something like [chardet](https://pypi.org/project/chardet/) ? – Alex W Jul 26 '21 at 14:54
  • No, but the problem is that unlike the person in the question you've linked, I do not have the luxury of choosing encoding when opening a file - what I get is a string object directly, though the service that is sending it is highly likely sending a `bytes` object. That's why I don't need to try and decode that as a string, instead I just need a way to extract the underlying bytes without discarding them (so can't use errors='ignore' parameter). – fwiffo Jul 26 '21 at 17:10
  • This error description does not make sense. Getting a `UnicodeDecodeError` from an attempt to *encode* to bytes is **2.x**-specific behaviour. (It happens because a request to encode bytes-to-bytes is *implicitly* decoding the bytes to unicode first, so that the unicode can then be explicitly encoded.) 3.x does not do these things, because it correctly handles text as Unicode and `bytes` literals as a separate thing that is not substitutable for text without being explicitly decoded. At any rate, there is no way anyone could have diagnosed this properly without a [mre]. Voting to close. – Karl Knechtel Jul 04 '22 at 11:17
  • In any event, one of the first tenets of [debugging](https://ericlippert.com/2014/03/05/how-to-debug-small-programs/) is to check your assumptions. In this case, starting with *what version of Python is actually in use*, and then `type(msg.raw_data)`. – Karl Knechtel Jul 04 '22 at 11:19
  • The actual Python version was 3.6 if I recall correctly, but as stated in the original question, I was trying to convert an existing python 2 module into python 3, which had a binding with C++ code that passed a python2 valid string which became invalid in python 3 - but it was still typed as string due to how the bindings were set up. This was a last resort at trying to resolve this on python side, if there was a way to reinterpret the invalid string object into the bytes object without evaluating it as a string first - which didn't work. – fwiffo Aug 15 '22 at 17:57

1 Answer

For lack of a better option, I had to ask the C++ team to change their Python bindings to return a bytes wrapper instead of std::string on their side.
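
For reference, a Python-side recovery is only possible when the binding actually hands over a `str` object, and only when it built that `str` with a lossless error handler. The sketch below assumes the binding decoded the raw bytes as UTF-8 with `surrogateescape`. That is an assumption about the binding, not something confirmed for this module, and `recover_bytes` is a hypothetical helper name. If the binding instead raises `UnicodeDecodeError` as soon as the attribute is accessed, there is no `str` to convert, and changing the binding is the only fix.

```python
def recover_bytes(text: str) -> bytes:
    """Return the original bytes of a str decoded as UTF-8 with surrogateescape.

    Assumption: the producer used errors="surrogateescape", so every
    undecodable byte was mapped to a lone surrogate (U+DC80..U+DCFF).
    """
    return text.encode("utf-8", errors="surrogateescape")


# Simulate what such a binding would produce from non-UTF-8 payload bytes.
# Byte 0xa0 sits at position 4, as in the error message from the question.
raw = b"\x01\x02\x03\x04\xa0\xff"
as_text = raw.decode("utf-8", errors="surrogateescape")

# The round trip gives back every byte unchanged.
assert recover_bytes(as_text) == raw
```

The same round trip does not work with `errors="ignore"` or `errors="replace"`, because those handlers discard or overwrite the undecodable bytes when the `str` is created.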

fwiffo