I have a string in Python that contains some literal escaped bytes and some literal ASCII characters; e.g., print(s) gives:

The message - \xe3\x83\xa9\xe3\x82\xa4 ... that was the message

Is there an easy way to reinterpret this string as bytes in Python and then decode it as UTF-8? Or do I have to manually search for and separate out the escaped bytes myself?

If I could declare the string as a literal, I would be fine, e.g.,

b"The message - \xe3\x83\xa9\xe3\x82\xa4 ... that was the message".decode('utf-8')

but unfortunately I have a string variable.

Matt
  • Your example for the string literal does not seem to work on Python 3.6 – vidstige May 12 '18 at 05:07
  • @user202729 - Not at all. If you use bytes(str) you have to specify the encoding, which will convert the literal byte characters to bytes as opposed to their unicode equivalent. What makes this different is that we are dealing with a string that includes characters that describe bytes, as opposed to the bytes themselves. – Matt May 12 '18 at 05:09
  • @vidstige - my apologies, I truncated the bytes incorrectly - I've updated the question. – Matt May 12 '18 at 05:12
  • @Matt Just use a suitable encoding. – user202729 May 12 '18 at 05:12
  • The accepted answer there also says that you can use an iterable, and it's possible to convert a string to an iterable. – user202729 May 12 '18 at 05:13
  • @user202729 You're still stuck with literal bytes as opposed to unicode characters, regardless of the encoding. The bytes in my string are literally the characters, e.g., "\" and "x" and "3" for the first byte shown there. You can iterate over it but you'll have to detect the "\x" sequence, find the end of the byte and convert each one individually. There must be a better way. – Matt May 12 '18 at 05:15
  • Can you explain why a variable won't work, but the literal work? Can you give the error message? – vidstige May 12 '18 at 05:15
  • Just to be clear, you mean to say `print(my_string)` gives what you are showing? – juanpa.arrivillaga May 12 '18 at 05:16
  • [That works for me](https://tio.run/##pc6xboNADAbg/Z7iVzoAUjkpTYdk4C06shhw4KTkDp1NAk9PTyjp0CFLBku/LP@fPC46BH9YV1FU2P0MjCuLUM8oUc98qOdjGjo98lfK37DWQgdS3ElS@OvsjOF55Fa5S1rzNmc7bkPHeTbpuTxmhTEfmIRBkMkpNRcG@3ThfG/G6LzmonbbpM6F1PlynxX/FVQVnm9uZBv8jaNCA5xy3NxwRvK45ygPulmUKUZa8iuNeYjdp2jxGl/XXw). No idea what your problem is. You're not clear enough. – user202729 May 12 '18 at 05:18
  • @vidstige - There is no error message, I'm just stuck with a string with ascii characters representing bytes as opposed to a unicode string. I've just worked out a solution, posting it now. – Matt May 12 '18 at 05:19
  • Then probably [this](https://stackoverflow.com/questions/1885181/how-do-i-un-escape-a-backslash-escaped-string-in-python). – user202729 May 12 '18 at 05:19
  • @juanpa.arrivillaga yes, without the quotation marks, but for the first example, yes. I'll update the question. – Matt May 12 '18 at 05:20
  • @user202729 close, but that won't work in Python 3. – juanpa.arrivillaga May 12 '18 at 05:25
  • @juanpa.arrivillaga The lowest voted answer there does. As the Python 3 question is a subset of the Python question, should they be merged? – user202729 May 12 '18 at 05:27
  • So, the solution in Python 3 is: `import codecs; print(codecs.escape_decode(s)[0].decode())` – juanpa.arrivillaga May 12 '18 at 05:30
  • Given that `codecs.escape_decode` [is undocumented](https://bugs.python.org/issue25270) and there's no reason to believe it couldn't be removed, this may be a place where the use of `eval`/`ast.literal_eval` would make sense – juanpa.arrivillaga May 12 '18 at 05:33
  • Thanks @juanpa.arrivillaga I was concerned about eval because of the security issues, but the codecs library works. – Matt May 12 '18 at 05:37
  • @Matt `import ast; ast.literal_eval` should be safe. Again, `codecs.escape_decode` is undocumented, and the docs in CPython mention it might be removed. So, if this is a one-time thing then go ahead, but I would be wary of putting it into production – juanpa.arrivillaga May 12 '18 at 05:39
  • Noted, thanks again. – Matt May 12 '18 at 05:43
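
The approaches suggested in the comments can be sketched roughly as follows. This is a minimal sketch, assuming `s` holds only ASCII text with literal `\xNN` escapes and contains no double quotes or raw newlines (the `ast.literal_eval` variant wraps it in a `b"..."` literal, so those would break it). The third variant, a round trip through the documented `unicode_escape` and `latin-1` codecs, is not from the thread; it is included only as a possible alternative to the undocumented `codecs.escape_decode`.

```python
import ast
import codecs

# A str whose content is literal backslash escapes (raw string), as in the question.
s = r"The message - \xe3\x83\xa9\xe3\x82\xa4 ... that was the message"

# Variant 1 (from the comments): codecs.escape_decode.
# Undocumented CPython helper; returns a (bytes, length_consumed) tuple.
raw_bytes, _ = codecs.escape_decode(s.encode("utf-8"))
decoded_1 = raw_bytes.decode("utf-8")

# Variant 2 (from the comments): ast.literal_eval on a bytes literal.
# Assumption: s is ASCII-only and has no double quotes or raw newlines.
decoded_2 = ast.literal_eval('b"' + s + '"').decode("utf-8")

# Variant 3 (not from the thread): documented codecs only.
# unicode_escape turns each \xNN into the code point U+00NN;
# latin-1 then maps those code points back to single bytes.
decoded_3 = (
    s.encode("latin-1")
    .decode("unicode_escape")
    .encode("latin-1")
    .decode("utf-8")
)

print(decoded_1)
```

All three variants should yield the same decoded text for this input. For untrusted input, the caveats from the comments still apply: avoid plain `eval`, and treat `codecs.escape_decode` as something that may change between Python versions.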

0 Answers