1

I have a string like:

s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

I need to get the corresponding bytes object from that string (to pass to pickle.loads):

s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

A solution using s_new: bytes = bytes(s_str, encoding="raw_unicode_escape") was posted elsewhere, but it does not work for me. I get an incorrect result, b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04', which shows two backslashes (each pair representing one literal backslash) where the result I want has a single escape sequence.

Similar solutions proposed elsewhere do not work for me either: I end up with the double backslashes again. Why does this occur? How do I get the bytes result I want?

Karl Knechtel
  • Besides the solutions in the answers, [this answer](https://stackoverflow.com/a/49990817/16153744) also works. – Tereso del Río Almajano Sep 07 '21 at 11:53
  • What does "byte literal of that unicode" mean? Unicode has only code points and no byte representation (it is abstract). It also defines a few encodings, so you should specify which encoding you mean. Note: your initial string is already problematic: what do you mean by `\x`? What do you mean by `\xc0`? `\x` should not be used in Unicode strings (only in encoded strings or binary data). For Unicode, just use code points (`\u` and `\U`). I think your main problem is that you are mixing too many concepts (in a non-recommended way), so it is easy to get it wrong. – Giacomo Catenazzi Sep 07 '21 at 12:27
  • It is not possible to get `s_not_bytes` (the result of `s_new`) from `s_str` as you have shown. `print(repr(s_str))` and post that. – Mark Tolonen Sep 07 '21 at 16:19
  • The `"raw-unicode-escape"` encoding is what you want for the problem you described, and works for the input you show. Based on the answer that was given, and the symptoms described, the diagnosis is that `s_str` *actually contains* the backslashes. I edited the question to reflect that. I assume that's what you were trying to get at by talking about "raw Unicode"; but none of that part actually described it properly. – Karl Knechtel Aug 06 '22 at 00:49
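
To make the diagnosis in the last comment concrete, here is a minimal sketch (the variable names `interpreted` and `literal` are made up for illustration). It applies the same `raw_unicode_escape` call to both kinds of string and shows that only the one containing literal backslashes produces the doubled-backslash repr:

```python
# Nine actual code points: the escape sequences are interpreted
# when Python compiles the string literal.
interpreted = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# Thirty-six characters: a literal backslash, 'x', and two hex digits,
# repeated nine times. The r prefix stops the escapes being interpreted.
literal = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

from_interpreted = bytes(interpreted, encoding="raw_unicode_escape")
from_literal = bytes(literal, encoding="raw_unicode_escape")

print(len(from_interpreted), from_interpreted)  # 9 bytes, the desired value
print(len(from_literal), from_literal)          # 36 bytes, backslashes kept
```

If the second form is what you have, the codec is doing its job; the string simply does not contain what you assumed.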

3 Answers

1

You do not have a string of 9 code points written with escape codes, as shown below, or you wouldn't get the double-backslash result:

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

You have literal escape codes (length 36). Note the r prefix for a raw string literal, which prevents the escape codes from being interpreted:

s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

Note the difference. \\ is an escape code indicating a literal, single backslash:

>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36

The following gets the desired byte string in three steps. First, it converts each code point to a byte using the latin1 codec, which maps 1:1 between the first 256 code points (U+0000 to U+00FF) and the byte values 0x00 to 0xFF. Next, it decodes the literal escape codes with unicode_escape, which produces a Unicode string again. Finally, it encodes with latin1 once more to convert 1:1 back to bytes:

s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)

Output:

b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
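
If the chained call is hard to follow, it can be split into its three steps. This is only a sketch of the same one-liner; the names `step1` to `step3` are made up for illustration:

```python
s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# Step 1: str -> bytes, one byte per character. Still 36 bytes, because
# the string holds literal backslashes, 'x' characters and hex digits.
step1 = s_str.encode("latin1")

# Step 2: bytes -> str, interpreting the escape sequences. This yields
# a 9-character string with code points in the range U+0000 to U+00FF.
step2 = step1.decode("unicode_escape")

# Step 3: str -> bytes again, mapping each code point to the byte
# with the same value.
step3 = step2.encode("latin1")

print(len(step1), len(step2), len(step3))
print(step3)
```

Note that step 3 raises UnicodeEncodeError if the text contains escapes for code points above U+00FF (for example `\u20ac`), since latin1 cannot represent them.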

If s_str really did hold the 9 code points (a literal without the r prefix), a simple .encode('latin1') would convert it:

>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
Mark Tolonen
  • Thanks, this solves the issue. I was reading this from a file using `open(file,'r')` and I guess that creates a raw string. – Tereso del Río Almajano Sep 08 '21 at 09:44
  • And is there a way of reading from a file containing (raw text) either `b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'` or `\x00\x01\x00\xc0\x01\x00\x00\x00\x04` so that it will be treated directly as a byte string of length 9? – Tereso del Río Almajano Sep 08 '21 at 09:53
  • @TeresodelRíoAlmajano Reading a file doesn't create a raw string. Raw strings are a way of writing string literals in code without interpreting escape codes. Your file had text that looks like escape codes. You can `open(file,encoding='unicode_escape')` if needed, but it would be better to post an actual sample of the file in case there is a better solution. – Mark Tolonen Sep 08 '21 at 11:45
  • "and I guess that creates a raw string" It does not "create a raw string". There is not such a thing as a "raw string". However, reading a file into a string does mean that the string contains what the file actually contains - if there's a backslash followed by a lowercase n, then it's a backslash followed by a lowercase n, **not** a newline. Escape sequences **only** apply to *string literals in your source code, unless you explicitly do something to interpret them*. They apply *before the code runs*. – Karl Knechtel Aug 06 '22 at 00:53
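
For the follow-up question about reading either form from a file, one possible approach is sketched below. The helper name `parse_escaped_bytes` and the file names are hypothetical, and the sketch assumes the file holds exactly one value and nothing else. It uses `ast.literal_eval` for text that looks like a bytes literal and the latin1/unicode_escape round trip for bare escape text:

```python
import ast
import tempfile
from pathlib import Path


def parse_escaped_bytes(text: str) -> bytes:
    """Hypothetical helper: turn escape-code text into a bytes object."""
    text = text.strip()
    if text.startswith(("b'", 'b"')):
        # Text such as b'\x00\x01': let Python parse it as a literal.
        value = ast.literal_eval(text)
        if not isinstance(value, bytes):
            raise ValueError("expected a bytes literal")
        return value
    # Bare text such as \x00\x01: interpret the escapes, then map each
    # code point (assumed to be at most U+00FF) to one byte.
    return text.encode("latin1").decode("unicode_escape").encode("latin1")


with tempfile.TemporaryDirectory() as tmp:
    with_prefix = Path(tmp) / "with_prefix.txt"
    bare = Path(tmp) / "bare.txt"
    with_prefix.write_text(r"b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'" + "\n", encoding="ascii")
    bare.write_text(r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04" + "\n", encoding="ascii")

    result_prefix = parse_escaped_bytes(with_prefix.read_text(encoding="ascii"))
    result_bare = parse_escaped_bytes(bare.read_text(encoding="ascii"))

print(len(result_prefix), result_prefix)
print(len(result_bare), result_bare)
```

`ast.literal_eval` only evaluates literals, so it is a much safer choice than `eval` for text that comes from a file.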
0

I was about to post the question when I encountered a valid solution almost by chance. The combination that works for me is:

s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")

As I said, I have no idea why this works, so feel free to explain it if you know why.
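
A possible explanation, offered as a sketch rather than a definitive account: the `.decode('unicode-escape')` step is what interprets the escape codes, and the final encoding step only has to map each resulting character back to a byte. The "oem" codec exists only on Windows and uses the machine's OEM code page, so whether it gives the right byte for characters above U+007F depends on that code page. The example below uses cp437 as a stand-in for an OEM code page (an assumption; yours may differ) and compares it with latin1:

```python
s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# This step interprets the escape codes and yields 9 characters.
decoded = s_str.encode("utf-8").decode("unicode-escape")

# latin1 maps code point N to byte N for every N up to 255.
via_latin1 = decoded.encode("latin1")

# cp437 stands in for an OEM code page here (assumption). It agrees with
# latin1 only for ASCII; other characters may be missing or may map to a
# different byte value.
try:
    via_cp437 = decoded.encode("cp437")
except UnicodeEncodeError:
    via_cp437 = None

print(via_latin1)
print(via_cp437)

# The same character, two different bytes:
print("\xe9".encode("latin1"), "\xe9".encode("cp437"))
```

So the combination may appear to work when the data is mostly in the ASCII range, but latin1 is the safer final step if the goal is a byte-for-byte mapping.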

0

You might simply use .encode("utf-8"), though note that it encodes code points above U+007F as more than one byte, i.e.:

s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)

output

b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'
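
Comparing this with the bytes the question asks for: the output above has 10 bytes, because U+00C0 becomes the two-byte sequence `\xc3\x80` in UTF-8. A short sketch of the difference against latin1 (the names `as_utf8` and `as_latin1` are made up for illustration):

```python
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

as_utf8 = s_1.encode("utf-8")     # U+00C0 becomes two bytes: c3 80
as_latin1 = s_1.encode("latin1")  # U+00C0 becomes one byte: c0

print(len(as_utf8), as_utf8)
print(len(as_latin1), as_latin1)
```

The two encodings agree only while every code point is below U+0080.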
Daweo