
I have a string containing Unicode escape sequences, something like

"\\u0020dfhbafhfka\\u0022dahjfdsakj\\u005Dbty"

for example. I would like to encode the Unicode sequences so that the output starts with a space and reads: dfhbafhfka"dahjfdsakj]bty. That is, \\u0020 is encoded to a space, \\u0022 to a ", and \\u005D to a ]. If I run this:

print("\u0020".encode("UTF-8"))

I get a space, which is correct. But if I run this (where string holds the text above):

print(string[1:7].encode("UTF-8"))

I get this:

b'\\u0020'

I sliced off the first backslash because a Unicode escape sequence has only one backslash. Also, when reproducing this, make sure the string contains literal backslashes: if you type the string as a normal literal, Python treats the backslashes as escape characters. One way to do this is to set a variable to a backslash with chr(92) and insert it wherever a backslash appears. Any help is appreciated.
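To make the setup reproducible, here is a minimal sketch of how the string might be built. It assumes the data really holds two literal backslashes before each `u`, which is what the slice `string[1:7]` implies; the names `raw`, `built`, and `bs` are only illustrative.

```python
# A raw string keeps every backslash as a literal character.
raw = r"\\u0020dfhbafhfka\\u0022dahjfdsakj\\u005Dbty"

# Equivalent construction with chr(92), as described in the question.
bs = chr(92)
built = f"{bs}{bs}u0020dfhbafhfka{bs}{bs}u0022dahjfdsakj{bs}{bs}u005Dbty"
assert raw == built

# Dropping the first backslash leaves six characters: \u0020
fragment = raw[1:7]
encoded = fragment.encode("UTF-8")

# The bytes still spell out the escape sequence; repr() doubles the backslash.
print(encoded)  # b'\\u0020'
```

Encoding only converts the six characters to six bytes. It does not interpret them as an escape sequence, which is why no space appears.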

  • "I get this:" Good. From here, you can re-decode using the `unicode-escape` codec, as shown for example in the linked duplicate. (There may be better duplicate candidates, but the existing set of questions is a real mess to search through.) – Karl Knechtel May 02 '22 at 18:29
  • The trick is that you must *de*code in order to translate from Unicode escape sequences into the corresponding Unicode code points, but in 3.x decoding requires that you start with `bytes` (and decode into `str`, which is the kind of result you want). – Karl Knechtel May 02 '22 at 18:30
  • See also https://stackoverflow.com/questions/53362295/python-3-is-there-any-need-of-using-unicode-escape-encoding . – Karl Knechtel May 02 '22 at 18:32
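The approach suggested in the comments can be sketched roughly as follows: encode the text to `bytes`, then decode those bytes with the `unicode_escape` codec. The helper name `decode_escapes` is hypothetical, and the sketch assumes the input is pure ASCII with doubled literal backslashes, as in the question.

```python
def decode_escapes(text: str) -> str:
    """Replace literal backslash-u escape sequences with their characters."""
    # Collapse each doubled backslash so the sequences have a single backslash.
    single = text.replace("\\\\", "\\")
    # unicode_escape works on bytes, so go through ASCII bytes first.
    return single.encode("ascii").decode("unicode_escape")


string = r"\\u0020dfhbafhfka\\u0022dahjfdsakj\\u005Dbty"

# The six-character slice from the question now decodes to a single space.
print(string[1:7].encode("ascii").decode("unicode_escape") == " ")  # True

# The whole string, with every escape sequence resolved.
print(decode_escapes(string))  #  dfhbafhfka"dahjfdsakj]bty
```

Note that `encode("ascii")` raises `UnicodeEncodeError` if the text already contains non-ASCII characters, so this sketch only suits input like the example above.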
