So, I'll assume that you somehow get a raw ASCII string that contains escape sequences with UTF-16 code units that form surrogate pairs, and that you (for whatever reason) want to convert it to \UXXXXXXXX-format.
So, henceforth I assume that your input (bytes!) looks like this:
weirdInput = "hello \\ud83d\\ude04".encode("latin_1")
Now you want to do the following:
- Interpret the bytes in a way that \uXXXX thingies are transformed into UTF-16 code units. There is the raw_unicode_escape codec, but unfortunately it needs a separate pass to fix the surrogate pairs (I don't know why, to be honest)
- Fix the surrogate pairs, transform the data into valid UTF-16
- Decode as valid UTF-16
- Again, encode as "raw_unicode_escape"
- Decode back as good old latin_1, consisting only of good old ASCII with unicode escape sequences in format \UXXXXXXXX.
Something like this:
output = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
    .encode("raw_unicode_escape")
    .decode("latin_1")
)
Now if you print(output), you get:
hello \U0001f604
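To see what each stage of the chain produces, it can be unrolled into separate steps. This is only a sketch of the same calls with intermediate names (raw, with_surrogates, utf16_bytes, text, escaped) that I made up for illustration. It also hints at why the extra pass is needed: a Python str is a sequence of code points rather than UTF-16 code units, so raw_unicode_escape turns each \uXXXX into its own (lone surrogate) code point instead of joining the pair.

```python
# Same chain as above, unrolled so each intermediate value can be inspected.
# All variable names here are illustrative only.
raw = "hello \\ud83d\\ude04".encode("latin_1")  # ASCII bytes with literal \uXXXX escapes

# Stage 1: each \uXXXX escape becomes one code point, so we get two lone surrogates.
with_surrogates = raw.decode("raw_unicode_escape")
print(len(with_surrogates))  # 8: "hello " plus two surrogate code points

# Stage 2: 'surrogatepass' lets the encoder write the lone surrogates as UTF-16 code units.
utf16_bytes = with_surrogates.encode("utf-16", "surrogatepass")

# Stage 3: a normal UTF-16 decode reads the two adjacent code units as one surrogate pair.
text = utf16_bytes.decode("utf-16")
print(len(text))           # 7: "hello " plus a single code point
print(hex(ord(text[-1])))  # 0x1f604

# Stages 4 and 5: escape everything outside Latin-1 again, then go back to str.
escaped = text.encode("raw_unicode_escape").decode("latin_1")
print(escaped)             # hello \U0001f604
```

The drop in length from 8 to 7 is the surrogate pair being merged into a single code point.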
Note that if you stop at an intermediate stage:
smiley = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
)
then you get a unicode-string with smileys:
print(smiley)
# hello 😄
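If the UTF-16 round trip feels too magical, the "fix the surrogate pairs" step can also be written out by hand with the usual surrogate arithmetic: code point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00). The following is a minimal sketch of that alternative, not what the chain above does internally; combine_surrogates and _PAIR are hypothetical names introduced here.

```python
import re

# Hypothetical helper: merges each high/low surrogate pair into one code point.
# High surrogates are U+D800..U+DBFF, low surrogates are U+DC00..U+DFFF.
_PAIR = re.compile(r"([\ud800-\udbff])([\udc00-\udfff])")

def combine_surrogates(s):
    def join(match):
        high = ord(match.group(1)) - 0xD800
        low = ord(match.group(2)) - 0xDC00
        return chr(0x10000 + (high << 10) + low)
    return _PAIR.sub(join, s)

# A str containing two lone surrogates, as produced by raw_unicode_escape.
broken = b"hello \\ud83d\\ude04".decode("raw_unicode_escape")
fixed = combine_surrogates(broken)

# For this input the result matches the surrogatepass round trip.
print(fixed == broken.encode("utf-16", "surrogatepass").decode("utf-16"))  # True
print(fixed.encode("raw_unicode_escape").decode("latin_1"))  # hello \U0001f604
```

For the example input, 0xD83D and 0xDE04 give 0x10000 + (0x3D << 10) + 0x204 = 0x1F604. Note that this helper leaves unpaired surrogates in the string untouched.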
Full code:
weirdInput = "hello \\ud83d\\ude04".encode("latin_1")
output = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
    .encode("raw_unicode_escape")
    .decode("latin_1")
)
smiley = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
)
print(output)
# hello \U0001f604
print(smiley)
# hello 😄