
I'm trying to replace escaped Unicode characters with the actual characters:

string = "\\u00c3\\u00a4"
print(string.encode().decode("unicode-escape"))

The expected output is ä, the actual output is Ã¤.

Andrey Tyukin
Toast
  • Those don't look like escaped Unicode characters. It's more like someone took a Unicode string, encoded it as UTF-8, then treated it as a Unicode string again and encoded *that*. – melpomene Sep 22 '18 at 20:37
  • Can you suggest a way of reversing this process? – Toast Sep 22 '18 at 20:38
  • Sorry, I don't know Python. `string.encode("ascii").decode("unicode-escape").encode("latin-1").decode("utf-8")` seems to do something, but that's just guesswork. You should probably wait until someone shows up who knows what they're doing. – melpomene Sep 22 '18 at 20:39
  • That worked! If you want to post it as an answer, I'll accept it. Thank you! – Toast Sep 22 '18 at 20:40
  • It looks a little bit like an XY-problem. [The previous question](https://stackoverflow.com/questions/52457095/convert-unicode-escape-to-hebrew-text) in the [unicode] tag shows exactly the same kind of broken text. Could you maybe share where you got this broken text in the first place? – Andrey Tyukin Sep 22 '18 at 21:47
  • @AndreyTyukin I found the text inside a Facebook data takeout archive. https://www.facebook.com/help/1701730696756992 – Toast Sep 22 '18 at 23:34
  • Then you are already the second person today with the same encoding problem in the Facebook JSON data. That's strange... Ah! Then it seems that your question is actually an XY-problem-wise duplicate of this: [Facebook JSON badly encoded](https://stackoverflow.com/questions/50008296/facebook-json-badly-encoded). Martijn Pieters also confirms that it looks like mojibake. – Andrey Tyukin Sep 22 '18 at 23:43
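The one-liner guessed in the comments can be checked directly against the sample string from the question (a minimal sketch):

```python
# melpomene's guessed chain, applied to the question's sample string.
s = "\\u00c3\\u00a4"
print(s.encode("ascii").decode("unicode-escape").encode("latin-1").decode("utf-8"))  # ä
```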

3 Answers


The following solution seems to work in similar situations (see for example this case about decoding broken Hebrew text):

("\\u00c3\\u00a4"
  .encode('latin-1')
  .decode('unicode_escape')
  .encode('latin-1')
  .decode('utf-8')
)

Outputs:

'ä'

This works as follows:

  • The string that contains only ascii-characters '\', 'u', '0', '0', 'c', etc. is converted to bytes using some not-too-crazy 8-bit encoding (doesn't really matter which one, as long as it treats ASCII characters properly)
  • Use a decoder that interprets an escape like '\u00c3' as the Unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your data, this is nonsense, but these code points have the right byte representation when encoded again with ISO-8859-1/'latin-1', so...
  • encode it again with 'latin-1'
  • Decode it "properly" this time, as UTF-8
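The steps above can be spelled out with each intermediate value (a sketch using the string from the question):

```python
raw = "\\u00c3\\u00a4"                  # 12 ASCII characters: \ u 0 0 c 3 \ u 0 0 a 4
step1 = raw.encode("latin-1")           # b"\\u00c3\\u00a4" (same bytes, now a bytes object)
step2 = step1.decode("unicode_escape")  # "Ã¤" (code points U+00C3, U+00A4 -- the mojibake)
step3 = step2.encode("latin-1")         # b"\xc3\xa4" (the UTF-8 encoding of "ä")
step4 = step3.decode("utf-8")           # "ä"
print(step4)
```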

Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way. Not breaking it in the first place is better than breaking it and then repairing it again.

Andrey Tyukin

The codecs doc page states:

[screenshot of the Python codecs documentation's encodings table: the unicode_escape codec decodes from Latin-1]

That means that the output of "unicode-escape" will be Latin-1, even if the default for Python is UTF-8.
So you just need to encode back to latin1 and decode back to utf-8:

import codecs

mixed_string_to_be_unescaped = '\u002Fq:85\\u002FczM"},{\"name\":\"SantÃ©\",\"parent_name\":\"SantÃ©'

val = codecs.decode(mixed_string_to_be_unescaped, 'unicode-escape')
val = val.encode('latin1').decode('utf-8')
print(val)

/q:85/czM"},{"name":"Santé","parent_name":"Santé

The above solution works, but it was not clear to me at first: I didn't get why I should convert to latin-1 before the unicode_escape step (it turns out codecs.decode does this automatically), nor why unicode_escape was being applied to a string that already contains unescaped characters.
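That implicit conversion can be checked: passing a str to codecs.decode with the unicode_escape codec appears to give the same result as encoding to Latin-1 explicitly first (a minimal sketch; the sample string is a hypothetical mojibake-escaped "é"):

```python
import codecs

# The UTF-8 bytes of "é" (0xC3 0xA9), each written out as a \u00xx escape.
s = "Sant\\u00c3\\u00a9"

# codecs.decode accepts a str here and treats its characters as Latin-1 bytes...
a = codecs.decode(s, "unicode_escape")

# ...so it matches encoding to Latin-1 explicitly before decoding.
b = s.encode("latin-1").decode("unicode_escape")
assert a == b

# The usual second round trip then recovers the real text.
print(a.encode("latin-1").decode("utf-8"))  # Santé
```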


I've spent a good few minutes trying to understand this, so I'm sharing it here for potential future readers.

This is one of the promoted questions about decoding escaped Unicode characters, but it describes a very special situation. The original string here was created in a strange way, probably by encoding and decoding several times. The final output is just one character, which has the Unicode code point U+00E4. If it were stored in the file as '\u00E4', it could be converted using "\\u00E4".encode('latin-1').decode('unicode_escape')

But here it's the UTF-8 encoding of that code point - 2 bytes - and these two bytes are represented as a sequence of escaped Unicode characters.
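The contrast between the two cases can be shown side by side (a minimal sketch using the character from this thread):

```python
# Case 1: the file stores a direct escape of the code point U+00E4 ("ä").
# One decode pass is enough.
direct = "\\u00e4"
print(direct.encode("latin-1").decode("unicode_escape"))  # ä

# Case 2 (this question): the file stores escapes of the two UTF-8 bytes
# of "ä" (0xC3 0xA4), so a second Latin-1/UTF-8 round trip is needed.
double = "\\u00c3\\u00a4"
step = double.encode("latin-1").decode("unicode_escape")  # "Ã¤" -- still mojibake
print(step.encode("latin-1").decode("utf-8"))             # ä
```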

MkL