I would like to print strings encoded like this one: "Cze\u00c5\u009b\u00c4\u0087" but I have no idea how. The example string should be printed as: "Cześć".

What I have tried is:

str = "Cze\u00c5\u009b\u00c4\u0087"
print(str) 
#gives: CzeÅÄ

str_bytes = str.encode("unicode_escape")
print(str_bytes) 
#gives: b'Cze\\xc5\\x9b\\xc4\\x87'

str = str_bytes.decode("utf8")
print(str) 
#gives: Cze\xc5\x9b\xc4\x87

However,

print(b"Cze\xc5\x9b\xc4\x87".decode("utf8"))

gives "Cześć", but I don't know how to transform the "Cze\xc5\x9b\xc4\x87" string to the b"Cze\xc5\x9b\xc4\x87" bytes.

I also know that the problem is the additional backslashes in the byte representation produced by encoding the original string with "unicode_escape", but I don't know how to get rid of them - str_bytes.replace(b'\\\\', b'\\') doesn't work.

sniperd
daniel
  • Regarding your last point, `str_bytes = str_bytes.replace(b'\\\\', b'\\')` should fix that issue - you probably weren't assigning it back to a variable. – lhay86 Jul 11 '18 at 20:08
  • @Ihay86 Unfortunately it doesn't work. It returns the same list of bytes. – daniel Jul 11 '18 at 20:28
  • BTW, don't use `str` as a variable name, since that shadows the built-in `str` type. – PM 2Ring Jul 11 '18 at 20:31
  • The _real_ question is: Why do you have strings encoded like that? Ideally, they should be fixed upstream. You shouldn't have UTF-8 bytes encoded into a text string like that! Matias's answer works, another way to deal with this sort of mojibake is `s.encode('latin1').decode('utf8')`. – PM 2Ring Jul 11 '18 at 20:34
  • @PM2Ring This is what you get if you download a copy of your Facebook information in .json format. – daniel Jul 11 '18 at 20:43
  • @PM2Ring By using Facebook interface: [Accessing & Downloading Your Information](https://www.facebook.com/help/1701730696756992) – daniel Jul 11 '18 at 20:58
  • Ok. This is a known issue, see [Facebook JSON badly encoded](https://stackoverflow.com/q/50008296/4014959). Martijn Pieters♦, who works at Facebook, has filed an internal bug report. – PM 2Ring Jul 12 '18 at 14:26
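
The `latin1` round trip mentioned in the comments can be sketched as below. This is only a minimal illustration, not a definitive fix: the helper name `fix_mojibake` is hypothetical, and it assumes that a string which cannot be encoded as Latin-1, or whose bytes are not valid UTF-8, was never mis-decoded in the first place and should be returned unchanged.

```python
def fix_mojibake(s):
    """Hypothetical helper: reinterpret a str whose characters are really UTF-8 bytes."""
    try:
        # Each character below U+0100 maps to one byte under Latin-1;
        # those bytes are then decoded as the UTF-8 they originally were.
        return s.encode("latin1").decode("utf8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Assumption: the text is not this kind of mojibake, so leave it alone.
        return s


broken = "Cze\u00c5\u009b\u00c4\u0087"
print(fix_mojibake(broken))
```

Falling back to the input on error is a design choice for mixed data; raising instead may be preferable when every string is expected to be damaged in the same way.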

1 Answer

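Encode with the `raw_unicode_escape` codec, which turns each code point below U+0100 back into the single byte it stands for, then decode those bytes as UTF-8: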

text = 'Cze\u00c5\u009b\u00c4\u0087'
text_bytes = text.encode('raw_unicode_escape')
print(text_bytes.decode('utf8')) # outputs Cześć
Matias Cicero
  • This is JSON data, they should be decoding it as JSON. It is also a mojibake, of course, but `raw_unicode_escape` is not the right tool here and can cause issues if there are literal backslashes in the input followed by known Python escape sequences (but which JSON would have ignored). – Martijn Pieters Jul 12 '18 at 16:51
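
Following the last comment, one way to approach this is to let the `json` module handle the `\uXXXX` escapes first and only then undo the mojibake in the decoded strings. The sketch below is an assumption-laden illustration: the helper `repair` and the sample document are made up, and it assumes every string in the data was damaged the same way (a string containing a character above U+00FF would raise `UnicodeEncodeError`).

```python
import json


def repair(value):
    """Hypothetical helper: walk decoded JSON and undo the Latin-1/UTF-8 mix-up."""
    if isinstance(value, str):
        return value.encode("latin1").decode("utf8")
    if isinstance(value, list):
        return [repair(item) for item in value]
    if isinstance(value, dict):
        return {repair(key): repair(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through untouched


# Made-up sample of raw JSON text; "path" holds a literal backslash.
raw = r'{"greeting": "Cze\u00c5\u009b\u00c4\u0087", "path": "C:\\x41"}'
data = repair(json.loads(raw))
print(data["greeting"])
print(data["path"])
```

Because the JSON parser resolves the escapes, the backslash in `path` stays a plain backslash and is not reinterpreted as a Python escape sequence.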