Printing strings with UTF-8 encoded characters, e.g.: "\u00c5\u009b\"

Question

I would like to print strings encoded like this one: "Cze\u00c5\u009b\u00c4\u0087" but I have no idea how. The example string should be printed as: "Cześć".

What I have tried is:

str = "Cze\u00c5\u009b\u00c4\u0087"
print(str) 
#gives: CzeÅÄ

str_bytes = str.encode("unicode_escape")
print(str_bytes) 
#gives: b'Cze\\xc5\\x9b\\xc4\\x87'

str = str_bytes.decode("utf8")
print(str) 
#gives: Cze\xc5\x9b\xc4\x87

Whereas

print(b"Cze\xc5\x9b\xc4\x87".decode("utf8"))

gives "Cześć", but I don't know how to transform the "Cze\xc5\x9b\xc4\x87" string to the b"Cze\xc5\x9b\xc4\x87" bytes.

I also know that the problem is the additional backslashes in the byte representation after encoding the original string with the "unicode_escape" codec, but I don't know how to get rid of them - str_bytes.replace(b'\\\\', b'\\') doesn't work.

Regarding your last point, `str_bytes = str_bytes.replace(b'\\\\', b'\\')` should fix that issue - you probably weren't assigning it back to a variable. — lhay86, Jul 11 '18 at 20:08
@Ihay86 Unfortunately it doesn't work. It returns the same list of bytes. — daniel, Jul 11 '18 at 20:28
BTW, don't use `str` as a variable name, since that shadows the built-in `str` type. — PM 2Ring, Jul 11 '18 at 20:31
The _real_ question is: Why do you have strings encoded like that? Ideally, they should be fixed upstream. You shouldn't have UTF-8 bytes encoded into a text string like that! Matias's answer works; another way to deal with this sort of mojibake is `s.encode('latin1').decode('utf8')`. — PM 2Ring, Jul 11 '18 at 20:34
@PM2Ring This is what you get if you download a copy of your Facebook information in .json format. — daniel, Jul 11 '18 at 20:43
@PM2Ring By using Facebook interface: [Accessing & Downloading Your Information](https://www.facebook.com/help/1701730696756992) — daniel, Jul 11 '18 at 20:58
Ok. This is a known issue, see [Facebook JSON badly encoded](https://stackoverflow.com/q/50008296/4014959). Martijn Pieters♦, who works at Facebook, has filed an internal bug report. — PM 2Ring, Jul 12 '18 at 14:26
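
A short sketch of why the `replace` call discussed in the comments changes nothing: the doubled backslashes appear only in the `repr` that `print` shows for a bytes object, while the object itself holds single backslashes, so a two-backslash pattern never matches. The variable names are illustrative and not taken from the question.

```python
text = "Cze\u00c5\u009b\u00c4\u0087"
escaped = text.encode("unicode_escape")

# print() shows the repr, in which every single backslash byte is written doubled.
print(escaped)

# Count matches for a pattern of ONE backslash byte (one per escaped character).
print(escaped.count(b"\\"))    # 4

# Count matches for a pattern of TWO consecutive backslash bytes.
print(escaped.count(b"\\\\"))  # 0

# With nothing to match, replace() returns bytes equal to the input.
print(escaped.replace(b"\\\\", b"\\") == escaped)  # True
```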

score 6 · Accepted Answer · answered Jul 11 '18 at 20:14

Use raw_unicode_escape:

text = 'Cze\u00c5\u009b\u00c4\u0087'
text_bytes = text.encode('raw_unicode_escape')
print(text_bytes.decode('utf8')) # outputs Cześć

Matias Cicero
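
For comparison, the `latin1` round trip that PM 2Ring suggests in the question comments can be sketched as below. The helper name `fix_mojibake` is hypothetical. One difference that seems worth knowing: `latin1` raises an error on text that contains characters above U+00FF (for example, text that is already correct), whereas `raw_unicode_escape` silently writes them out as literal `\uXXXX` sequences.

```python
def fix_mojibake(text: str) -> str:
    """Treat each code point in U+0000..U+00FF as one byte, then decode as UTF-8."""
    return text.encode("latin1").decode("utf8")


broken = "Cze\u00c5\u009b\u00c4\u0087"
fixed = fix_mojibake(broken)
print(fixed)

# Already-correct text contains U+015B and U+0107, which latin1 cannot encode.
correct = "Cze\u015b\u0107"
try:
    fix_mojibake(correct)
    latin1_rejected = False
except UnicodeEncodeError:
    latin1_rejected = True
print(latin1_rejected)

# raw_unicode_escape does not fail here; it emits backslash escapes instead,
# so the result contains literal "\u015b" text rather than the character.
silently_escaped = correct.encode("raw_unicode_escape").decode("utf8")
print(silently_escaped)
```

Whether a loud failure or a silent pass-through is preferable depends on how mixed the input is; the sketch only shows that the two codecs behave differently.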

This is JSON data; it should be decoded as JSON. It is also mojibake, of course, but `raw_unicode_escape` is not the right tool here and can cause issues if the input contains literal backslashes followed by known Python escape sequences (which JSON would have ignored). – Martijn Pieters Jul 12 '18 at 16:51
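
A minimal sketch of the approach this comment points to, assuming the text comes from a JSON export: parse it with the `json` module first, so that JSON's own escape rules apply, and only then repair the mojibake in the decoded strings. The sample document, its field names (`sender_name`, `path`) and the helper `repair` are made up for illustration and do not reflect the real export format.

```python
import json


def repair(value):
    """Recursively undo UTF-8-read-as-Latin-1 mojibake in decoded JSON data."""
    if isinstance(value, str):
        return value.encode("latin1").decode("utf8")
    if isinstance(value, list):
        return [repair(item) for item in value]
    if isinstance(value, dict):
        return {repair(key): repair(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


# Hypothetical JSON text; the raw string keeps the \u escapes for json.loads.
raw = r'{"sender_name": "Cze\u00c5\u009b\u00c4\u0087", "path": "C:\\temp"}'

data = repair(json.loads(raw))
print(data["sender_name"])
print(data["path"])  # the backslash survives; "\t" is not turned into a tab
```

As written, `repair` raises `UnicodeEncodeError` or `UnicodeDecodeError` on strings that are not mojibake of this kind, so real data may need per-string error handling.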
