Python 3.8: Escape non-ascii characters as unicode

Question

I have input and output text files which can contain non-ascii characters. Sometimes I need to escape them and sometimes I need to write the non-ascii characters. Basically if I get "Bürgerhaus" I need to output "B\u00FCrgerhaus". If I get "B\u00FCrgerhaus" I need to output "Bürgerhaus".

One direction goes fine:

>>> s1 = "B\u00FCrgerhaus"
>>> print(s1)
Bürgerhaus

However, in the other direction I do not get the expected result ('B\u00FCrgerhaus'):

>>> s2 = "Bürgerhaus"
>>> s2_trans = s2.encode('utf8').decode('unicode_escape')
>>> print(s2_trans)
BÃ¼rgerhaus

I read that unicode-escape needs latin-1, so I tried encoding to latin-1 instead, but this did not produce the expected result either. What am I doing wrong?

(PS: Thank you Matthias for reminding me that the conversion in the first example was not necessary.)

Your first example converts the string to UTF-8 and then converts it back to unicode. Of course the result will be the same. Try `print(s1)` and you'll get `Bürgerhaus`. — Matthias, Apr 13 '21 at 13:33
@Matthias I think what the OP is trying to achieve is to take a string containing a Unicode character, convert it to its escaped string representation, and then convert it back, i.e. back to the original representation with the code point in it — Chris Doyle, Apr 13 '21 at 13:35
I am also curious now how given a string `Bürgerhaus` you could have python print you the unicode escaped version `B\u00FCrgerhaus` — Chris Doyle, Apr 13 '21 at 13:44
This looks a lot like how JSON encodes strings. Are you sure you shouldn't really be using the `json` library, rather than relying on brittle escaping operations? — lenz, Apr 13 '21 at 16:25
You have the encode and decode steps the wrong way around: you start with a *string*, and you want to convert the *character* into the *escape sequence* - that is *encoding*, not decoding. So we *start* with `.encode('unicode-escape')`, and then convert the escaped ASCII bytes back into a string with `.decode('ascii')` (`'utf-8'` will also work, as will `'latin-1'`; all of these are "ASCII-transparent"). Voting to close as a typo. Note, however, that this will use `\x` style escapes for Unicode code points 255 and below, thus, `B\xfcrgerhaus`. — Karl Knechtel, Aug 05 '22 at 02:33
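
The last comment above describes encoding first and decoding second. A minimal sketch of that suggestion, using only the built-in `unicode_escape` codec (the variable names `escaped` and `restored` are illustrative and not from the thread):

```python
s2 = "Bürgerhaus"

# str -> escaped ASCII bytes -> str
# encode() turns the character into an escape sequence; decode('ascii')
# only converts the resulting pure-ASCII bytes back into a str.
escaped = s2.encode("unicode_escape").decode("ascii")
print(escaped)   # B\xfcrgerhaus

# The reverse direction: str -> ASCII bytes -> unescaped str
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored)  # Bürgerhaus
```

As the comment notes, the codec writes code points up to 255 as `\x` escapes, so this gives `B\xfcrgerhaus` rather than the `B\u00FCrgerhaus` asked for in the question.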

Maurice Meyer · Accepted Answer · 2021-04-13T15:32:50.667

You could do something like this:

char_list = []
s1 = "Bürgerhaus"

for i in [ord(x) for x in s1]:
    # Keep ASCII characters as they are; write every other character as a
    # \uXXXX escape built from its code point in hexadecimal.
    if i < 128:  # ASCII range
        char_list.append(chr(i))
    else:
        char_list.append('\\u%04x' % i)

res = ''.join(char_list)
print(f"Mixed up string: {res}")

# Decode only the strings that actually contain an escape sequence.
for my_str in (res, s1):
    if '\\u' in my_str:
        print(my_str.encode().decode('unicode-escape'))
    else:
        print(my_str)

Out:

Mixed up string: B\u00fcrgerhaus
Bürgerhaus
Bürgerhaus

Explanation:

We are going to convert each character to its corresponding Unicode code point.

print([(c, ord(c)) for c in s1])
[('B', 66), ('ü', 252), ('r', 114), ('g', 103), ('e', 101), ('r', 114), ('h', 104), ('a', 97), ('u', 117), ('s', 115)]

Regular ASCII characters have code points below 128. Other characters, such as the euro sign or German umlauts, have code points of 128 or above (detailed table here).

Now, we are going to replace every character with a code point >= 128 with its corresponding `\uXXXX` escape sequence.
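
To make the `'\\u%04x' % i` expression from the code above more concrete, here is a small step-by-step sketch for a single character. The variable names are illustrative only:

```python
code_point = ord("ü")
print(code_point)                  # 252

# %04x formats the integer as lowercase hex, zero-padded to 4 digits.
hex_digits = "%04x" % code_point
print(hex_digits)                  # 00fc

# '\\u' is a literal backslash followed by 'u' (two characters).
escape = "\\u" + hex_digits
print(escape)                      # \u00fc
print(len(escape))                 # 6

# %04X gives uppercase digits, matching the form used in the question.
print("\\u%04X" % code_point)      # \u00FC
```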

It works, even though I do not fully understand it. Could you explain what the passage after `if i < 128...` is doing? I understand it's a calculation of some hex value. — Alv123, Apr 13 '21 at 15:15
This will break for characters with a codepoint above U+FFFF. Try it with an emoji: the hex code has more than four digits, but inverting the operation will strip those off, resulting in garbled text. — lenz, Apr 13 '21 at 16:26
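
One possible way to address the limitation raised in the last comment is to switch to the eight-digit `\UXXXXXXXX` form for code points above U+FFFF, which the `unicode_escape` codec also understands. This is a sketch, not the answer author's code; the function names `escape_non_ascii` and `unescape` are hypothetical:

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a \\u or \\U escape."""
    parts = []
    for char in text:
        cp = ord(char)
        if cp < 128:
            parts.append(char)             # plain ASCII stays as it is
        elif cp <= 0xFFFF:
            parts.append("\\u%04x" % cp)   # four hex digits are enough
        else:
            parts.append("\\U%08x" % cp)   # beyond U+FFFF: eight hex digits
    return "".join(parts)


def unescape(text: str) -> str:
    """Inverse operation; expects an ASCII-only escaped string."""
    return text.encode("ascii").decode("unicode_escape")


sample = "Bürgerhaus \U0001F600"           # U+1F600 is an emoji
escaped = escape_non_ascii(sample)
print(escaped)                             # B\u00fcrgerhaus \U0001f600
print(unescape(escaped) == sample)         # True
```

Note that this sketch does not escape backslashes already present in the input, so text containing sequences such as a literal `\n` will not round-trip unchanged.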

score 0 · Answer 2 · answered Apr 13 '21 at 14:10

You can only decode() bytestrings (bytes) to [unicode] strings, and conversely, encode() [unicode] strings to bytes.

So if you want to decode a string escaped with unicode-escape, you need to first convert (encode()) it to a bytestring, e.g., using latin1 as you wrote in the question.

>>> encoded_str = 'B\\xfcrgerhaus'
>>> encoded = encoded_str.encode('latin-1')
>>> encoded
b'B\\xfcrgerhaus'
>>> encoded.decode('unicode-escape')
'Bürgerhaus'
>>> _.encode('unicode-escape')
b'B\\xfcrgerhaus'
>>> _ == encoded
True
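
The choice of `latin-1` matters because the `unicode_escape` decoder interprets every byte that is not part of an escape sequence as Latin-1. The following sketch reproduces the garbled output from the question and contrasts it with the Latin-1 route. The variable names are illustrative:

```python
s2 = "Bürgerhaus"

# UTF-8 stores 'ü' as two bytes (0xC3 0xBC). The unicode_escape decoder
# reads them as two separate Latin-1 characters, which garbles the text.
utf8_bytes = s2.encode("utf-8")
mojibake = utf8_bytes.decode("unicode_escape")
print(mojibake)                                   # BÃ¼rgerhaus
print(mojibake == utf8_bytes.decode("latin-1"))   # True

# Latin-1 stores 'ü' as the single byte 0xFC, so it survives the decode.
intact = s2.encode("latin-1").decode("unicode_escape")
print(intact)                                     # Bürgerhaus
```

Neither call produces escape sequences; decoding with `unicode_escape` only ever turns escapes into characters, never the other way round.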

See also: how do I .decode('string-escape') in Python3?
