How to escape unicode special chars in string and write it to UTF encoded file

Question

I want to take a string like:

Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.

convert it to:

'Bitte \u00FCberpr\u00FCfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und l\u00F6schen Sie dann die tats\u00E4chlichen Dokumente.'

and write it in this form to a file (which is UTF-8 encoded).

repr() produces the same string - did you have something particular in mind? – PiWo, Jul 15 '21 at 10:47

Jiří Baum · Accepted Answer · 2021-07-15T13:01:35.700

Another solution, not relying on the built-in repr() but rather implementing it from scratch:

import re

orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

enc = re.sub('[^ -~]', lambda m: '\\u%04X' % ord(m[0]), orig)

print(enc)

Differences:

- Encodes only using \u, never any other sequence, whereas repr() uses about a third of the alphabet (so for example the BEL character will be encoded as \u0007 rather than \a)
- Upper-case encoding, as specified (\u00FC rather than \u00fc)
- Does not handle Unicode characters outside plane 0: they come out as \u followed by five or six hex digits, which is ambiguous (could be extended easily, given a spec for how those should be represented)
- Does not take care of any pre-existing \u sequences, whereas repr() turns those into \\u; could be extended, perhaps to encode \ as \u005C:
```
enc = re.sub(r'[^ -[\]-~]', lambda m: '\\u%04X' % ord(m[0]), orig)
```
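The question also asks for the result to be written to a UTF-8 file, and the points above leave characters outside plane 0 open. Here is a minimal sketch covering both. It assumes the consuming program accepts UTF-16 surrogate pairs for such characters (the convention used by JSON and Java); that is a guess, since no spec was given. The helper name `escape_non_ascii` and the output path are made up for illustration.

```python
import os
import re
import tempfile


def escape_non_ascii(text):
    """Escape the backslash and everything outside printable ASCII."""
    def repl(match):
        code = ord(match[0])
        if code <= 0xFFFF:
            return '\\u%04X' % code
        # Assumption: characters above U+FFFF become a UTF-16 surrogate pair.
        code -= 0x10000
        return '\\u%04X\\u%04X' % (0xD800 + (code >> 10), 0xDC00 + (code & 0x3FF))

    return re.sub(r'[^ -[\]-~]', repl, text)


escaped = escape_non_ascii('l\u00f6schen \U0001F600')
print(escaped)

# Hypothetical output location. The escaped text is pure ASCII, so the
# UTF-8 file contains exactly the characters printed above.
path = os.path.join(tempfile.gettempdir(), 'escaped.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(escaped)
```

Because the escaped string contains only ASCII, any ASCII-compatible encoding would give the same bytes; `encoding='utf-8'` is spelled out only so the result does not depend on the platform default.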

This is a very clean solution. Thank you :) Exactly what I needed. It actually converts some characters that the "ascii" method did not, e.g. „” – PiWo, Jul 16 '21 at 07:27

S.B · Answer 2 · 2021-07-15T16:13:12.830

A simple solution would be ascii():

string = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System ' \
         'eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

print(ascii(string))

Output:

'Bitte \xfcberpr\xfcfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und l\xf6schen Sie dann die tats\xe4chlichen Dokumente.'

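You can also use the unicode-escape and raw-unicode-escape codecs to achieve this. Note that decoding with raw-unicode-escape turns any \uXXXX or \UXXXXXXXX sequence back into the real character, so this round trip only keeps the escapes for characters below U+0100, which are written as \xhh: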

string = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System ' \
         'eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

print(string.encode('unicode-escape').decode('raw-unicode-escape'))

Output:

Bitte \xfcberpr\xfcfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und l\xf6schen Sie dann die tats\xe4chlichen Dokumente.

Note: ascii() escapes non-ASCII characters as \xhh, \uhhhh or \Uhhhhhhhh depending on the code point, not on any byte length: \x for U+0080 to U+00FF, \u for U+0100 to U+FFFF, and \U above that. In your case you see \x. But try this one:

print(ascii('س'))  # '\u0633'
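A short illustration of all three escape forms, one sample character per range. The form ascii() picks is determined by the code point: \x below U+0100, \u up to U+FFFF, \U above that.

```python
# One sample per range: Latin-1, rest of the BMP, and beyond the BMP.
samples = ['\u00fc', '\u0633', '\U0001F600']

forms = [ascii(ch) for ch in samples]
for ch, form in zip(samples, forms):
    print('U+%04X' % ord(ch), form)
```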

If you really want to convert \xhh escape sequences to \u00hh, use re.sub() on the result of ascii():

import re
print(re.sub(r'\\x[a-f0-9]{2}', lambda x: r'\u00' + x.group()[-2:].upper(), ascii(string)))

Output:

'Bitte \u00FCberpr\u00FCfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und l\u00F6schen Sie dann die tats\u00E4chlichen Dokumente.'
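One thing to watch with this substitution: ascii() writes a literal backslash as \\, so a string that already contains the text \x41 comes out as \\x41, and the pattern above still matches the second backslash. A possible guard, sketched here with a made-up helper name `x_to_u`, is to let the regex consume escaped backslashes as a unit before it looks for \xhh:

```python
import re


def x_to_u(text):
    """Return ascii(text) with every hex escape rewritten in the 00HH u-form."""
    def repl(match):
        if match[1] is None:
            # The first alternative matched an escaped backslash: keep it.
            return match[0]
        return '\\u00' + match[1].upper()

    # The two-backslash alternative is tried first, so a backslash that
    # was already in the input is never mistaken for the start of \xhh.
    return re.sub(r'\\\\|\\x([0-9a-f]{2})', repl, ascii(text))


print(x_to_u('C:\\x41 \u00fc'))
```

Like the original, this leaves the \u and \U sequences that ascii() produces in lower case and keeps the surrounding quotes.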

The approaches above work for escaping any non-ASCII characters. If you intend to escape just those three German letters and there are no other non-ASCII characters, take a look at the str.translate() method.
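A minimal sketch of the str.translate() route, assuming the input never contains non-ASCII characters other than the ones listed. The table name `GERMAN_ESCAPES` and the exact set of letters are illustrative choices, not something the question specifies.

```python
# Hypothetical fixed set of letters expected in the input.
GERMAN_LETTERS = 'äöüÄÖÜß'

# Map each letter to its upper-case \uXXXX escape.
GERMAN_ESCAPES = str.maketrans({ch: '\\u%04X' % ord(ch) for ch in GERMAN_LETTERS})

string = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich eingereicht wurden.'
print(string.translate(GERMAN_ESCAPES))
```

Any character missing from the table passes through unchanged, so this only fits when the alphabet is known in advance.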

edited Jul 15 '21 at 16:13

answered Jul 15 '21 at 10:31

Hi, thanks for the answer! While it works, it produces "\x" sequences instead of "\u" sequences, which will be a problem later in the process (the file will be processed by a different program/programming language). Any idea how to force "\u"? E.g. 'ü' == '\u00FC' == '\xfc' – PiWo Jul 15 '21 at 10:45
@PiWo honestly I don't know when `ascii()` generates \x and when \u. In the last case, `print(ascii('س'))`, we see \u. If you use the encodings I mentioned in the answer, do you still have the problem? Doesn't it convert back to characters properly? – S.B Jul 15 '21 at 10:49
The program that reads the final product must have a hard-coded conversion table for Unicode characters (it was written in Cobol, I have no sources) and it expects \u notation. [https://stackoverflow.com/a/46132950/1216005](https://stackoverflow.com/a/46132950/1216005) explains, more or less, that it's just a shorter notation - I'll have to figure something out. – PiWo Jul 15 '21 at 10:58
@PiWo I've added a regex solution to convert those `\xhh`s to `\u00hh`. Now it gives your expected result, but I'm not sure if this is the best solution. – S.B Jul 15 '21 at 11:48
I did something similar with .replace('\x', '\u00'). It's not clean but it does the job for now, as it is a unique problem for this particular situation. Thank you! – PiWo Jul 15 '21 at 11:53
The only caveat with this would be if the string originally contains `\x` which would be encoded as `\\x` but should _not_ be converted to `\\u00` – Jiří Baum Jul 15 '21 at 12:17
Also, the question asked for upper-case, `\u00FC` rather than `\u00fc`; not sure if that matters – Jiří Baum Jul 15 '21 at 12:40
(Remember to also update the example output for the updated code.) – Jiří Baum Jul 15 '21 at 15:51
With the replacement getting more complex, it slowly converges with the approach of skipping `ascii()` and doing a `re.sub()` on the original string... – Jiří Baum Jul 15 '21 at 15:54
@sabik lol. Or even better, `str.translate()`, if he just needs this escaping for the German letters. – S.B Jul 15 '21 at 16:04
