
I want to convert a string like:

Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.

into:

'Bitte \u00FCberpr\u00FCfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und l\u00F6schen Sie dann die tats\u00E4chlichen Dokumente.'

and write it in this form to a file (which is UTF-8 encoded).

Mark Rotteveel
PiWo

2 Answers


Another solution, not relying on the built-in repr() but rather implementing it from scratch:

import re

orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

enc = re.sub('[^ -~]', lambda m: '\\u%04X' % ord(m[0]), orig)

print(enc)

Differences:

  • Encodes using only \u, never any other sequence, whereas repr() also emits \x, \U, and short escapes such as \n and \t (so, for example, the BEL character is encoded as \u0007 rather than \x07)
  • Upper-case encoding, as specified (\u00FC rather than \u00fc)
  • Does not correctly handle Unicode characters outside plane 0 (could be extended easily, given a spec for how those should be represented)
  • Does not treat pre-existing \u sequences specially, whereas repr() turns those into \\u; it could be extended, for example to encode \ as \u005C:
    enc = re.sub(r'[^ -[\]-~]', lambda m: '\\u%04X' % ord(m[0]), orig)
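The last two points can be combined into one sketch. It assumes, without any confirmation from the question, that the consuming program expects characters outside plane 0 as UTF-16 surrogate pairs (the convention used by Java and JSON); `escape_non_ascii` is a hypothetical name.

```python
import re

def escape_non_ascii(text):
    # Hypothetical helper: escape the backslash and every character outside
    # printable ASCII in upper-case backslash-u form.
    def repl(match):
        code = ord(match[0])
        if code <= 0xFFFF:
            return '\\u%04X' % code
        # Assumption: characters above U+FFFF become a UTF-16 surrogate pair.
        code -= 0x10000
        return '\\u%04X\\u%04X' % (0xD800 + (code >> 10), 0xDC00 + (code & 0x3FF))

    # Match anything that is not printable ASCII, plus the backslash itself.
    return re.sub(r'[^ -~]|\\', repl, text)

print(escape_non_ascii('für 😀'))  # f\u00FCr \uD83D\uDE00
```

Because the result is plain ASCII, writing it to a file opened with encoding='utf-8' produces exactly these characters, byte for byte.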
    
Jiří Baum
  • This is a very clean solution. Thank you :) Exactly what I needed. It actually converts some characters that the "ascii" method did not, e.g. „” – PiWo Jul 16 '21 at 07:27

A simple solution would be ascii():

string = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System ' \
         'eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

print(ascii(string))

Output:

'Bitte \xfcberpr\xfcfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und l\xf6schen Sie dann die tats\xe4chlichen Dokumente.'

You can also use the unicode-escape codec to achieve this:

string = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System ' \
         'eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

# unicode-escape output is pure ASCII, so decode it as ASCII; decoding with
# raw-unicode-escape would turn \uXXXX escapes back into the original characters
print(string.encode('unicode-escape').decode('ascii'))

Output:

Bitte \xfcberpr\xfcfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und l\xf6schen Sie dann die tats\xe4chlichen Dokumente.

Note: ascii() escapes non-ASCII characters with \x, \u, or \U depending on the code point: \xhh below U+0100, \uhhhh up to U+FFFF, and \Uhhhhhhhh above that. In your case you see \x. But try this one:

print(ascii('س'))  # '\u0633'
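To see which escape form ascii() picks, here is a small sketch that runs it on one character from each code-point range. The sample characters are arbitrary choices for illustration.

```python
# One sample character per range: below U+0100, up to U+FFFF, above U+FFFF.
samples = ('ü', 'س', '😀')

results = [('U+%04X' % ord(ch), ascii(ch)) for ch in samples]
for code_point, escaped in results:
    print(code_point, escaped)

# U+00FC '\xfc'
# U+0633 '\u0633'
# U+1F600 '\U0001f600'
```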

If you really want to convert \xhh escape sequences to \u00hh, use re.sub() on the result of ascii():

import re
print(re.sub(r'\\x[a-f0-9]{2}', lambda x: r'\u00' + x.group()[-2:].upper(), ascii(string))) 

Output:

'Bitte \u00FCberpr\u00FCfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und l\u00F6schen Sie dann die tats\u00E4chlichen Dokumente.'
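One caveat with the regex above is a string that already contains a literal \x: ascii() turns it into \\x, and that should not be rewritten. A possible sketch that skips escaped backslashes and also upper-cases \u escapes follows. `ascii_with_u_escapes` is a hypothetical name, and \U escapes are deliberately left as ascii() produces them.

```python
import re

def ascii_with_u_escapes(text):
    # Hypothetical helper: post-process ascii() so that \xhh becomes \u00HH
    # and \uhhhh is upper-cased, while escaped backslashes stay untouched.
    def repl(m):
        if m.group(1):  # an escaped backslash pair: return it unchanged
            return m.group(1)
        if m.group(2):  # \xhh -> \u00HH
            return '\\u00' + m.group(2).upper()
        return '\\u' + m.group(3).upper()  # \uhhhh -> \uHHHH

    pattern = r'(\\\\)|\\x([0-9a-f]{2})|\\u([0-9a-f]{4})'
    return re.sub(pattern, repl, ascii(text))

print(ascii_with_u_escapes('für'))      # 'f\u00FCr'
print(ascii_with_u_escapes('a\\xfcb'))  # 'a\\xfcb'
```

Matching the doubled backslash first consumes it, so the x that follows is never mistaken for the start of an escape.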

The approaches above work for escaping any non-ASCII characters. If you only need to escape those three German letters and there are no other non-ASCII characters, take a look at the str.translate() method.
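A minimal sketch of the str.translate() route, assuming the input contains only German umlauts and ß as non-ASCII characters. The character set in `german_chars` is an assumption, so adjust it to the real data.

```python
# Assumed set of characters to escape; anything else passes through unchanged.
german_chars = 'äöüÄÖÜß'

# Map each character to its upper-case \uXXXX escape.
table = str.maketrans({ch: '\\u%04X' % ord(ch) for ch in german_chars})

text = 'löschen Sie die tatsächlichen Dokumente'
escaped = text.translate(table)
print(escaped)  # l\u00F6schen Sie die tats\u00E4chlichen Dokumente
```

Unlike the regex approaches, this silently leaves any character outside the table as it is, so it only fits when the input alphabet is known in advance.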

S.B
  • Hi, thanks for the answer! While it works, it produces "\x" escapes instead of "\u" escapes, which will be a problem further down the process (the file will be processed by a different program/programming language). Any idea how to force "\u" escapes? E.g. 'ü' == '\u00FC' == '\xfc' – PiWo Jul 15 '21 at 10:45
  • @PiWo honestly, I don't know when `ascii()` generates \x and when \u. In the last case, `print(ascii('س'))`, we see \u. If you use the encodings I mentioned in the answer, do you still have the problem? Doesn't it convert back to characters properly? – S.B Jul 15 '21 at 10:49
  • The program that reads the final product has to have a hard-coded conversion table for Unicode characters (it was written in COBOL, and I have no sources), and it expects \u notation. [https://stackoverflow.com/a/46132950/1216005](https://stackoverflow.com/a/46132950/1216005) explains more or less that it's just a shorter notation - I'll have to figure something out. – PiWo Jul 15 '21 at 10:58
  • @PiWo I've added a regex solution to convert those `\xhh`s to `\u00hh`. Now it's your expected result, but I'm not sure if this is the best solution. – S.B Jul 15 '21 at 11:48
  • I did something similar with .replace('\x', '\u00'). It's not clean, but it does the job for now, as it is a unique problem for this particular situation. Thank you! – PiWo Jul 15 '21 at 11:53
  • The only caveat with this would be if the string originally contains `\x` which would be encoded as `\\x` but should _not_ be converted to `\\u00` – Jiří Baum Jul 15 '21 at 12:17
  • Also, the question asked for upper-case, `\u00FC` rather than `\u00fc`; not sure if that matters – Jiří Baum Jul 15 '21 at 12:40
  • (Remember to also update the example output for the updated code.) – Jiří Baum Jul 15 '21 at 15:51
  • With the replacement getting more complex, it slowly converges with the approach of skipping `ascii()` and doing a `re.sub()` on the original string... – Jiří Baum Jul 15 '21 at 15:54
  • @sabik lol. Or even better, `str.translate()`, if he just needs this escaping for the German letters. – S.B Jul 15 '21 at 16:04