Selective replacement of unicode characters in Python using regex

Question

There are many answers on how to use regex to remove Unicode characters in Python.

See 'Remove Unicode code (\uxxx) in string Python' and 'Python regex module "re" match unicode characters with \u'.

However, in my case I don't want to replace every Unicode character, only the ones that are displayed as their \u code, not the ones that are shown properly as characters. I have tried both solutions, and they remove both types of Unicode characters.

With those solutions, \u2002pandemic becomes pandemic, but master’s also becomes masters.

Is there a general solution for removing the first type of Unicode character while keeping the second kind?

What do you have against those characters? – Kelly Bundy Mar 17 '23 at 16:34

Mark Tolonen · Accepted Answer · 2023-03-17T17:36:56.413

This relies on the fact that the debug representation (repr()) of a string shows non-printable characters as escape codes. It removes those escape codes (three types: \xnn, \unnnn, \Unnnnnnnn) from the repr() and evaluates the result back into a string with ast.literal_eval():

import re
import ast

text = '\x19\x40\u2002\u2019\U0001e526\U0001f235\\u1234\\U00012345\\xff\\\u2002'
#       ^^^^    ^^^^^^      ^^^^^^^^^^                                   ^^^^^^
# Remove the marked escapes; the rest are printable characters or literal backslashes.
# In the repr(), x/u/U preceded by an odd number of backslashes is an escape code.
print('printed text:   ', text)
print('repr() text:    ', repr(text))
clean_text = ast.literal_eval(re.sub(r'''(?x)                # verbose mode
                                         (?<!\\)             # not preceded by literal backslash
                                         ((?:\\\\)*)         # zero or more pairs of literal backslashes (group 1)
                                         \\                  # match a literal backslash
                                         (?:                 # non-capturing group
                                         (?:x[0-9a-f]{2}) |  # match an x and 2 hexadecimal digits OR
                                         (?:u[0-9a-f]{4}) |  # match a u and 4 hex digits OR
                                         (?:U[0-9a-f]{8})    # match a U and 8 hex digits
                                         )                   # end non-capturing group
                                         ''',
                                         r'\1'               # replace with group 1 (pairs of backslashes, if any)
                                         , repr(text)))      # string to operate on
print('cleaned text:   ', clean_text)
print('cleaned repr(): ', repr(clean_text))

Output:

printed text:    @ ’🈵\u1234\U00012345\xff\ 
repr() text:     '\x19@\u2002’\U0001e526🈵\\u1234\\U00012345\\xff\\\u2002'
cleaned text:    @’🈵\u1234\U00012345\xff\
cleaned repr():  '@’🈵\\u1234\\U00012345\\xff\\'

Note that you may not want to remove every character that displays as an escape code. The difference between a character's str() (print display) and repr() (debug display) can be intentional. For example, \u2002 is an EN SPACE (another type of space character) and prints as a space. The debug representation shows it as an escape code only so that you can tell an ASCII SPACE from an EN SPACE.
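To make the str() vs. repr() distinction concrete, here is a small sketch that prints how a few characters are classified. The helper name describe is made up for illustration; the calls are from the standard unicodedata module.

```python
import unicodedata


def describe(ch):
    """Summarise one character: repr, printable flag, Unicode category, name."""
    # 'no name' is a fallback chosen for this sketch; control characters have no name.
    name = unicodedata.name(ch, 'no name')
    return f"{ch!r} {ch.isprintable()} {unicodedata.category(ch)} {name}"


for ch in (' ', '\u2002', '\u2019', '\x19'):
    print(describe(ch))

# ' ' True Zs SPACE
# '\u2002' False Zs EN SPACE
# '’' True Pf RIGHT SINGLE QUOTATION MARK
# '\x19' False Cc no name
```

Both spaces are in category Zs, but only the ASCII one counts as printable, which is why repr() escapes the EN SPACE.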

This works for me; however, it adds quotes around the strings. Should I simply remove them afterwards? – user1627466 Mar 17 '23 at 16:52
@user1627466 printing with `repr()` adds the quotes. See `cleaned text: ` line. — Mark Tolonen, Mar 17 '23 at 16:54
Might be correct now :-). I think instead of the `(?<!\\)`, you could instead put `|` between the backslash-pairs and the rest. — Kelly Bundy, Mar 17 '23 at 17:47
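For reference, the alternation suggested in the last comment might look roughly like the sketch below. This is one reading of that suggestion, not code from the answer; the names ESCAPES and strip_escapes are made up, and it assumes Python 3.5+, where an unmatched group in the replacement is substituted as an empty string.

```python
import ast
import re

# Hypothetical pattern following the comment's idea: match EITHER a pair of
# backslashes (kept) OR a single backslash plus an escape code (dropped).
ESCAPES = re.compile(r'''
    (\\\\)                  # a pair of backslashes, i.e. one literal backslash (group 1)
    |                       # OR
    \\                      # a single backslash starting an escape code
    (?: x[0-9a-f]{2}        # x and 2 hex digits
      | u[0-9a-f]{4}        # u and 4 hex digits
      | U[0-9a-f]{8}        # U and 8 hex digits
    )
''', re.VERBOSE)


def strip_escapes(text):
    """Drop characters that repr() shows as hex escape codes; keep everything else."""
    # Pairs are consumed first, so a backslash that starts the second alternative
    # is never half of a literal backslash. Group 1 is empty in that case.
    return ast.literal_eval(ESCAPES.sub(r'\1', repr(text)))


print(strip_escapes('\u2002pandemic and master\u2019s'))   # pandemic and master’s
```

Because the scan is left to right and backslash pairs are always consumed as a unit, no lookbehind is needed to count backslashes.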

score 0 · Answer 2 · answered Mar 17 '23 at 17:53

There's str.isprintable() exactly for this type of thing:

src = 'a \u200d \x1f Ü ß ы'

cleaned = ''.join(c for c in src if c.isprintable())

print(repr(src))
print(repr(cleaned))

# 'a \u200d \x1f Ü ß ы'
# 'a   Ü ß ы'
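One difference from the regex approach is worth knowing: newlines and tabs are not printable either, so the filter above removes them too, while the repr()-based regex leaves \n and \t alone. A minimal sketch that keeps a chosen set of characters (the function name and the keep parameter are made up for illustration):

```python
def remove_nonprintable(text, keep='\n\t'):
    """Remove non-printable characters, except those listed in keep."""
    return ''.join(c for c in text if c.isprintable() or c in keep)


print(repr(remove_nonprintable('\u2002pandemic and master\u2019s\n')))
# 'pandemic and master’s\n'
```

Passing keep='' gives the same behaviour as the plain isprintable() filter.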

answered Mar 17 '23 at 17:53 · gog

Or `cleaned = ''.join(filter(str.isprintable, src))`. – Kelly Bundy Mar 17 '23 at 18:34
