This relies on the fact that the debug representation (repr()) of a string shows escape codes for non-printable characters. It removes those escape codes (three types: \xnn, \unnnn, \Unnnnnnnn) from the repr() and evaluates the result:
import re
import ast
text = '\x19\x40\u2002\u2019\U0001e526\U0001f235\\u1234\\U00012345\\xff\\\u2002'
#       ^^^^    ^^^^^^      ^^^^^^^^^^                                   ^^^^^^
# The marked escapes are removed; the others are printable characters or literal backslashes.
# In the repr(), an x/u/U sequence preceded by an odd number of backslashes is an escape code.
print('printed text: ', text)
print('repr() text: ', repr(text))
clean_text = ast.literal_eval(re.sub(r'''(?x)   # verbose mode
    (?<!\\)                 # not preceded by a literal backslash
    ((?:\\\\)*)             # zero or more pairs of literal backslashes (group 1)
    \\                      # match a literal backslash
    (?:                     # non-capturing group
        (?:x[0-9a-f]{2}) |  # match an x and 2 hexadecimal digits OR
        (?:u[0-9a-f]{4}) |  # match a u and 4 hex digits OR
        (?:U[0-9a-f]{8})    # match a U and 8 hex digits
    )                       # end non-capturing group
    ''',
    r'\1',                  # replace with group 1 (pairs of backslashes, if any)
    repr(text)))            # string to operate on
print('cleaned text: ', clean_text)
print('cleaned repr(): ', repr(clean_text))
Output:
printed text: @ ’🈵\u1234\U00012345\xff\
repr() text: '\x19@\u2002’\U0001e526🈵\\u1234\\U00012345\\xff\\\u2002'
cleaned text: @’🈵\u1234\U00012345\xff\
cleaned repr(): '@’🈵\\u1234\\U00012345\\xff\\'
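For comparison, here is a minimal sketch of a different technique: filtering on str.isprintable() directly instead of editing the repr() string. This is not the regex approach above. The strip_nonprintable helper and its keep parameter are hypothetical names chosen for illustration. The default keep value assumes you want to retain newline, tab and carriage return, which the regex also leaves alone because repr() shows them as \n, \t and \r rather than as \x escapes.

```python
def strip_nonprintable(text, keep='\n\t\r'):
    """Drop non-printable characters, except those listed in keep.

    Hypothetical helper, not part of the original answer.
    """
    return ''.join(ch for ch in text if ch.isprintable() or ch in keep)


# Literal backslash sequences such as \\u1234 are ordinary printable
# characters in the string itself, so no backslash counting is needed.
sample = '\x19\x40\u2002\u2019\\u1234\\xff\\\u2002line\n'
print(repr(strip_nonprintable(sample)))
print(repr(strip_nonprintable(sample, keep='')))
```

Because this works on the string itself rather than on its repr(), the odd/even backslash rule never comes into play. The trade-off is that it depends on isprintable() agreeing with what repr() escapes, which should hold in CPython but is worth checking against your own data.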
Note that you may not want to remove every character that displays as an escape code; the difference between its str() (print display) and repr() (debug display) can be useful. For example, \u2002 is an EN SPACE (another type of space character) and prints as a space. The debug representation shows it as an escape code so that you can tell an ASCII SPACE from an EN SPACE.
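To see which characters fall into this category, you can compare str.isprintable() with the repr() of individual characters. The following is a minimal sketch; the describe helper is a hypothetical name and not part of the answer.

```python
import unicodedata


def describe(ch):
    """Return (Unicode name, isprintable(), repr()) for one character.

    Hypothetical helper for illustration only.
    """
    # Control characters have no Unicode name, so supply a default.
    return unicodedata.name(ch, '<unnamed>'), ch.isprintable(), repr(ch)


# ASCII SPACE, EN SPACE, RIGHT SINGLE QUOTATION MARK, and a control character.
for ch in ' \u2002\u2019\x19':
    print(describe(ch))
```

The ASCII SPACE reports as printable and its repr() is a plain space, while the EN SPACE reports as non-printable and its repr() is the \u2002 escape, even though both print as blank space.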