python regex: how to remove hex dec characters from string

Question

import re
text = "\xe2\x80\x94"
print re.sub(r'(\\(?<=\\)x[a-z0-9]{2})+', "replacement_text", text)

output is —

How can I handle the hexadecimal characters in this situation?

Your input doesn't *have* backslashes. It has 3 bytes, the UTF-8 encoding for U+2014 EM DASH. — Martijn Pieters, Feb 25 '16 at 09:42
Dude, `text` **is not** literally `\xe2\x80\x94`. `\x` is a special character that tells Python that next two characters will be interpreted as hex digits for some character code. — freakish, Feb 25 '16 at 09:43

Martijn Pieters · Answer 1 · 2016-02-25T09:55:56.557

Your input doesn't have backslashes. It has 3 bytes, the UTF-8 encoding for the U+2014 EM DASH character:

>>> text = "\xe2\x80\x94"
>>> len(text)
3
>>> text[0]
'\xe2'
>>> text.decode('utf8')
u'\u2014'
>>> print text.decode('utf8')
—

You either need to match those UTF-8 bytes directly, or decode from UTF-8 to unicode and match the codepoint. The latter is preferable; always try to deal with text as Unicode to simplify how many characters you have to transform at a time.
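A minimal sketch of the second option, decoding first and then matching the codepoint. It is written for Python 3, where the byte string needs a `b` prefix and `print` is a function; the session above is Python 2. The surrounding words in `raw` and the variable names are illustrative assumptions, not part of the question.

```python
import re

# The three UTF-8 bytes from the question, embedded in some sample text.
raw = b"before \xe2\x80\x94 after"

# Decode once, then work with text; the em dash is now a single codepoint.
decoded = raw.decode("utf8")

# Match the codepoint directly (U+2014 EM DASH).
cleaned = re.sub("\u2014", "replacement_text", decoded)
print(cleaned)
```

After decoding there is exactly one character to match, so the pattern does not need to know anything about how UTF-8 lays out its bytes.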

Also note that Python's repr() output (which is used implicitly when echoing in the interactive interpreter or when printing lists, dicts or other containers) uses \xhh escape sequences to represent any non-printable character. For UTF-8 encoded byte strings, that includes every byte outside the ASCII range. You could just replace anything outside that range with:

re.sub(r'[\x80-\xff]+', "replacement_text", text)

Take into account that this'll match multiple UTF-8-encoded characters in a row, and replace these together as a group!
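To make the grouping behaviour concrete, here is a small Python 3 sketch comparing the byte-level pattern with a per-character replacement on decoded text. The sample input and the `?` replacement are assumptions chosen for illustration.

```python
import re

# Two em dashes in a row, then a separate e-acute, all UTF-8 encoded.
raw = b"a\xe2\x80\x94\xe2\x80\x94b\xc3\xa9c"

# Byte-level pattern: the six em dash bytes form one run, replaced once.
by_run = re.sub(rb"[\x80-\xff]+", b"?", raw)

# Decoding first lets each non-ASCII character be replaced on its own.
by_char = re.sub(r"[^\x00-\x7f]", "?", raw.decode("utf8"))

print(by_run)
print(by_char)
```

The byte-level version cannot tell where one encoded character ends and the next begins, which is why adjacent characters collapse into a single replacement.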

score 0 · Answer 2 · edited May 23 '17 at 12:31


Your input does not contain the literal text "\xe2\x80\x94". In a string literal, \x is an escape sequence: it tells Python to interpret the next two characters as the hexadecimal value of a single byte.

This was explained in this post.
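The difference can be checked by comparing the escaped literal with a raw string that really does contain backslashes. This is a Python 3 sketch; the names `escaped` and `literal` are made up for the example, and the pattern is the one from the question.

```python
import re

# The pattern from the question: a backslash, "x", then two characters.
pattern = r'(\\(?<=\\)x[a-z0-9]{2})+'

escaped = "\xe2\x80\x94"    # escape sequences: three characters, no backslashes
literal = r"\xe2\x80\x94"   # raw string: twelve characters, three backslashes

print(len(escaped), len(literal))

# The pattern only finds something when the backslashes are really there.
print(re.sub(pattern, "replacement_text", literal))
print(re.sub(pattern, "replacement_text", escaped) == escaped)
```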

answered Feb 25 '16 at 09:45 by Isdj
