-1
text="\xe2\x80\x94"
print re.sub(r'(\\(?<=\\)x[a-z0-9]{2})+',"replacement_text",text)

output is

How can I handle the hexadecimal characters in this situation?

Martijn Pieters
aman
  • 1
    Your input doesn't *have* backslashes. It has 3 bytes, the UTF-8 encoding for U+2014 EM DASH. – Martijn Pieters Feb 25 '16 at 09:42
  • 2
    Dude, `text` **is not** literally `\xe2\x80\x94`. `\x` is a special character that tells Python that next two characters will be interpreted as hex digits for some character code. – freakish Feb 25 '16 at 09:43

2 Answers

2

Your input doesn't have backslashes. It has 3 bytes, the UTF-8 encoding for the U+2014 EM DASH character:

>>> text = "\xe2\x80\x94"
>>> len(text)
3
>>> text[0]
'\xe2'
>>> text.decode('utf8')
u'\u2014'
>>> print text.decode('utf8')
—

You either need to match those UTF-8 bytes directly, or decode from UTF-8 to unicode and match the codepoint. The latter is preferable; always try to deal with text as Unicode, so that each character is a single codepoint rather than a run of bytes you have to transform together.
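As a rough sketch of those two options (the variable names are only illustrative, and the literals carry `b`/`u` prefixes so the snippet should behave the same on Python 2.7 and Python 3):

```python
import re

raw = b"\xe2\x80\x94"          # the three UTF-8 bytes from the question
decoded = raw.decode("utf-8")  # a single codepoint, U+2014 EM DASH

# Option 1 (preferred): match the codepoint in the decoded text.
by_codepoint = re.sub(u"\u2014", u"replacement_text", decoded)

# Option 2: match the exact UTF-8 byte sequence in the undecoded data.
by_bytes = re.sub(b"\xe2\x80\x94", b"replacement_text", raw)

print(len(raw), len(decoded))
print(by_codepoint)
```

Both calls replace the whole em dash; the difference is only whether the pattern has to spell out one codepoint or three bytes.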

Also note that Python's repr() output (which is used implicitly when echoing in the interactive interpreter or when printing lists, dicts or other containers) uses \xhh escape sequences to represent any non-printable character. For UTF-8 byte strings, that includes anything outside the ASCII range. You could just replace anything outside that range with:

re.sub(r'[\x80-\xff]+', "replacement_text", text)

Take into account that this'll match multiple UTF-8-encoded characters in a row, and replace these together as a group!
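To illustrate that grouping, here is a small sketch with two em dashes in a row. The sample input and the `?` placeholder are made up for the example, and the per-character variant is just one possible alternative, not something the pattern above does:

```python
import re

# Two UTF-8 encoded em dashes in a row, surrounded by ASCII text.
data = b"a\xe2\x80\x94\xe2\x80\x94b"

# Byte-level pattern with '+': all six non-ASCII bytes form one run,
# so they are replaced together by a single placeholder.
grouped = re.sub(br"[\x80-\xff]+", b"?", data)

# Decoded text without '+': each non-ASCII codepoint is replaced separately.
per_char = re.sub(u"[^\\x00-\\x7f]", u"?", data.decode("utf-8"))

print(grouped == b"a?b")
print(per_char == u"a??b")
```

If you need one replacement per character, decoding first is the simpler route, since each character is then exactly one item to match.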

Martijn Pieters
0

Your input is written with hex escapes; it is not the literal text "\xe2\x80\x94". \x is just the way to tell Python that the following two characters should be interpreted as hex digits for a single byte.

This was explained in this post.
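A quick way to see the difference is to compare the escaped string with a raw string holding the same characters literally. This is only a sketch: the regex is the one from the question, turned into a bytes pattern so the snippet runs on both Python 2.7 and Python 3.

```python
import re

escaped = b"\xe2\x80\x94"   # \x escapes: three bytes, no backslashes
literal = br"\xe2\x80\x94"  # raw literal: twelve characters, backslashes included

# The pattern from the question looks for a backslash, an 'x' and two characters.
pattern = br"(\\(?<=\\)x[a-z0-9]{2})+"

# It only matches when the backslashes are really present in the data.
on_literal = re.sub(pattern, b"replacement_text", literal)
on_escaped = re.sub(pattern, b"replacement_text", escaped)

print(len(escaped), len(literal))
print(on_escaped == escaped)
```

The raw version is replaced as a whole, while the escaped version comes back unchanged, which is the behaviour described in the question.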

Community
Isdj