I'm trying to delete hex escape sequences (such as \xc3) from strings of text, and I plan to use a regular expression to get rid of them. Here is my code:

import re
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'    
tweet1 = re.sub(r'\\x[a-f0-9]{2}', '', tweet)
print(tweet1)

However, instead of the hex sequences being deleted, the output shows them as decoded characters. Here is my output:

b"[/Very seldom~ will someone enter your life] to questionââ¬Â¦ "

Does anybody know how I can get rid of those hex sequences? Thanks in advance.

norpa

3 Answers

0

Try tweet1.decode('ascii','ignore') after applying the regex.

vendaTrout
  • I get this error: `AttributeError: 'str' object has no attribute 'decode'`. Should I encode instead? – norpa Feb 14 '17 at 12:34
  • Yes: `tweet1.encode('ascii','ignore')`. In Python 3.x, `str` no longer has a `decode` method. My bad, although you should mention in the tags that this is a Python 3.x question. – vendaTrout Feb 17 '17 at 03:35
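For reference, here is a minimal sketch of what the `encode('ascii', 'ignore')` suggestion does in Python 3. The shortened `tweet` value and the variable names `ascii_bytes` and `cleaned` are only illustrative. Note that `encode` returns a `bytes` object, so a second step is needed to get a `str` back, and that this approach drops every non-ASCII character, not just the unwanted ones.

```python
# Shortened version of the tweet from the question (illustrative).
tweet = 'to question\xc3\xa2\xe2\x82\xac\xc2\xa6'

# str.encode returns bytes; 'ignore' silently drops every character
# that cannot be represented in ASCII.
ascii_bytes = tweet.encode('ascii', 'ignore')

# Decode again to end up with a str rather than a bytes object.
cleaned = ascii_bytes.decode('ascii')
print(cleaned)
```

With this approach the regex from the question is not needed at all.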
0

You can try something like this:

import re
import string

tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'    
tweet1 = re.sub(r'[^\w\s{}]'.format(string.punctuation), '', tweet)
print(tweet1)

Output:

b"[Very seldom~ will someone enter your life] to question"

Regex:

[^\w\s{}] - Matches every character that is not a \w, a \s, or a punctuation character (format fills the {} with string.punctuation).
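One likely reason this pattern leaves characters behind in Python 3 is that `\w` is Unicode-aware for `str` patterns, so letters such as Ã and â count as word characters. The sketch below is a variation on the answer's code, not the answer's own method: it adds the `re.ASCII` flag so that `\w` and `\s` match ASCII only, and it passes `string.punctuation` through `re.escape` so that characters like `]`, `\` and `^` are not read as character-class syntax.

```python
import re
import string

tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6"'

# re.escape keeps ], \, ^ and - from being interpreted as class syntax.
pattern = r'[^\w\s{}]'.format(re.escape(string.punctuation))

# With re.ASCII, \w and \s match only ASCII characters, so the
# non-ASCII letters are no longer protected by \w.
tweet1 = re.sub(pattern, '', tweet, flags=re.ASCII)
print(tweet1)
```

Because `/` is part of `string.punctuation`, it is kept by this pattern.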

Mohammad Yusuf
  • I still get the same output. Here it is: `b"[/Very seldom~ will someone enter your life] to questionÃâ "` Any idea what I can do? – norpa Feb 14 '17 at 12:40
0

Actually, the issue is how I modeled the problem. tweet doesn't contain the literal characters \xc3\xa2...; Python interprets those escape sequences when the string literal is evaluated, so each one becomes a single character. The regex is looking for the four-character text \xc3, but what tweet contains in that position is actually Ã.
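A small sketch of the difference, using a shortened string and the illustrative names `cooked` and `raw`: in a normal literal each `\xNN` escape becomes one character, while a raw literal keeps the backslash, the `x` and the two hex digits as text, which is the only case the regex from the question can match.

```python
import re

# Normal literal: each \xNN escape is interpreted as ONE character.
cooked = 'question\xc3\xa2'
# Raw literal: backslash, x and the two hex digits stay as four characters.
raw = r'question\xc3\xa2'

print(len(cooked), len(raw))

pattern = r'\\x[a-f0-9]{2}'
# Nothing to match in the normal literal, so the string comes back unchanged.
print(re.sub(pattern, '', cooked) == cooked)
# The raw literal really contains backslash sequences, so they are removed.
print(re.sub(pattern, '', raw))
```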

The solution is to encode the string as UTF-8, convert the resulting bytes to a string, and then use the regex to get rid of the hex sequences. I got the lead from this post (see the first answer, by Martijn Pieters): python regex: how to remove hex dec characters from string
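A minimal sketch of those steps as I read them (this is not the code from the linked answer, and the shortened `tweet` and the names `encoded`, `as_text`, `cleaned` and `text` are only illustrative). It assumes "convert to string" means calling `str()` on the bytes object, which yields its repr with the non-ASCII bytes written out as literal `\xNN` sequences.

```python
import re

# Shortened version of the tweet from the question (illustrative).
tweet = 'to question\xc3\xa2\xe2\x82\xac\xc2\xa6'

# Step 1: encode to UTF-8 bytes.
encoded = tweet.encode('utf-8')

# Step 2: str() of a bytes object gives its repr, in which every
# non-ASCII byte appears as a literal backslash-x sequence.
as_text = str(encoded)

# Step 3: now the regex has real backslash sequences to match.
cleaned = re.sub(r'\\x[a-f0-9]{2}', '', as_text)
print(cleaned)

# Assumption: the repr wrapper is b'...', so slicing strips the prefix and quotes.
text = cleaned[2:-1]
print(text)
```

One caveat of going through the repr: any backslashes or quotes in the original text are escaped as well, so this sketch only suits text that does not contain them.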

norpa