Some tweets contain a symbol like this:

“@BrownieSWP: High is s***????” you like 12 tf

The symbol is not the ASCII double quote ("). I wrote this regex to replace it:

re.sub('(“|”)', '"', tweet)

This regex, (“|”), worked in Sublime Text, but it didn't work in Python.

Luke Willis
stamaimer

2 Answers


The character you have copy/pasted is a U+201C "LEFT DOUBLE QUOTATION MARK". In the re.sub() you also have the corresponding right quotation mark U+201D. Perhaps the environment in which you tried to paste it wasn't set up to handle Unicode correctly, and converted it to some other encoding. (See also How do I see the current encoding of a file in Sublime Text 2?)
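
To check which characters a string actually contains, here is a minimal sketch. It assumes Python 3 and uses the sample tweet from the question; the helper name describe_non_ascii is made up for illustration. It lists each non-ASCII character with its code point and Unicode name:

```python
import unicodedata

# Sample tweet from the question, with the curly quotes written as escapes
# so the source file itself stays pure ASCII.
tweet = "\u201c@BrownieSWP: High is s***????\u201d you like 12 tf"


def describe_non_ascii(text):
    """Return (code point, Unicode name) pairs for each non-ASCII character.

    Hypothetical helper, not part of any library.
    """
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


for code_point, name in describe_non_ascii(tweet):
    print(code_point, name)
```

If the output shows something other than U+201C and U+201D, the text was probably altered by an encoding conversion somewhere along the way.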

You can always use Python's escape codes to refer to a Unicode character unambiguously and in an ASCII-compatible way: re.sub(u'[\u201c\u201d]', '', tweet)
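
As a sketch of that approach applied to the question's original goal (replacing the curly quotes with a plain ASCII quote rather than deleting them), the following assumes Python 3, where string literals are Unicode by default. The function name normalize_quotes is made up for illustration:

```python
import re

# Character class matching U+201C (left) or U+201D (right) double quotation mark.
CURLY_DOUBLE_QUOTES = re.compile("[\u201c\u201d]")


def normalize_quotes(text):
    """Replace curly double quotes with the ASCII double quote (hypothetical helper)."""
    return CURLY_DOUBLE_QUOTES.sub('"', text)


# Sample tweet from the question, written with escapes instead of pasted characters.
tweet = "\u201c@BrownieSWP: High is s***????\u201d you like 12 tf"
print(normalize_quotes(tweet))
```

On Python 2 the pattern would need the u prefix, as in the line above, and tweet would need to be a unicode object rather than a byte string.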

tripleee
  • This is the method I prefer if there's ever a situation where Unicode character support in editors may be questioned. From annoying previous experience, it only took one misconfigured editor to screw up an entire source file. – Rejected Aug 28 '14 at 16:08

It works for me:

>>> import re
>>> s = r"“@BrownieSWP: High is s***????” you like 12 tf"
>>> m = re.sub(r'[”“]', r'', s)
>>> m
'@BrownieSWP: High is s***???? you like 12 tf'
Luke Willis
Avinash Raj