There is a symbol in some tweets: “
“@BrownieSWP: High is s***????” you like 12 tf
The symbol is not the plain ASCII ". I wrote this regex to match it:
re.sub('(“|”)', '"', tweet)
The regex (“|”) worked in Sublime Text, but it didn't work in Python.
The character you have copy/pasted is U+201C "LEFT DOUBLE QUOTATION MARK". In the re.sub() call you also have the corresponding right quotation mark, U+201D. Perhaps the environment in which you tried to paste it wasn't set up to handle Unicode correctly and converted it to some other encoding. (See also "How do I see the current encoding of a file in Sublime Text 2?")
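To check which characters a string really contains, you can ask Python for their code points and Unicode names. This is only a minimal sketch: describe_chars is a hypothetical helper name, and it assumes the tweet is already a Unicode string (a unicode object on Python 2, a str on Python 3) rather than raw bytes.

```python
import unicodedata

def describe_chars(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, 'U+%04X' % ord(ch), unicodedata.name(ch, 'UNKNOWN'))
        for ch in text
        if ord(ch) > 127
    ]

# The sample tweet, with the curly quotes written as escapes
tweet = u'\u201c@BrownieSWP: High is s***????\u201d you like 12 tf'
for entry in describe_chars(tweet):
    print(entry)
```

If the quotes show up as something other than U+201C and U+201D, or as several odd characters each, the text was probably decoded with the wrong encoding somewhere along the way.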
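You can always use Python's escape codes to refer to a Unicode character unambiguously and in plain ASCII: re.sub(u'[\u201c\u201d]', '', tweet). This removes the curly quotes; pass u'"' as the replacement instead if you want to turn them into straight quotes, as in your original call.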
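Put together as a complete script, the escape-code approach might look like the sketch below. The names CURLY_QUOTES and straighten_quotes are made up for illustration, and the replacement is the straight quote from the question rather than the empty string.

```python
import re

# U+201C and U+201D written as escapes, so the source file stays pure ASCII
# and does not depend on the editor's or the interpreter's source encoding.
CURLY_QUOTES = re.compile(u'[\u201c\u201d]')

def straighten_quotes(text):
    """Replace curly double quotes with a plain ASCII double quote."""
    return CURLY_QUOTES.sub(u'"', text)

tweet = u'\u201c@BrownieSWP: High is s***????\u201d you like 12 tf'
result = straighten_quotes(tweet)
print(result)
```

On Python 2 this assumes the tweet has already been decoded to a unicode object; a byte string would need something like tweet.decode('utf-8') first, provided the data really is UTF-8.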
It works for me:
>>> import re
>>> s = r"“@BrownieSWP: High is s***????” you like 12 tf"
>>> m = re.sub(r'[”“]', r'', s)
>>> m
'@BrownieSWP: High is s***???? you like 12 tf'