I have millions of strings scraped from the web, like:
s = 'WHAT\xe2\x80\x99S UP DOC?'
type(s) == str # returns True
Special characters like the ones in the string above are inevitable when scraping from the web. How should I remove all such special characters and keep just the clean text? Based on my very limited experience with Unicode characters, I am thinking of a regular expression like this:
\\x.*[0-9]
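To make the sample concrete, here is a minimal sketch of what the string appears to contain, assuming the scraped data is UTF-8 (an assumption; the real encoding may differ per page). Under that assumption, `\xe2\x80\x99` is three raw bytes encoding the single character U+2019 (a right single quotation mark), not a literal backslash followed by `x`, so a pattern that starts with `\\x` would likely find nothing to match. The names `raw`, `text`, `ascii_only`, and `stripped` are illustrative only.

```python
import re

# The sample from the question as raw bytes. In Python 2 a plain str is
# already a byte string; the b'' prefix keeps this runnable on Python 3 too.
raw = b'WHAT\xe2\x80\x99S UP DOC?'

# Assumption: the bytes are UTF-8. Decoding turns the three bytes
# \xe2\x80\x99 into the single character U+2019.
text = raw.decode('utf-8')

# One possible cleanup: drop every character that has no ASCII equivalent.
ascii_only = text.encode('ascii', 'ignore').decode('ascii')

# A regex alternative that works on the raw bytes: remove every byte
# in the range 0x80-0xff, i.e. everything outside 7-bit ASCII.
stripped = re.sub(br'[\x80-\xff]+', b'', raw)

print(repr(text))
print(ascii_only)
print(stripped)
```

Both variants discard the apostrophe entirely, which may or may not be the "clean text" wanted here; mapping U+2019 to a plain `'` before stripping would be a different design choice.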