I was doing some Twitter mining and pulled the JSON of tweets into Python 3 via pandas.
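The loading step was roughly like this (a minimal sketch; tweets.json and the exact layout of the dump are just placeholders for however the tweets were saved):

import pandas as pd

# load a JSON array of tweet/status objects into a DataFrame
# (assumes each object has a 'full_text' field, i.e. the tweets were fetched in extended mode)
data = pd.read_json('tweets.json')

# the tweet text is the only column I care about here
print(data['full_text'].head())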
Before processing further, I noticed a lot of the data was not consistent, clean, or even useful to me (for now), so I used regex to make the tweet strings consistent or delete the offending parts.
This is what that looks like:
data['full_text'] = data['full_text'].replace('^@ABC(\\u2019s)*[ ,\n\t]*', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\n)', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\t)', '.', regex=True)
data['full_text'] = data['full_text'].replace('(\\u2018)|(\\u2019)', "'", regex=True)
data['full_text'] = data['full_text'].replace('(\\u201c)|(\\u201d)', "\"", regex=True)
data['full_text'] = data['full_text'].replace('(\\n)|(\\t)', '', regex=True)
In other words:
- remove the Twitter handle if it appears at the beginning (including any punctuation attached to it)
- JSON should have no issue with apostrophes, so keep everything consistent and replace the Unicode left/right apostrophes with a plain '
- some tweets use a backslash-escaped quote while others use the Unicode curly quotes, so keep it consistent and replace the Unicode ones with \"
- delete all tabs
- assume all new lines are new sentences, so replace them with a full stop
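To illustrate, here is the same chain run on a single made-up tweet (not from my data), just to show what those rules end up doing:

import pandas as pd

# one fabricated tweet containing a handle, a newline, a tab and curly quotes
sample = pd.Series(['@ABC\u2019s, Great news\nmore text\there \u2018quoted\u2019 \u201cdouble\u201d'])

sample = sample.replace('^@ABC(\\u2019s)*[ ,\n\t]*', '', regex=True)
sample = sample.replace('(\\n)', '', regex=True)
sample = sample.replace('(\\t)', '.', regex=True)
sample = sample.replace('(\\u2018)|(\\u2019)', "'", regex=True)
sample = sample.replace('(\\u201c)|(\\u201d)', "\"", regex=True)

print(sample.iloc[0])
# Great newsmore text.here 'quoted' "double"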
As far as I'm aware, this is all that is really needed. Things like ~ are likely to be useless, with no real purpose to them. The tweets will also have emoticons that I don't care about (for now).
The rest of the punctuation and these emoticons follow the format \uXXXX, where each X is a number or letter (i.e. a hex digit).
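(A quick way to see those escapes for any one tweet, if it helps show what I mean; row 0 is just an arbitrary example:)

# print one tweet with every non-ASCII character shown in escaped form
raw = data['full_text'].iloc[0]
print(raw.encode('unicode_escape').decode('ascii'))
# non-ASCII characters come out as \uXXXX (or \UXXXXXXXX for characters
# outside the Basic Multilingual Plane, which covers a lot of emoji)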
So the last line I was planning on is the one below:
data['full_text'] = data['full_text'].replace('(\\u\w\w\w\w)', "", regex=True)
Given the large number of tweets I have, I can't verify whether everything worked correctly, which is why I'm asking if anyone can give some advice.
From my research, I kept seeing people post things like:
([\u2600-\u27BF])|([\uD83C][\uDF00-\uDFFF])|([\uD83D][\uDC00-\uDE4F])|([\uD83D][\uDE80-\uDEFF])
but when I try these, I still see emoticons etc. left in the JSON. So why not just use \u\w\w\w\w (especially when used as the last step)?
=====================================================================
Update:
data['full_text'] = data['full_text'].replace('^@ABC(\\u2019s)*[ ,\n\t]*', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\n)', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\t)', '.', regex=True)
data['full_text'] = data['full_text'].replace('(\\u2018)|(\\u2019)', "'", regex=True)
data['full_text'] = data['full_text'].replace('(\\u201c)|(\\u201d)', "\"", regex=True)
data['full_text'] = data['full_text'].replace('https://t\\.co/\\w{10}', "", regex=True)
import string
data['full_text'] = data['full_text'].replace('[^{}]'.format(string.printable), '', regex=True)
It works, thanks to James, although I'm getting conflicting information. Is the last line appropriate? Is it deleting anything more than just the Unicode characters?
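For reference, this is roughly how I was thinking of checking what that last line actually strips, run just before the string.printable replace (a sketch only, in case it helps pin down the question):

import string

# string.printable is digits + ASCII letters + ASCII punctuation + whitespace,
# so the last replace deletes every character outside that set
keep = set(string.printable)

# all distinct characters in the corpus that the last line would remove
removed = set(''.join(data['full_text'].astype(str))) - keep
print(sorted(removed))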