I'm analyzing Twitter data for sentiment analysis and need to tokenize the tweets. Here is an example tweet:
tweet = "Barça, que más veces ha jugado contra 10 en la historia https://twitter.com/7WUjZrMJah #UCL"
nltk.word_tokenize() tokenizes the tweets reasonably well, but it splits the links and hashtags into pieces:
from nltk.tokenize import word_tokenize
word_tokenize(tweet)
>>> ['Bar\xc3\xa7a', ',', 'que', 'm\xc3\xa1s', 'veces', 'ha', 'jugado', 'contra', '10', 'en', 'la', 'historia', 'https', ':', '//twitter.com/7WUjZrMJah', '#', 'UCL']
The unicode characters remain intact, but the links are broken apart. So I wrote a custom regex tokenizer:
import re

emoticons = r'(?:[:;=\^\-oO][\-_\.]?[\)\(\]\[\-DPOp_\^\\\/])'
regex_tweets = [
    emoticons,
    r'<[^>]+>',                      # HTML tags
    r'(?:@[\w\d_]+)',                # @-mentions
    r'(?:\#[\w]+)',                  # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r"(?:[a-z][a-z'\-_]+[a-z])",     # words with - and '
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',    # numbers
    r'(?:[\w_]+)',                   # other words
    r'(?:\S)',                       # any other non-whitespace character
]
# compile the regex
tokens_re = re.compile(r'(' + '|'.join(regex_tweets) + r')', re.IGNORECASE | re.VERBOSE)
tokens_re.findall(tweet)
>>> ['Bar', '\xc3', '\xa7', 'a', ',', 'que', 'm', '\xc3', '\xa1', 's', 'veces', 'ha', 'jugado', 'contra', '10', 'en', 'la', 'historia', 'https://twitter.com/7WUjZrMJah', '#UCL']
Now the hashtags and links appear the way I want them to, but it breaks at unicode characters (like Barça -> ['Bar', '\xc3', '\xa7', 'a'] instead of ['Bar\xc3\xa7a']).
Is there any way I can integrate both of these? Or a regular expression that includes unicode characters?
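I suspect part of the problem is that I'm matching against a byte string, so [a-z] and \w never treat 'ç' as a letter. Below is a rough sketch of what I've been experimenting with (assuming Python 2: decode the tweet to unicode, add re.UNICODE, and use \w instead of [a-z] in the word pattern; regex_tweets_unicode is just my renamed copy of the list above), but I'm not sure this is the right fix:

# -*- coding: utf-8 -*-
import re

# same patterns as above, but the "words with - and '" rule uses \w
# so accented letters stay inside the token
regex_tweets_unicode = [
    r'(?:[:;=\^\-oO][\-_\.]?[\)\(\]\[\-DPOp_\^\\\/])',  # emoticons
    r'<[^>]+>',                      # HTML tags
    r'(?:@[\w\d_]+)',                # @-mentions
    r'(?:\#[\w]+)',                  # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r"(?:\w[\w'\-_]+\w)",            # words with - and ' (was [a-z]-only)
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',    # numbers
    r'(?:[\w_]+)',                   # other words
    r'(?:\S)',                       # any other non-whitespace character
]

tokens_re = re.compile(r'(' + '|'.join(regex_tweets_unicode) + r')',
                       re.IGNORECASE | re.VERBOSE | re.UNICODE)

# decode the byte string first so \w sees 'ç' and 'á' as single characters
print(tokens_re.findall(tweet.decode('utf-8')))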
I have also tried TweetTokenizer from the nltk.tokenize library, but it wasn't very useful.
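For reference, this is roughly how I invoked it (just the defaults):

from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tknzr.tokenize(tweet)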