
I'm analyzing Twitter data for sentiment analysis, and I need to tokenize the tweets.

Take this example tweet:

tweet = "Barça, que más veces ha jugado contra 10 en la historia https://twitter.com/7WUjZrMJah #UCL"

nltk.word_tokenize() handles the tweets reasonably well, but it breaks URLs and hashtags apart.

from nltk import word_tokenize
word_tokenize(tweet)

>>> ['Bar\xc3\xa7a', ',', 'que', 'm\xc3\xa1s', 'veces', 'ha', 'jugado', 'contra', '10', 'en', 'la', 'historia', 'https', ':', '//twitter.com/7WUjZrMJah', '#', 'UCL']

The unicode characters remain intact, but the links are broken. So I designed a custom regex tokenizer:

import re

emoticons = r'(?:[:;=\^\-oO][\-_\.]?[\)\(\]\[\-DPOp_\^\\\/])'

regex_tweets = [
    emoticons,
    r'<[^>]+>',          # HTML tags
    r'(?:@[\w\d_]+)',    # @-mentions
    r'(?:\#[\w]+)',      # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r"(?:[a-z][a-z'\-_]+[a-z])",   # words with - and '
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r'(?:[\w_]+)',       # other words
    r'(?:\S)'            # anything else, one character at a time
]

# compiling the combined regex
tokens_re = re.compile(r'(' + '|'.join(regex_tweets) + r')', re.IGNORECASE | re.VERBOSE)
tokens_re.findall(tweet)

>>> ['Bar', '\xc3', '\xa7', 'a', ',', 'que', 'm', '\xc3', '\xa1', 's', 'veces', 'ha', 'jugado', 'contra', '10', 'en', 'la', 'historia', 'https://twitter.com/7WUjZrMJah', '#UCL']

Now the hashtags and links appear the way I want them to, but it breaks at unicode characters (like Barça -> ['Bar', '\xc3', '\xa7', 'a'] instead of ['Bar\xc3\xa7a']).

Is there any way I can combine the two? Or a regular expression that also matches unicode characters?

I have also tried TweetTokenizer from the nltk.tokenize library, but it wasn't very useful.
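For reference, this is roughly how I invoked it (a sketch assuming NLTK 3.x; the constructor arguments shown are its standard options):

from nltk.tokenize import TweetTokenizer

# strip_handles drops @-mentions, reduce_len collapses long runs of repeated
# characters; both default to False
tknzr = TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)
tknzr.tokenize(tweet)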

  • You need to also specify the `re.U` flag: `re.IGNORECASE | re.VERBOSE | re.UNICODE`. Also note that `[\w\d_]+` = `\w+`, and `(?:(?:\d+,?)+(?:\.?\d+)?)` looks fragile. – Wiktor Stribiżew Apr 06 '16 at 19:21
  • @WiktorStribiżew Did that, but nothing changed. – Krishh Apr 06 '16 at 19:25
  • @WiktorStribiżew I'll go through the regular expressions again later to optimise them and make them more robust; they are working fine for now, I need to catch the unicode words. – Krishh Apr 06 '16 at 19:27
  • Is it Python 2.7? Are you encoding the input texts as UTF-8? – Wiktor Stribiżew Apr 06 '16 at 19:32
  • @WiktorStribiżew Yes, it's Python 2.7, and I have added `# coding=utf-8` at the top, if that's what you meant... – Krishh Apr 06 '16 at 19:34
  • No, something [like this](http://stackoverflow.com/questions/32863608/regex-python-with-unicode-japanese-character-issue/32868484#32868484). – Wiktor Stribiżew Apr 06 '16 at 19:35
  • @WiktorStribiżew I didn't do that; now it seems to be working the way I desired it to. It's still breaking at a few characters, but it's better than before! Thanks! – Krishh Apr 06 '16 at 19:55
  • You can post your solution, BTW. – Wiktor Stribiżew Apr 06 '16 at 20:22
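The fix linked in the comments boils down to matching against decoded unicode rather than raw bytes, and compiling with `re.UNICODE`. A minimal sketch, assuming Python 2.7 and UTF-8 encoded input:

# coding=utf-8
import re

raw = "Barça, que más veces ha jugado contra 10 en la historia"  # byte string, as read from a file or the API
text = raw.decode('utf-8')                                       # decode to unicode before matching

# with re.UNICODE, \w also matches accented letters like ç and á
word_re = re.compile(r'\w+', re.UNICODE)
word_re.findall(text)
# [u'Bar\xe7a', u'que', u'm\xe1s', u'veces', u'ha', u'jugado', u'contra', u'10', u'en', u'la', u'historia']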

1 Answer


It turns out that most of the unicode characters don't break if I declare the string as a unicode string. It still breaks some words, but the results are better.

# coding=utf-8
import re

tweet = u"Barça, que más veces ha jugado contra 10 en la historia https://twitter.com/7WUjZrMJah #UCL"

emoticons = r'(?:[:;=\^\-oO][\-_\.]?[\)\(\]\[\-DPOp_\^\\\/])'

regex_tweets = [
    emoticons,
    r'<[^>]+>',          # HTML tags
    r'(?:@[\w\d_]+)',    # @-mentions
    r'(?:\#[\w]+)',      # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r"(?:[a-z][a-z'\-_]+[a-z])",   # words with - and '
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r'(?:[\w_]+)',       # other words
    r'(?:\S)'            # anything else, one character at a time
]

# compiling the combined regex
tokens_re = re.compile(r'(' + '|'.join(regex_tweets) + r')', re.IGNORECASE | re.VERBOSE)
tokens_re.findall(tweet)

>>> [u'Bar', u'\xe7a', u',', u'que', u'm\xe1s', u'veces', u'ha', u'jugado', u'contra', u'10', u'en', u'la', u'historia', u'https://twitter.com/7WUjZrMJah', u'#UCL']

It still tokenises Barça as [u'Bar', u'\xe7a'], which is better than ['Bar', '\xc3', '\xa7', 'a'] but still not the single token ['Bar\xc3\xa7a'] that word_tokenize produced. Still, it works for most expressions.
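One remaining cause of splits is the word pattern `(?:[a-z][a-z'\-_]+[a-z])`, which can never match accented letters. A sketch of a further fix, assuming Python 2.7 (only that one alternative is changed, plus the `re.UNICODE` flag from the comments):

# coding=utf-8
import re

tweet = u"Barça, que más veces ha jugado contra 10 en la historia https://twitter.com/7WUjZrMJah #UCL"

emoticons = r'(?:[:;=\^\-oO][\-_\.]?[\)\(\]\[\-DPOp_\^\\\/])'

regex_tweets = [
    emoticons,
    r'<[^>]+>',          # HTML tags
    r'(?:@[\w\d_]+)',    # @-mentions
    r'(?:\#[\w]+)',      # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r"(?:\w[\w'\-]+\w)", # words with - and ', now \w-based so ç and á count as letters
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r'(?:[\w_]+)',       # other words
    r'(?:\S)'            # anything else, one character at a time
]

tokens_re = re.compile(r'(' + '|'.join(regex_tweets) + r')', re.IGNORECASE | re.VERBOSE | re.UNICODE)
tokens_re.findall(tweet)
# [u'Bar\xe7a', u',', u'que', u'm\xe1s', u'veces', u'ha', u'jugado', u'contra', u'10',
#  u'en', u'la', u'historia', u'https://twitter.com/7WUjZrMJah', u'#UCL']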
