I want to tokenize

s = ("mary went to garden. where is mary? "
     "mary is carrying apple and milk. "
     "what mary is carrying? apple,milk")

into

['mary', 'went', 'to', 'garden', '.', 
 'where', 'is', 'mary', '?', 
 'mary', 'is', 'carrying', 'apple', 'and', 'milk', '.', 
 'what', 'mary', 'is', 'carrying', '?', 'apple,milk']

Please note that I want to keep 'apple,milk' as one word.

My code is:

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer('\w+[\]|\w+[\,]\w+|\.|\?')
s = "mary went to garden. where is mary? mary is carrying apple and milk. what mary is carrying? apple,milk"
tokenizer.tokenize(s)

the result is:

['mary', 'went', 'garden', '.', 
 'where', 'mary', '?', 
 'mary', 'carrying', 'apple', 'and', 'milk', '.', 
 'what', 'mary', 'carrying', '?', 'apple,milk']

However, 'is' and 'to' are missing. How can I keep them?

  • Please take a look at https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python and https://www.kaggle.com/alvations/basic-nlp-with-nltk – alvas Feb 06 '18 at 05:21
  • And https://stackoverflow.com/questions/2078915/a-regular-expression-to-exclude-a-word-string – alvas Feb 06 '18 at 05:23
  • See https://regex101.com/r/ail12t/1 – alvas Feb 06 '18 at 06:56

2 Answers

Your regex pattern simply does not capture the missing words.

You could see this with a regex tool, or by using RegexpTokenizer('\w+[\]|\w+[\,]\w+|\.|\?', True) with the additional gaps parameter set so that it shows the gaps instead of the tokens (doc).
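
For example (a quick sketch on a shortened version of the sentence; the second positional argument switches the tokenizer into gaps mode, and empty gaps are discarded by default), the words your pattern fails to capture show up in the gaps:

>>> from nltk.tokenize import RegexpTokenizer
>>> gap_tok = RegexpTokenizer(r'\w+[\]|\w+[\,]\w+|\.|\?', True)  # True -> gaps mode
>>> gaps = gap_tok.tokenize("mary went to garden. where is mary?")
>>> [g for g in gaps if g.strip()]  # the non-whitespace gaps are exactly the lost words
[' to ', ' is ']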

Update:
Here is a pattern that finds all the tokens as you specified:

\w+[\,]\w+|\w+|\.|\?

Remarks: When using regex alternations, it can be important to sort the alternatives by length (usually from longest to shortest); see the short demonstration below. The [\] does not make sense to me and is not syntactically correct.
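
A quick illustration of why the longer alternative should come first (plain re.findall, which is what the tokenizer uses under the hood):

>>> import re
>>> re.findall(r'\w+,\w+|\w+', 'apple,milk')   # longer alternative tried first
['apple,milk']
>>> re.findall(r'\w+|\w+,\w+', 'apple,milk')   # bare \w+ wins at 'apple', so the pair is split
['apple', 'milk']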

Online demo
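
Plugged back into the code from the question, the updated pattern should give the expected result (a quick sketch):

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+[\,]\w+|\w+|\.|\?')
>>> s = "mary went to garden. where is mary? mary is carrying apple and milk. what mary is carrying? apple,milk"
>>> tokenizer.tokenize(s)
['mary', 'went', 'to', 'garden', '.', 'where', 'is', 'mary', '?', 'mary', 'is', 'carrying', 'apple', 'and', 'milk', '.', 'what', 'mary', 'is', 'carrying', '?', 'apple,milk']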

wp78de

The RegexpTokenizer essentially just runs re.findall with the given regex; from https://github.com/nltk/nltk/blob/develop/nltk/tokenize/regexp.py#L78:

def tokenize(self, text):
    self._check_regexp()
    # If our regexp matches gaps, use re.split:
    if self._gaps:
        if self._discard_empty:
            return [tok for tok in self._regexp.split(text) if tok]
        else:
            return self._regexp.split(text)

    # If our regexp matches tokens, use re.findall:
    else:
        return self._regexp.findall(text)

Essentially, you're doing:

>>> import re
>>> rg = re.compile(r'\w+[\]|\w+[\,]\w+|\.|\?')
>>> sent = "mary went to garden. where is mary? mary is carrying apple and milk. what mary is carrying? apple,milk" 
>>> rg.findall(sent)
['mary', 'went', 'garden', '.', 'where', 'mary', '?', 'mary', 'carrying', 'apple', 'and', 'milk', '.', 'what', 'mary', 'carrying', '?', 'apple,milk']

Looking at the explanation of the regex \w+[\]|\w+[\,]\w+|\.|\?: https://regex101.com/r/ail12t/1/

The regex has 3 alternatives:

  • \w+[\]|\w+[\,]\w+:
    • The first part \w+ matches any word character (equal to [a-zA-Z0-9_]) one or more times
    • The second part [\]|\w+[\,] is a character class that matches a single character: any word character from \w, or one of the ], |, +, [ or , characters
    • The third part \w+ matches any word character (equal to [a-zA-Z0-9_]) one or more times
  • \.: finds the . symbol and matches it
  • \?: finds the ? symbol and matches it

The reason why two-character words get "gobbled" up is the first alternative, \w+[\]|\w+[\,]\w+: it needs \w+, then one character from the class, then \w+ again, so it only catches/finds words that have at least 3 characters.
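
You can check that minimum-length behaviour by running the first alternative on its own (illustration only):

>>> import re
>>> # only the first alternative of the original pattern
>>> re.findall(r'\w+[\]|\w+[\,]\w+', "went to garden is")
['went', 'garden']
>>> # 'to' and 'is' are skipped because there is no bare \w+ alternative to catch them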


Actually, I think the regex can be simplified further; you can easily break it down into small units and piece them together.

With \w+, it will simply match all the words and exclude the punctuation:

>>> rg = re.compile(r'\w+')
>>> sent = "mary went to garden. where is mary? mary is carrying apple and milk. what mary is carrying? apple,milk" 
>>> rg.findall(sent)
['mary', 'went', 'to', 'garden', 'where', 'is', 'mary', 'mary', 'is', 'carrying', 'apple', 'and', 'milk', 'what', 'mary', 'is', 'carrying', 'apple', 'milk']

Then, to catch the punctuation characters [[\]\,\-\|\.], simply add them as an alternative, separated by |, i.e.

>>> rg = re.compile(r'\w+|[[\]\,\-\|\.]')
>>> rg.findall(sent)
['mary', 'went', 'to', 'garden', '.', 'where', 'is', 'mary', 'mary', 'is', 'carrying', 'apple', 'and', 'milk', '.', 'what', 'mary', 'is', 'carrying', 'apple', ',', 'milk']
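
Note that this last pattern splits 'apple,milk' into three tokens. If you also want to keep 'apple,milk' as one word, as asked in the question, one option in the same spirit is to put a longer \w+,\w+ alternative in front of the bare \w+ (a sketch, with the punctuation class trimmed to just . and ? for brevity):

>>> rg = re.compile(r'\w+,\w+|\w+|[.?]')
>>> rg.findall(sent)
['mary', 'went', 'to', 'garden', '.', 'where', 'is', 'mary', '?', 'mary', 'is', 'carrying', 'apple', 'and', 'milk', '.', 'what', 'mary', 'is', 'carrying', '?', 'apple,milk']
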
alvas