The RegexpTokenizer essentially runs re.findall with the given regex (or re.split when gaps=True), see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/regexp.py#L78:
def tokenize(self, text):
    self._check_regexp()
    # If our regexp matches gaps, use re.split:
    if self._gaps:
        if self._discard_empty:
            return [tok for tok in self._regexp.split(text) if tok]
        else:
            return self._regexp.split(text)
    # If our regexp matches tokens, use re.findall:
    else:
        return self._regexp.findall(text)
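To see both branches in action without pulling in NLTK, here is a minimal pure-re sketch of the same logic (the tokenize function and its defaults below just mirror the snippet above; it is illustrative, not the actual NLTK class):

```python
import re

def tokenize(text, pattern, gaps=False, discard_empty=True):
    """Minimal sketch of the branch logic in RegexpTokenizer.tokenize."""
    regexp = re.compile(pattern)
    if gaps:
        # The regexp describes the gaps *between* tokens, so split on it:
        if discard_empty:
            return [tok for tok in regexp.split(text) if tok]
        return regexp.split(text)
    # The regexp describes the tokens themselves, so find them all:
    return regexp.findall(text)

print(tokenize("apple,milk", r'\w+'))             # ['apple', 'milk']
print(tokenize("apple,milk", r',', gaps=True))    # ['apple', 'milk']
```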
Essentially, you're doing:
>>> import re
>>> rg = re.compile(r'\w+[\]|\w+[\,]\w+|\.|\?')
>>> sent = "mary went to garden. where is mary? mary is carrying apple and milk. what mary is carrying? apple,milk"
>>> rg.findall(sent)
['mary', 'went', 'garden', '.', 'where', 'mary', '?', 'mary', 'carrying', 'apple', 'and', 'milk', '.', 'what', 'mary', 'carrying', '?', 'apple,milk']
Looking at the explanation of the regex \w+[\]|\w+[\,]\w+|\.|\? at https://regex101.com/r/ail12t/1/, the regex has 3 alternatives (the | inside the character class is a literal, so it doesn't split the alternation):

- \w+[\]|\w+[\,]\w+
- \.
- \?
The reason why two-character words get "gobbled" up is the multiple \w+ in the first alternative, \w+[\]|\w+[\,]\w+: it needs \w+, then one character from the class (which itself contains \w), then another \w+. That means this alternative only catches/finds words that have at least 3 characters.
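You can check this by running the first alternative on its own (the sample sentence here is a made-up example):

```python
import re

# The first alternative by itself: \w+, then one char from the class, then \w+ again.
first_alt = re.compile(r'\w+[\]|\w+[\,]\w+')

# Two-character words like "to" and "is" can't satisfy all three parts:
print(first_alt.findall("to is and mary"))  # ['and', 'mary']
```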
Actually, I think the regex can be simplified further: break it down into small units and piece them back together.
With \w+
, it will simply match all words and exclude punctuation:
>>> rg = re.compile(r'\w+')
>>> sent = "mary went to garden. where is mary? mary is carrying apple and milk. what mary is carrying? apple,milk"
>>> rg.findall(sent)
['mary', 'went', 'to', 'garden', 'where', 'is', 'mary', 'mary', 'is', 'carrying', 'apple', 'and', 'milk', 'what', 'mary', 'is', 'carrying', 'apple', 'milk']
Then to catch the punctuation [[\]\,\-\|\.]
, simply add the character class as another alternative separated by |
, i.e.
>>> rg = re.compile(r'\w+|[[\]\,\-\|\.]')
>>> rg.findall(sent)
['mary', 'went', 'to', 'garden', '.', 'where', 'is', 'mary', 'mary', 'is', 'carrying', 'apple', 'and', 'milk', '.', 'what', 'mary', 'is', 'carrying', 'apple', ',', 'milk']
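The character class also covers the other punctuation marks you listed ([, ], -, |), each returned as its own token; a quick check on a made-up input string:

```python
import re

# Words, or any single punctuation char from the class [ ] , - | .
rg = re.compile(r'\w+|[[\]\,\-\|\.]')

print(rg.findall('apple-milk | [fruit]'))
# ['apple', '-', 'milk', '|', '[', 'fruit', ']']
```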