
I am looking at this example:

>>> from nltk.tokenize import RegexpTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']
>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

Is there any difference between the RegexpTokenizer pattern syntax and Python regular expressions? For example, what does

\$[\d\.]+

stand for? From the Python `re` documentation we learn that `\d` matches any decimal digit.
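
If I read the pattern as an ordinary Python regular expression (an assumption on my part; the sample sentence below is just my own test string), the `\$[\d\.]+` piece matches a literal dollar sign followed by one or more digits or dots:

>>> import re
>>> # \$ is a literal "$"; [\d\.]+ is one or more characters that are digits or literal dots
>>> re.findall(r'\$[\d\.]+', 'Good muffins cost $3.88 in New York.')
['$3.88']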

    Do you mean the [NLTK `RegexpTokenizer`](http://www.nltk.org/api/nltk.tokenize.html?highlight=regexptokenizer#nltk.tokenize.regexp.RegexpTokenizer)? Do you have a reason to think they're using a different regex syntax to `re` (and every other language that uses the standard syntax)? It seems pretty unlikely you'd use some other syntax and still call it regex, as that would lead to enormous confusion. – jonrsharpe May 13 '17 at 10:44
  • @jonrsharpe kudos for linking the uber-duped thread!! – alvas May 13 '17 at 12:19
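
Following up on jonrsharpe's comment above: a quick sanity check (just a sketch, reusing `s` and the pattern from the example in the question) suggests the tokenizer behaves exactly like `re.findall` with the same pattern:

>>> import re
>>> from nltk.tokenize import RegexpTokenizer
>>> pattern = r'\w+|\$[\d\.]+|\S+'
>>> # same pattern through NLTK and through plain re, compared token-for-token
>>> RegexpTokenizer(pattern).tokenize(s) == re.findall(pattern, s)
True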

0 Answers