
I am looking at this example:

>>> from nltk.tokenize import RegexpTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']
>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

Is there any difference between the RegexpTokenizer pattern syntax and Python regular expressions? For example, what does

\$[\d\.]+

stand for? From the Python `re` documentation we learn that `\d` matches any decimal digit.
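
If I read the pattern as an ordinary Python regular expression (an assumption on my part; the sample sentence below is just my own test string), the `\$[\d\.]+` piece matches a literal dollar sign followed by one or more digits or dots:

>>> import re
>>> # \$ is a literal "$"; [\d\.]+ is one or more characters that are digits or literal dots
>>> re.findall(r'\$[\d\.]+', 'Good muffins cost $3.88 in New York.')
['$3.88']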

    Do you mean the [NLTK `RegexpTokenizer`](http://www.nltk.org/api/nltk.tokenize.html?highlight=regexptokenizer#nltk.tokenize.regexp.RegexpTokenizer)? Do you have a reason to think they're using a different regex syntax to `re` (and every other language that uses the standard syntax)? It seems pretty unlikely you'd use some other syntax and still call it regex, as that would lead to enormous confusion. – jonrsharpe May 13 '17 at 10:44
  • @jonrsharpe kudos for linking the uber-duped thread!! – alvas May 13 '17 at 12:19
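
Following up on jonrsharpe's comment above: a quick sanity check (just a sketch, reusing `s` and the pattern from the example in the question) suggests the tokenizer behaves exactly like `re.findall` with the same pattern:

>>> import re
>>> from nltk.tokenize import RegexpTokenizer
>>> pattern = r'\w+|\$[\d\.]+|\S+'
>>> # same pattern through NLTK and through plain re, compared token-for-token
>>> RegexpTokenizer(pattern).tokenize(s) == re.findall(pattern, s)
True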

0 Answers