Can someone explain to me the meaning of all the code inside these parentheses: RegexpTokenizer(r'\w+|$[0-9]+|\S+')?

Question

I've been reading a book about NLP recently, and in one part the author shows how to tokenize a piece of text.

He then gives this code:

from nltk.tokenize import RegexpTokenizer

sent0 = """Thomas Jefferson began building Monticello at the age of 26."""
tokenizer = RegexpTokenizer(r'\w+|$[0-9]+|\S+')
print(tokenizer.tokenize(sent0))

What I don't understand is the meaning of r'\w+|$[0-9]+|\S+'. Can someone explain just that to me?

You can find some information here: \w means any word character, | means "or", and [0-9] means any digit: https://stackoverflow.com/questions/1576789/in-regex-what-does-w-mean — Ghassen, Jun 29 '19 at 22:02

score 1 · Accepted Answer · answered Jun 29 '19 at 22:18

Here's a great tool for interpreting RegEx: https://regex101.com/r/fLntOd/1

My response is excerpted directly from this page. The tool is a great playground for modifying your regex and seeing how its behavior changes in real time.

r'\w+|$[0-9]+|\S+'

The pattern is made of three alternatives separated by | ("or"). At each position the engine tries them from left to right: \w+, then $[0-9]+, then \S+.

\w+ matches any word character (equal to [a-zA-Z0-9_]).
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy).

$ asserts position at the end of a line. Because it is not escaped, it is an anchor here, not a literal dollar sign.

[0-9]+ matches a single character present in the list [0-9].
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy).
    0-9 matches a single character in the range between 0 (index 48) and 9 (index 57).

\S+ matches any non-whitespace character (equal to [^\r\n\t\f\v ]).
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy).

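' matches the character ' literally. The tool presumably reports this only because the quotes of the Python string literal were pasted in along with the pattern. In the Python code, r'...' just delimits a raw string, so the quotes are not part of the regex.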

Global pattern flags
    g modifier: global. All matches (don't return after first match).
    m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string).
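To see the three alternatives in action, here is a minimal sketch that uses only the standard re module instead of NLTK. It assumes that RegexpTokenizer, in its default (non-gap) mode, behaves like re.findall with the flags re.UNICODE | re.MULTILINE | re.DOTALL, which is my understanding of its defaults. The names PATTERN, ESCAPED, FLAGS, tokenize and price are made up for this illustration, and the escaped variant \$[0-9]+ is only a guess at what the pattern may have been meant to do, not something the book or the answer states.

```python
import re

# Pattern from the question, unchanged.
PATTERN = r'\w+|$[0-9]+|\S+'
# Hypothetical variant with the dollar sign escaped (an assumption about intent).
ESCAPED = r'\w+|\$[0-9]+|\S+'

# Flags that RegexpTokenizer is assumed to apply by default.
FLAGS = re.UNICODE | re.MULTILINE | re.DOTALL


def tokenize(pattern, text):
    """Return all non-overlapping matches, scanning left to right."""
    return re.findall(pattern, text, FLAGS)


sent0 = "Thomas Jefferson began building Monticello at the age of 26."
tokens = tokenize(PATTERN, sent0)
print(tokens)
# ['Thomas', 'Jefferson', 'began', 'building', 'Monticello',
#  'at', 'the', 'age', 'of', '26', '.']

# The middle alternative on its own never matches: after the end-of-line
# anchor, the next character is either a newline or nothing, never a digit.
unreachable = re.findall(r'$[0-9]+', "26\n26", FLAGS)
print(unreachable)  # []

# With the unescaped $, a price is only picked up by \S+, punctuation included.
price = "It costs $25."
print(tokenize(PATTERN, price))  # ['It', 'costs', '$25.']
print(tokenize(ESCAPED, price))  # ['It', 'costs', '$25', '.']
```

In the sample sentence every word and the number 26 come from \w+, and the final period comes from \S+. The $[0-9]+ alternative contributes nothing, which matches the explanation above that an unescaped $ is an end-of-line anchor.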
