
I've been reading a book about NLP recently, and in one part the author shows how to tokenize a piece of text.

He then gives this code:

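from nltk.tokenize import RegexpTokenizer  # import not shown in the excerpt; RegexpTokenizer is NLTK's

sent0 = """Thomas Jefferson began building Monticello at the age of 26."""
tokenizer = RegexpTokenizer(r'\w+|$[0-9]+|\S+')
print(tokenizer.tokenize(sent0))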

What I don't understand is the meaning of the pattern r'\w+|$[0-9]+|\S+'. Can someone explain just that part?

5481393
    You can find some information here: \w matches any word character, | means "or", and [0-9] matches any digit ... https://stackoverflow.com/questions/1576789/in-regex-what-does-w-mean – Ghassen Jun 29 '19 at 22:02

1 Answer


Here's a great tool for interpreting RegEx: https://regex101.com/r/fLntOd/1

My response is excerpted directly from that page. The tool is a great playground for modifying your regex and seeing how its behavior changes in real time.

r'\w+|$[0-9]+|\S+'

\w+ matches any word character (equal to [a-zA-Z0-9_])

+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
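
One caveat when carrying this over to Python 3: the [a-zA-Z0-9_] equivalence is what regex101 reports for its default flavor. For str patterns, Python's \w also matches Unicode letters and digits unless re.ASCII is set. A minimal sketch using only the standard re module (the sample text is made up for illustration):

```python
import re

# Hypothetical sample text: "naïve café_1", written with escapes so the
# accented letters are single precomposed code points.
text = "na\u00efve caf\u00e9_1"

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_tokens = re.findall(r"\w+", text)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_tokens = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_tokens)  # ['naïve', 'café_1']
print(ascii_tokens)    # ['na', 've', 'caf', '_1']
```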

$ asserts position at the end of a line
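
Because $ is an anchor here, the middle alternative $[0-9]+ asks for digits after the end of a line, so it can never match; everything it might have caught falls through to \S+. My guess (an assumption, the question doesn't confirm it) is that the pattern was meant to use \$ to pick out dollar amounts. A small sketch with the standard re module, where the sample text and the escaped variant are hypothetical:

```python
import re

text = "It costs $5."

# Pattern as written in the question: "$" is an end-of-line anchor.
as_written = r"\w+|$[0-9]+|\S+"
# Hypothetical variant with the dollar sign escaped (an assumption about intent).
escaped = r"\w+|\$[0-9]+|\S+"

tokens_as_written = re.findall(as_written, text)
tokens_escaped = re.findall(escaped, text)

print(tokens_as_written)  # ['It', 'costs', '$5.']
print(tokens_escaped)     # ['It', 'costs', '$5', '.']

# The middle alternative on its own never matches, even in multiline mode.
anchor_only = re.findall(r"$[0-9]+", "26\n26", flags=re.MULTILINE)
print(anchor_only)  # []
```

With the anchor version, "$5." stays glued together because only \S+ can consume the dollar sign; with the escaped version the amount and the final period come out as separate tokens.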

[0-9]+ matches a single character present in the list [0-9]

+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)

\S+ matches any non-whitespace character (equal to [^\r\n\t\f\v ])

+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

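The leading r and the surrounding ' characters are not part of the regex: r'...' is Python's raw-string literal syntax, which keeps backslashes such as \w from being treated as string escapes. regex101 reports ' as a literal character only when the quotes are pasted in along with the pattern.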

Global pattern flags

g modifier: global. All matches (don't return after first match)

m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
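
To see the whole pattern at work without installing anything, here is a rough sketch using re.findall from the standard library. It assumes the tokenizer in the question, in its default mode, simply returns every match of the pattern, so treat it as an approximation of tokenizer.tokenize(sent0) and not as NLTK's actual implementation:

```python
import re

sent0 = """Thomas Jefferson began building Monticello at the age of 26."""
pattern = r"\w+|$[0-9]+|\S+"

# re.findall scans left to right and, at each position, takes the first
# alternative that matches: \w+ first, then $[0-9]+, then \S+.
tokens = re.findall(pattern, sent0)

print(tokens)
# ['Thomas', 'Jefferson', 'began', 'building', 'Monticello',
#  'at', 'the', 'age', 'of', '26', '.']
```

Every word and the number 26 are picked up by \w+, whitespace is skipped because no alternative matches it, and the final period is caught by \S+.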

g000m