import nltk
print(nltk.regexp_tokenize('That U.S.A. poster-print costs $12.40...13.10', r"((?:(?:[A-Z]\.)+)|(?:\w+(?:-\w+)*)|(?:\d+(?:\.\d+)?))"))

Outputs:

['That', 'U.S.A.', 'poster-print', 'costs', '12', '40', '13', '10']

And with the order of the alternatives inside the parentheses changed (the number pattern now comes before the word pattern):

print(nltk.regexp_tokenize('That U.S.A. poster-print costs $12.40...13.10', r"((?:(?:[A-Z]\.)+)|(?:\d+(?:\.\d+)?)|(?:\w+(?:-\w+)*))"))

Outputs:

['That', 'U.S.A.', 'poster-print', 'costs', '12.40', '13.10']

Why does the order matter in this case?
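
For reference, the same difference shows up with the standard `re` module alone, so it does not seem to be specific to NLTK. The sketch below is only an illustration: the names `abbrev`, `word`, `number`, `words_first` and `numbers_first` are made up here, and the patterns are the three alternatives from above without the outer capturing group. As far as I understand Python's regex engine, alternatives are tried left to right and the first one that matches at a position is taken, not the longest one. Since `\w` also matches digits, the word pattern can consume `12` before the number pattern is ever tried.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...13.10'

# The three alternatives from the question, named for readability
# (these names are illustrative, not part of NLTK).
abbrev = r'(?:[A-Z]\.)+'       # abbreviations such as U.S.A.
word = r'\w+(?:-\w+)*'         # words with optional internal hyphens
number = r'\d+(?:\.\d+)?'      # integers or decimals

# Same alternatives, two different orders.
words_first = re.compile('|'.join([abbrev, word, number]))
numbers_first = re.compile('|'.join([abbrev, number, word]))

tokens_words_first = words_first.findall(text)
tokens_numbers_first = numbers_first.findall(text)

print(tokens_words_first)
# ['That', 'U.S.A.', 'poster-print', 'costs', '12', '40', '13', '10']
print(tokens_numbers_first)
# ['That', 'U.S.A.', 'poster-print', 'costs', '12.40', '13.10']
```

With the word pattern first, `\w+` matches `12` and stops at the dot, so the number alternative never gets a chance at that position. With the number pattern first, `12.40` is matched as a whole before the word pattern is considered.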
