Why does regex alternation (OR) appear to be order-sensitive?

Asked Apr 20 '18 at 19:23


import nltk

print(nltk.regexp_tokenize('That U.S.A. poster-print costs $12.40...13.10', r"((?:(?:[A-Z]\.)+)|(?:\w+(?:-\w+)*)|(?:\d+(?:\.\d+)?))"))

Outputs:

['That', 'U.S.A.', 'poster-print', 'costs', '12', '40', '13', '10']

And with the order of the alternatives changed (the number pattern now comes before the word pattern):

print(nltk.regexp_tokenize('That U.S.A. poster-print costs $12.40...13.10', r"((?:(?:[A-Z]\.)+)|(?:\d+(?:\.\d+)?)|(?:\w+(?:-\w+)*))"))

Outputs:

['That', 'U.S.A.', 'poster-print', 'costs', '12.40', '13.10']

Why does the order matter in this case?

Maciej Wasilewski

Because `\w` also matches what `\d` matches. – Wiktor Stribiżew Apr 20 '18 at 19:25
Regexes just work this way: at each position, the first alternative in an or-ed regex that matches wins. – Michael Butscher Apr 20 '18 at 19:26
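
The two comments can be checked with a minimal sketch that uses only the standard `re` module. It assumes that `re.findall` behaves like `nltk.regexp_tokenize` for this purpose (as far as I know, NLTK's tokenizer delegates to `re`). The names `ABBREV`, `WORD` and `NUMBER` are made up for illustration and are not from the question.

```python
import re

text = "That U.S.A. poster-print costs $12.40...13.10"

# Illustrative names for the three alternatives in the question's pattern.
ABBREV = r"(?:[A-Z]\.)+"    # abbreviations such as U.S.A.
WORD = r"\w+(?:-\w+)*"      # words, optionally hyphenated
NUMBER = r"\d+(?:\.\d+)?"   # integers or decimals

# \w matches digits too, so WORD can consume "12" on its own.
digits_are_word_chars = re.fullmatch(r"\w+", "12") is not None

# Alternatives are tried left to right at each position; the first one
# that matches is taken, even if a later one would match a longer string.
word_first = re.findall("|".join([ABBREV, WORD, NUMBER]), text)
number_first = re.findall("|".join([ABBREV, NUMBER, WORD]), text)

print(digits_are_word_chars)  # True
print(word_first)    # ['That', 'U.S.A.', 'poster-print', 'costs', '12', '40', '13', '10']
print(number_first)  # ['That', 'U.S.A.', 'poster-print', 'costs', '12.40', '13.10']
```

With `WORD` first, the engine takes "12" as a word and stops at the dot, because `.` is not a word character; `NUMBER` is never tried at that position. Putting the more specific `NUMBER` alternative before `WORD` lets it claim "12.40" as a whole.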
