negation classes regex

Question

I wrote this regex to tokenize a text: "\b\w+\b"

but someone suggested that I change it to \b[^\W\d_]+\b

Can anyone explain to me why this second way (using negation) is better?

thanks

`[^\W\d_]` excludes digits and underscores. Do you want to exclude them or not? — Ry-, Oct 21 '17 at 20:44

Sebastian Proske · Accepted Answer · 2017-10-21T20:56:06.147

The first one matches runs of letters, digits and underscores. Depending on the regex engine, this may include Unicode letters and digits. (The word boundaries are superfluous in this case btw, since a greedy \w+ already consumes the whole run of word characters.)

The second regex matches only letters (the class excludes non-word characters, digits and the underscore). Due to the word boundaries, it will only match a run of letters if it is surrounded by non-word characters or the start/end of the string, so the letters in something like foo_bar or baz42 are not matched at all.
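To make the difference concrete, here is a small sketch using Python's built-in re module (the question doesn't say which engine is in use, so Python and the sample text are assumptions chosen for illustration):

```python
import re

# Sample text made up for illustration: mixes underscores, digits and a non-ASCII letter.
text = "foo_bar baz42 straße 2017 x"

# Original regex: runs of letters, digits and underscores.
word_tokens = re.findall(r"\b\w+\b", text)
# ['foo_bar', 'baz42', 'straße', '2017', 'x']

# Same thing without the word boundaries: identical result.
word_tokens_no_b = re.findall(r"\w+", text)

# Suggested regex: letters only, and only when the whole word consists of letters.
letter_tokens = re.findall(r"\b[^\W\d_]+\b", text)
# ['straße', 'x']

# Dropping \b here DOES change the result: letter runs inside mixed words now match.
letter_runs = re.findall(r"[^\W\d_]+", text)
# ['foo', 'bar', 'baz', 'straße', 'x']
```

So neither version is "better" in general; they tokenize differently, and which one you want depends on whether digits and underscores should count as part of a token.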

If your regex engine supports them, you might want to use [[:alpha:]] or \p{L} (or [A-Za-z] if you only need ASCII letters) instead to make your intent clearer.
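As a rough illustration of the ASCII vs. Unicode point, again assuming Python: its built-in re module does not support \p{L} or [[:alpha:]], so the sketch below only compares [A-Za-z] with the negated class. The sample text is made up.

```python
import re

# Sample text made up for illustration.
text = "straße cafe 42"

# ASCII-only letters: the non-ASCII letter splits the word.
ascii_letters = re.findall(r"[A-Za-z]+", text)
# ['stra', 'e', 'cafe']

# Negated class with Python 3's default Unicode-aware \W and \d.
unicode_letters = re.findall(r"[^\W\d_]+", text)
# ['straße', 'cafe']

# With re.ASCII, \W and \d are ASCII-only, so the class behaves like [A-Za-z].
negated_ascii = re.findall(r"[^\W\d_]+", text, flags=re.ASCII)
# ['stra', 'e', 'cafe']
```

In engines that do have \p{L} or [[:alpha:]], those spell out "letters" directly, which is the readability gain the answer is pointing at.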
