
I wrote this regex to tokenize a text: "\b\w+\b"

But someone suggested that I change it to \b[^\W\d_]+\b

Can anyone explain to me why this second way (using negation) is better?

thanks

Giacomo Ciampoli

1 Answer

1

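The first one matches letters, digits and the underscore. Depending on the regex engine, this may include Unicode letters and digits. (The word boundaries are superfluous in this case, by the way: the greedy \w+ already consumes a whole run of word characters, so it always starts and ends at a boundary.)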

The second regex matches only letters (the negated class excludes non-word characters, digits and the underscore). Because of the word boundaries, it will only match a run of letters if that run is surrounded by non-word characters or the start/end of the string.
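To make the difference concrete, here is a small sketch using Python's built-in re module (the question does not say which engine is used, so Python and the sample text are just assumptions for illustration). It runs both patterns on the same input and also shows what happens when the word boundaries are dropped:

```python
import re

# Hypothetical sample text; "h\u00e9llo" is "héllo" with a precomposed é.
text = "foo_bar baz42 h\u00e9llo 123 word"

# \w covers letters, digits and the underscore
# (Unicode-aware by default for str patterns in Python 3).
word_tokens = re.findall(r"\b\w+\b", text)

# Dropping the boundaries changes nothing for \w+.
word_tokens_no_boundary = re.findall(r"\w+", text)

# Letters only; the boundaries reject runs that touch a digit or underscore.
letter_tokens = re.findall(r"\b[^\W\d_]+\b", text)

# Letters only, no boundaries: letter runs inside mixed tokens are extracted too.
letter_runs = re.findall(r"[^\W\d_]+", text)

print(word_tokens)    # ['foo_bar', 'baz42', 'héllo', '123', 'word']
print(letter_tokens)  # ['héllo', 'word']
print(letter_runs)    # ['foo', 'bar', 'baz', 'héllo', 'word']
```

So for the second pattern the boundaries do matter: with them, "foo_bar" and "baz42" are skipped entirely; without them, their letter parts are returned as separate tokens. Which of the two you want depends on how your tokens should look.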

If your regex engine supports them, you might want to use [[:alpha:]] or \p{L} instead (or [A-Za-z] if you only need to handle non-Unicode input) to make your intent clearer.
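As a rough illustration of the Unicode caveat, again assuming Python: its built-in re module supports neither \p{L} nor POSIX classes like [[:alpha:]], so the comparison below is only between [A-Za-z] and the negated class from the question. The sample text is made up.

```python
import re

# Hypothetical sample text: "héllo wörld abc" with precomposed é and ö.
text = "h\u00e9llo w\u00f6rld abc"

# ASCII letters only: accented letters split the words apart.
ascii_letters = re.findall(r"[A-Za-z]+", text)

# Negated class: any Unicode letter counts.
unicode_letters = re.findall(r"[^\W\d_]+", text)

print(ascii_letters)    # ['h', 'llo', 'w', 'rld', 'abc']
print(unicode_letters)  # ['héllo', 'wörld', 'abc']
```

In an engine without \p{L}, the negated class is therefore a reasonable way to say "any letter", even if it reads less clearly.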

Sebastian Proske