
I have a regex pattern to find multiple occurrences of a given word (supplied by the Database) in a text. The pattern also ignores the word if it is within a link.

The pattern has been working fine until now, but I have encountered a problem. If the word starts with an accented character, it will not match. An accented character elsewhere in the word is not a problem; the match only fails when the accent is on the first character.

To view the problem go to RegexPal and paste this in the first box:

\bétest(?![^<]*</a>)\b

and this in the second box:

herp derp derp test herp derp derp derp étest herp derp derp derp <a>test</a>

You can remove the "é" to see what it is supposed to return.

Giacomo1968

1 Answer


\b asserts a boundary between a word character and a non-word character. Put another way, it asserts something akin to (?<=\w)(?=\W)|(?<=\W)(?=\w) (lookbehinds aren't available in older JavaScript engines, but this is just a demonstration).

In JavaScript, \w covers only [A-Za-z0-9_], so é is not a word character. Therefore, there is no word boundary between it and the preceding space.

However, there is a word boundary between it and the t:

"étest".match(/é\btest/)
> ["étest"]
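One possible workaround, sketched here rather than offered as the definitive fix, is to replace \b with explicit lookarounds that treat any Unicode letter, mark, digit, or underscore as a word character. This assumes an engine that supports ES2018 lookbehind and Unicode property escapes (the u flag). The helper name buildWordPattern is made up for illustration, and the link check is the one from the question.

```javascript
// Hypothetical helper (not from the question): builds a pattern for `word`
// that uses Unicode-aware lookarounds instead of \b, and still skips a
// match when a closing </a> follows before any other tag.
function buildWordPattern(word) {
  // Escape regex metacharacters in the database-supplied word.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  return new RegExp(
    "(?<!" + wordChar + ")" + escaped + "(?!" + wordChar + ")(?![^<]*</a>)",
    "gu"
  );
}

const text =
  "herp derp derp test herp derp derp derp étest herp derp derp derp <a>test</a>";

// The original pattern: \b cannot match between the space and "é".
const withBoundary = text.match(/\bétest(?![^<]*<\/a>)\b/g);
console.log(withBoundary); // null

// The lookaround version finds the accented word...
const withLookarounds = text.match(buildWordPattern("étest"));
console.log(withLookarounds); // ["étest"]

// ...and for "test" it skips both the tail of "étest" and the linked "test".
const plainWord = text.match(buildWordPattern("test"));
console.log(plainWord); // ["test"]
```

The trade-off is that the pattern no longer works in engines without lookbehind support; there, capturing the preceding character in a group and checking it in code would be the usual fallback.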
Niet the Dark Absol