I'm currently writing a library for matching specific words in content.
Essentially, it works by compiling words into regular expressions and running content through them.
A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word `cat`. I specify that it must start a word, so `catering` will match as `cat` is at the start, but `ducat` won't match as `cat` doesn't start the word.
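To make the intended behaviour concrete, here is a minimal sketch of the feature. The function name `buildPattern` and its parameters are hypothetical, made up for illustration, and are not the library's actual API; it simply assumes `\b` is used for the start/end constraints.

```php
<?php
// Hypothetical helper (illustration only, not the real library API):
// compiles $word into a case-insensitive pattern, optionally requiring
// a word boundary before and/or after it.
function buildPattern(string $word, bool $mustStart, bool $mustEnd): string
{
    // Escape regex metacharacters and the "/" delimiter in the word.
    $pattern = preg_quote($word, '/');
    if ($mustStart) {
        $pattern = '\b' . $pattern;
    }
    if ($mustEnd) {
        $pattern .= '\b';
    }
    return '/' . $pattern . '/i';
}

$startsWord = buildPattern('cat', true, false); // "/\bcat/i"

var_dump(preg_match($startsWord, 'catering')); // int(1): "cat" starts the word
var_dump(preg_match($startsWord, 'ducat'));    // int(0): "cat" is at the end
```

This works as intended for plain alphanumeric words like `cat`; the trouble described next only shows up once the word contains symbols.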
I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.
Take the following,
preg_match("/(^|\b)@nimal/i", "something@nimal", $match);
preg_match("/(^|\b)@nimal/i", "something!@nimal", $match);
In the statements above I would expect the following results,
> false
> 1 (@nimal)
But the result is instead the opposite,
> 1 (@nimal)
> false
In the first, I would expect it to fail, as the group will eat the `@`, leaving `nimal` to match against `@nimal`, which obviously it doesn't. Instead, the group matches an empty string, so `@nimal` is matched, meaning `@` is considered to be part of the word.
In the second, I would expect the group to eat the `!`, leaving `@nimal` to match the rest (which it should). Instead, it appears to combine the `!` and `@` together to form a word, which is confirmed by the following matching,
preg_match("/g\b!@\bn/i", "something!@nimal", $match);
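For anyone who wants to reproduce this, here is a small self-contained script covering the three calls above. The `$cases` and `$results` names are only for this sketch, and the destructuring in `foreach` assumes PHP 7.1 or later. The patterns are single-quoted here, which leaves `\b` untouched just as the double-quoted versions do.

```php
<?php
// Each entry is [pattern, subject], taken from the calls in the question.
$cases = [
    ['/(^|\b)@nimal/i', 'something@nimal'],
    ['/(^|\b)@nimal/i', 'something!@nimal'],
    ['/g\b!@\bn/i',     'something!@nimal'],
];

$results = [];
foreach ($cases as [$pattern, $subject]) {
    // preg_match() returns 1 when the pattern matches.
    $matched = preg_match($pattern, $subject, $match) === 1;
    $results[] = $matched;
    printf("%s on %s => %s\n", $pattern, $subject, $matched ? 'match' : 'no match');
}
// Observed: match, no match, match
```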
Any ideas why the regular expression engine does this?
I'd love a page that clearly documents how word boundaries are determined; I can't find one for the life of me.