12

I'm currently writing a library for matching specific words in content.

Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.

A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.

I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.

Take the following,

preg_match("/(^|\b)@nimal/i", "something@nimal", $match);
preg_match("/(^|\b)@nimal/i", "something!@nimal", $match);

In the statements above I would expect the following results,

> false
> 1 (@nimal)

But the result is instead the opposite,

> 1 (@nimal)
> false

In the first, I would expect it to fail as the group will eat the @, leaving nimal to match against @nimal, which obviously it doesn't. Instead, the group matches an empty string, so @nimal is matched, meaning @ is considered to be part of the word.

In the second, I would expect the group to eat the ! leaving @nimal to match the rest (which it should). Instead, it appears to combine the ! and @ together to form a word, which is confirmed by the following matching,

preg_match("/g\b!@\bn/i", "something!@nimal", $match);

Any ideas why the regular expression does this?

I'd just love a page that clearly documents how word boundaries are determined; I can't find one for the life of me.

TylerH
Stephen Melrose

3 Answers

21

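The word boundary \b matches at a position where there is a change from a \w (a word character) to a \W (a non-word character), or the other way around. It matches a position, not a character. Your pattern needs a \b immediately before the @, which is a \W character, so it can only match when the character before the @ is a word character.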

something@nimal
        ^^

==> Match because of the word boundary between g and @.

something!@nimal
         ^^ 

==> NO match because between ! and @ there is no word boundary, both characters are \W
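
The two cases above can be checked by listing every position at which \b matches. This is a minimal sketch, assuming the default PCRE definition of \w (letters, digits and underscore); wordBoundaryOffsets() is a hypothetical helper introduced here for illustration, not a PHP built-in.

```php
<?php
// Hypothetical helper: returns the byte offsets at which \b matches.
function wordBoundaryOffsets(string $subject): array
{
    // \b is zero-width, so each match is an empty string; only its offset matters.
    preg_match_all('/\b/', $subject, $matches, PREG_OFFSET_CAPTURE);
    return array_column($matches[0], 1);
}

print_r(wordBoundaryOffsets('something@nimal'));   // offsets 0, 9, 10, 15
print_r(wordBoundaryOffsets('something!@nimal'));  // offsets 0, 9, 11, 16

// The pattern needs a boundary immediately before "@".
var_dump(preg_match('/\b@nimal/i', 'something@nimal'));   // int(1): boundary at 9, "@" at 9
var_dump(preg_match('/\b@nimal/i', 'something!@nimal'));  // int(0): "@" at 10, no boundary there
```

In the second string the boundaries sit on either side of the "!@" pair (offsets 9 and 11), which is also why /g\b!@\bn/i from the question matches.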

Bart Kiers
stema
  • As @hakre said in his comment, this is how PCRE does word boundaries ([src](http://php.net/manual/en/regexp.reference.escape.php)). Thank you for the clarification. – Stephen Melrose Jun 30 '11 at 08:27
  • Yup, this is the correct answer. Note that I took the liberty to emphasize the fact that `\b` does not match a character, but a position. Feel free to roll back if the "edit" is not to your liking. – Bart Kiers Jun 30 '11 at 08:28
  • @Stephen Melrose, yes, hakre posted the correct link, but his-her interpretation seems a bit off however (at least, I get the impression). No offense to hakre, of course. – Bart Kiers Jun 30 '11 at 08:30
  • It's funny how easy it is to get `\b` wrong, and assume it matches between non-word characters. It can happen even if you know better. I once made the same mistake *one hour after reading a whole article about the problem.* – Justin Morgan - On strike Sep 08 '11 at 14:45
  • In simple words, `\b` is a position with a word on either left or right side of it. – Jay Dadhania Jan 25 '21 at 21:15
3

One problem I've encountered doing similar matching is words like can't and it's, where the apostrophe creates a word/non-word boundary (as it is matched by \W and not \w). If that is likely to be a problem for you, you should treat the apostrophe (and all of the variants such as ’ and ‘ that sometimes appear) as part of the word. Note that a class such as [\b^'] will not do this, because inside a character class \b means a backspace rather than a word boundary.

You might also experience problems with UTF-8 characters that are genuinely part of the word (i.e. what we humans mean by a word). For example, test your regex against a word such as Svašek in whatever encoding you use.

It is therefore often easier when parsing normal "linguistic" text to look for "linguistic" boundaries such as whitespace characters (not just literal spaces, but the full class including newlines and tabs), commas, colons, full stops, etc. (and angle brackets if you are parsing HTML). YMMV.
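
As a rough sketch of the apostrophe case, the boundary test can be written with lookarounds over a custom class instead of \b. The assumption here is that only the ASCII apostrophe should count as part of a word, and $wordChar is an illustrative name rather than anything from PCRE.

```php
<?php
// Assumption: a "word character" is \w plus the ASCII apostrophe.
$wordChar = "[\\w']";

// Match "can" only when it is not touching another word character on either side.
$pattern = "/(?<!$wordChar)can(?!$wordChar)/i";

var_dump(preg_match('/\bcan\b/i', "I can't swim"));  // int(1): \b matches between "n" and "'"
var_dump(preg_match($pattern, "I can't swim"));      // int(0): "'" now counts as part of the word
var_dump(preg_match($pattern, 'Yes, I can.'));       // int(1): "." is still a non-word character
```

Typographic apostrophes such as ’ are multibyte in UTF-8, so adding them to the class would also require the u modifier on the pattern.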

Coder
  • Putting `\b` in a character class like `[\b^']` will result in matching a backspace (ASCII 8). Instead we need to expand the `\b` shortcut to `(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))`, then replace every `\w` in it with `[\w']` so that the apostrophe counts as a word character. See https://stackoverflow.com/a/12712840/229088 . – sglessard Nov 30 '21 at 17:47
0

@ is not a word character, although in your locale it probably is. By default, a "word" character is any letter or digit or the underscore character (Source), so @ is matched by \W and not by \w. As linked, any \w\W or \W\w pair marks a \b position, so in the OP's regex it is always the word boundary alternative that matches.

The following is similar to your regexes, with the difference that a is used instead of @. The start of the string is also a word boundary when the first character is a word character, so there is no need to specify ^ separately:

$r = preg_match("/\b(animal)/i", "somethinganimal", $match);
var_dump($r, $match);

$r = preg_match("/\b(animal)/i", "something!animal", $match);
var_dump($r, $match);

Output:

int(0)
array(0) {
}
int(1)
array(2) {
  [0]=>
  string(6) "animal"
  [1]=>
  string(6) "animal"
}
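
For the feature in the question (a word that must start and/or end a word), one possible approach is to replace \b with lookarounds, which also behave as the OP expected when the word itself begins with a non-word character such as @. This is only a sketch: buildWordPattern() and its parameters are hypothetical names, and it assumes "starts a word" means "is not preceded by a \w character".

```php
<?php
// Hypothetical helper sketching the feature described in the question.
function buildWordPattern(string $word, bool $mustStart, bool $mustEnd): string
{
    // Escape regex metacharacters and the "/" delimiter in the literal word.
    $pattern = preg_quote($word, '/');
    if ($mustStart) {
        // No word character immediately before the match.
        $pattern = '(?<!\w)' . $pattern;
    }
    if ($mustEnd) {
        // No word character immediately after the match.
        $pattern .= '(?!\w)';
    }
    return '/' . $pattern . '/i';
}

$startsWord = buildWordPattern('@nimal', true, false);
var_dump(preg_match($startsWord, 'something@nimal'));   // int(0): "g" precedes "@"
var_dump(preg_match($startsWord, 'something!@nimal'));  // int(1): "!" precedes "@"

$cat = buildWordPattern('cat', true, false);
var_dump(preg_match($cat, 'catering'));  // int(1)
var_dump(preg_match($cat, 'ducat'));     // int(0)
```

Unlike \b, the lookbehind does not care whether the first character of the word is itself a word character, so the result depends only on what surrounds the match.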
hakre
  • @Bart Kiers: The PHP regexes refer to PCRE and `\b` is described as: *"A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively."* [src](http://php.net/manual/en/regexp.reference.escape.php) - \w and \W are described therein as well. And sure, `@` need not be, but can be, part of `\w` or `\W`. – hakre Jun 30 '11 at 08:14
  • @Hakre, I'm not sure if you meant it, but your answer suggests that `\b` matches a `@`, which is wrong: a `\b` matches a position, not a character. – Bart Kiers Jun 30 '11 at 08:20
  • You are correct in that `@` is not a word character and shouldn't be matched by `\b`, and that is how I understood it _should_ work. But alas, in PHP they decided to make it work differently :/ – Stephen Melrose Jun 30 '11 at 08:29
  • @Stephen, no, `\b` never matches a character. It matches the empty string between two characters. Note that PHP's interpretation of `\b` does not differ from most other popular regex implementations, AFAIK. Perl, Java, Python, etc. all do it like this. – Bart Kiers Jun 30 '11 at 08:31
  • By `@` matching `\b`, I meant detecting that `@` creates a boundary, not matches a physical character. – Stephen Melrose Jun 30 '11 at 08:56
  • @Stephen, fair enough, but still, it's not `@` alone that causes the `\b` to match. It's the `\w` _followed by_ `@` that causes the `\b` to match (in between the two!). – Bart Kiers Jun 30 '11 at 10:30
  • I can't help myself, but I thought the wording as well as the linked resources were clear. If you compile your PCRE lib to include `@` as a `word` character (or your locale suggests it), **then** it should work as per the OP's initial thoughts. However, I have not said that this is the case. See the default behavior explained. Will revise the wording. – hakre Jun 30 '11 at 12:17
  • And mind that *locale* was in the meaning of the OP's own domain. If OP considers that `@` is part of a `word` in his own understanding, then this just does not work because the default definition in the regex compiler does not match the OP's locale. That btw is the default PCRE behavior, regardless of PHP. – hakre Jun 30 '11 at 12:29