Why do special characters like = or " break a PHP regex when using the \b word boundary?

Question

This is a follow-up after reading How to specify "Space or end of string" and "space or start of string"?

That question describes ways to match a word in a phrase, and I can even add a few other solutions. But as soon as a = or " is added to the search term, it stops working. Why?

I am going to search for stackoverflow and replace it with OK using preg_replace():

preg_replace('/\bstackoverflow\b/', 'OK', $input_line)

input:
1: stackoverflow xxx
2: xxx stackoverflow xxx
3: xxx stackoverflow
result:
1: OK xxx
2: xxx OK xxx
3: xxx OK

Now, if I change it to match stackoverflow="", it stops working.

preg_replace('/\bstackoverflow=""\b/', 'OK', $input_line)

input:
1: stackoverflow="" xxx
2: xxx stackoverflow="" xxx
3: xxx stackoverflow=""
result:
1: stackoverflow="" xxx
2: xxx stackoverflow="" xxx
3: xxx stackoverflow=""

The same happens if I use /\bstackoverflow=\b/ or /\bstackoverflow"\b/ as my regex. I already checked the manual to see whether = or " are special characters; they are not. I even tried /\bstackoverflow\=\"\"\b/ anyway.

Why is that?

In that example, removing \b also solves it, but then it also matches nostackoverflow=""not, which I do not want.

I also tried alternatives to \b such as [ ^] and ( |^). Interestingly, [ ^] (intended as space or beginning of line) does not work for the beginning of a line, only for a space. But ( |^) works fine for both.

It's the `\b` that messes it up; if you only use `/stackoverflow=""/` it works — Blag, Nov 23 '15 at 23:30
@Blag I also mention that in the question, but that will match the search term in the middle of another term, which is not desirable. — gcb, Nov 24 '15 at 00:01
Yes, I forgot the `(\s|^)` and `(\s|$)`, but since miken32 posted his answer, I just +1'd him without editing this ;) — Blag, Nov 24 '15 at 00:10

miken32 · Answer 1 · 2015-11-24T00:08:25.740

6

The problem is your use of \b, which is a "word boundary." It's a placeholder for (^\w|\w$|\W\w|\w\W), where \w is a "word" character [A-Za-z0-9_] and \W is the opposite. A " is not a "word" character, and neither is the space or end of string that follows it, so there is no word character on either side of the trailing \b and the boundary condition is not met.

Try using \s instead, which matches any whitespace character:

(?:^|\s)stackoverflow=""(?:\s|$)

Most characters inside a character class are taken literally. The main exceptions are ^, which negates the class when it is the first character, - as a range operator, and \ as the escape character. This is why [ ^] wouldn't work for you: it was searching for a literal ^.
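To make the difference between [ ^] and ( |^) concrete, here is a minimal sketch. The two sample lines are made up for illustration; only the two patterns come from the question.

```php
<?php
// Hypothetical sample lines: the term at the start of one, after a space in the other.
$lines = ['stackoverflow="" xxx', 'xxx stackoverflow=""'];

$classPattern = '/[ ^]stackoverflow=""/';   // [ ^] means: a space OR a literal caret
$groupPattern = '/( |^)stackoverflow=""/';  // ( |^) means: a space OR the start of the string

$classHits = [];
$groupHits = [];
foreach ($lines as $line) {
    $classHits[] = preg_match($classPattern, $line);
    $groupHits[] = preg_match($groupPattern, $line);
}

print_r($classHits); // the line that starts with the term is missed
print_r($groupHits); // both lines match
```

Inside the class, ^ is just another character to look for, so nothing can satisfy it at the start of the string.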

$ php -a
Interactive shell

php > $input_line='
php ' stackoverflow="" xxx
php ' xxx stackoverflow="" xxx
php ' xxx stackoverflow=""
php ' ';
php > echo preg_replace('/(?:^|\s)stackoverflow=""(?:\s|$)/', 'OK', $input_line);
OKxxx
xxxOKxxx
xxxOK

https://regex101.com/r/nP2aB8/1

edited Nov 24 '15 at 00:08

answered Nov 23 '15 at 23:32


That solves the example problem perfectly, as my `( |^)` did, which I mention in the question. But the question is why the `\b` solution breaks only when there is a `=` or `"` in the match. – gcb Nov 24 '15 at 00:00
Because `\b` is looking for certain characters as mentioned in the answer. It's a "word boundary" and the `"` is not a word character. – miken32 Nov 24 '15 at 00:02
This regex [won't be able to deal with matching `stackoverflow=""` in `,stackoverflow="" xxx`](https://regex101.com/r/tL9kX1/1). It is also not correct to say *`\b` is a placeholder for `(^\w|\w$|\W\w|\w\W)`* as it matches an empty location between the subpatterns listed between alternatives. – Wiktor Stribiżew Nov 24 '15 at 00:06
So basically; \b only matches when there is a word character on one side and a breaking character (or nothing) on the other. – jcuenod Nov 24 '15 at 00:07
What does the `"` have to do with `\b`, guys?! It is matching "space or beginning/end of line". It /never/ has to match the `"` or the `=`. Remember that the `=""` in the example is explicitly typed and matched in the regexp. It is not matched by the `\b` or the `\s` or a simple ` ` or `^` or `$` (all of which also work there). – gcb Nov 24 '15 at 00:08
@gcb read the answer. It's not matching "space or beginning/end of line." – miken32 Nov 24 '15 at 00:09
@miken32 Why does it work for the 3 examples at the beginning of the question, and in the question I referenced too? – gcb Nov 24 '15 at 00:10
@stribizhev I didn't want to try explaining an assertion, so treating it like a character class seemed like the easiest way. You're right about using lookarounds to do this properly though. – miken32 Nov 24 '15 at 00:13
@gcb because those are word characters – miken32 Nov 24 '15 at 00:14
FWIW, I used `(?: |^)` and `(?: |$)` respectively in production, since I must only match spaces, and they look cooler that way. But I am still baffled why `\b` won't match empty locations, as @stribizhev mentioned, /only/ when the string next to it has those chars. – gcb Nov 24 '15 at 00:16
OK, I see my dumb ways now. `\b` requires a `\w` (simplifying) on **BOTH** sides of it, regardless of what else is in the match? ...So it goes a little beyond just matching the `\w`... it also matches a `\w|$|^` before and after it. – gcb Nov 24 '15 at 00:19
A word boundary `\b` is equivalent to `(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))` which means: *Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string)*. **OR** - *Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).* – Wiktor Stribiżew Nov 24 '15 at 00:20
Thanks for the insightful discussion, everyone. I think we finally got to the bottom of it, and apologies for the stubbornness along the way. But judging from the tons of questions showing up and being deleted, this is a topic that confuses a lot of people. – gcb Nov 24 '15 at 00:27

Wiktor Stribiżew · Accepted Answer · 2020-04-28T14:10:14.863

Background

From the regular-expressions.info Word boundaries page:

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.

There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
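As a quick check of those three rules, the sketch below lists every offset where \b matches in one of the question's input lines. The variable names are just for illustration.

```php
<?php
$subject = 'xxx stackoverflow=""';

// \b is zero-length, so every match is an empty string; only the offsets matter here.
preg_match_all('/\b/', $subject, $m, PREG_OFFSET_CAPTURE);
$offsets = array_column($m[0], 1);

print_r($offsets);       // boundaries around "xxx" and around "stackoverflow" only
echo strlen($subject);   // length of the subject, for comparison with the last offset
```

The last boundary sits between the final w and the =. There is none after either quote or at the end of the string, which is exactly where the question's trailing \b needs one.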

A very good explanation from nhahtdh's post:

A word boundary \b is equivalent to:
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
Which means:

Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).

OR

Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).
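The equivalence can be checked empirically, at least for a sample string. This is a small sketch, not a proof; offsetsOf() is a hypothetical helper name and the subject is made up.

```php
<?php
// Hypothetical helper: returns the offsets of all matches of $pattern in $subject.
function offsetsOf(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    return array_column($m[0], 1);
}

$subject  = 'xxx stackoverflow="" yyy';
$boundary = offsetsOf('/\b/', $subject);
$expanded = offsetsOf('/(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))/', $subject);

var_dump($boundary === $expanded); // both patterns match at the same positions
print_r($boundary);
```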

What's wrong with your regex

\b is not suitable here because whether it matches depends on the characters immediately on both sides of it: it needs a word character on exactly one side. When you build a regex dynamically, you do not know in advance whether the keyword starts or ends with a word character, so you do not know whether to use \b or \B. For your case, you could use '/\bstackoverflow=""\B/', but in general that requires choosing between a word and a non-word boundary based on the keyword's first and last characters. However, there is an easier way: use negative lookarounds.
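For illustration, here is how the \b ... \B variant behaves on a few inputs. The first two inputs are from the question; the third is an assumed negative case.

```php
<?php
$inputs = [
    'xxx stackoverflow="" xxx',
    'xxx stackoverflow=""',
    'xxx stackoverflow=""not',   // assumed negative case: a word character right after the quotes
];

$results = [];
foreach ($inputs as $input) {
    // Leading \b: "s" is a word character, so a boundary is required before it.
    // Trailing \B: '"' is not a word character, so the position after it must NOT be a boundary.
    $results[] = preg_replace('/\bstackoverflow=""\B/', 'OK', $input);
}

print_r($results);
```

This works only because the keyword is known to start with a word character and end with a non-word character, which is the weakness described above.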

Solution

(?<!\w)stackoverflow=""(?!\w)

See regex demo

The regex contains negative lookarounds instead of word boundaries. The (?<!\w) lookbehind fails the match if there is a word character before stackoverflow="", and (?!\w) lookahead fails the match if stackoverflow="" is followed by a word character.
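A short sketch of what each lookaround rejects, using made-up sample strings (the second one is the unwanted case from the question):

```php
<?php
$re = '/(?<!\w)stackoverflow=""(?!\w)/';

$samples = [
    'xxx stackoverflow="" xxx',   // surrounded by spaces
    'nostackoverflow=""not',      // word characters on both sides
    ',stackoverflow="",',         // surrounded by punctuation
];

$matched = [];
foreach ($samples as $sample) {
    $matched[] = preg_match($re, $sample) === 1;
}

var_dump($matched);
```

Only the second sample is rejected: the lookbehind sees the o of "no" directly before the term.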

What the shorthand character class \w matches depends on whether you enable the Unicode modifier /u. Without it, \w matches just [a-zA-Z0-9_] (unless locale-specific matching is in effect). You can add further restrictions using the lookarounds.
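A rough sketch of the /u effect on the lookarounds, assuming PHP 7 or later (for the "\u{...}" escape) and a made-up subject. The result without /u is deliberately not stated, since it can depend on the PHP version and locale.

```php
<?php
// "\u{e9}" is é, written as an escape so the example does not depend on the file encoding.
$subject = "\u{e9}cole";

// With /u the subject is read as UTF-8 and é counts as a word character,
// so the lookbehind sees a word character before "cole" and rejects the match.
$withUnicode = preg_match('/(?<!\w)cole(?!\w)/u', $subject);

// Without /u the subject is matched byte by byte.
$withoutUnicode = preg_match('/(?<!\w)cole(?!\w)/', $subject);

var_dump($withUnicode);
var_dump($withoutUnicode);
```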

Demo

PHP demo:

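// Lookarounds instead of \b: no word character may precede or follow the term.
$re = '/(?<!\w)stackoverflow=""(?!\w)/';
// Four sample lines: after a comma, between spaces, at end of line, at start of line.
$str = ",stackoverflow=\"\" xxx\nxxx stackoverflow=\"\" xxx\nxxx stackoverflow=\"\"\nstackoverflow=\"\" xxx";
// Every occurrence is replaced with NEW="".
echo preg_replace($re, "NEW=\"\"", $str);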

NOTE: If you pass your string as a variable, remember to escape all special characters in it with preg_quote:

$re = '/(?<!\w)' . preg_quote($keyword, '/') . '(?!\w)/';

Here, notice the second argument to preg_quote, which is /, the regex delimiter char.
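Wrapped up as a function, that might look like the sketch below. buildKeywordPattern() is a hypothetical name, and the keyword and subject are invented to show why both the delimiter and the dot need escaping.

```php
<?php
// Hypothetical helper: wraps an arbitrary keyword in the negative lookarounds.
function buildKeywordPattern(string $keyword): string
{
    // The second argument makes preg_quote escape the "/" delimiter as well.
    return '/(?<!\w)' . preg_quote($keyword, '/') . '(?!\w)/';
}

$pattern = buildKeywordPattern('a/b.c');
$out     = preg_replace($pattern, 'OK', 'see a/b.c and a/bxc');

echo $pattern, "\n";
echo $out, "\n";
```

Without the quoting, the / would end the pattern early and the . would also match the x in a/bxc.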

*why it breaks the `\b` solution only when there is `=` or `"` in the match?* - Explained in my answer. — Wiktor Stribiżew, Nov 24 '15 at 00:05
A very interesting [post describing word and non-word boundaries](http://stackoverflow.com/a/16624542/3832970). — Wiktor Stribiżew, Nov 24 '15 at 00:18
*I used in production `(?: |^)` and `(?: |$)` respectively since i must only match spaces, and they look cooler that way.* - No, not cooler. You miss on Unicode spaces like a hard space. You should use `(?:^|\h)` and `(?:$|\h)` if you must match between spaces only. — Wiktor Stribiżew, Nov 24 '15 at 00:23
Thanks for the attention! it is an ascii protocol. i want to enforce the correct char there, which is space only. — gcb, Nov 24 '15 at 00:31
That information is not part of the question. BTW, `^` matches the start of a string/line when it is outside of a character class. Inside a character class, it either means a negation of the characters inside it (if it is the first character after `[`) or a literal `^` if it is placed somewhere further inside the character class (as in your `[ ^]`). — Wiktor Stribiżew, Nov 24 '15 at 00:33

score 2 · Answer 3 · answered Nov 23 '15 at 23:33

" is, of course, not special.

The word boundary, \b, OTOH, is. It looks for a word beginning/ending, and on the boundary it expects a word character - and the quote is not such a character.

Remove it from the end or replace it with a negative look-ahead search for a word character.
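The two options differ on input like stackoverflow=""not. A small sketch with a made-up subject:

```php
<?php
$subject = 'stackoverflow="" stackoverflow=""not';

// Option 1: drop the trailing \b. Anything may follow the closing quote.
$dropped = preg_replace('/\bstackoverflow=""/', 'OK', $subject);

// Option 2: replace the trailing \b with a negative look-ahead for a word character.
$lookahead = preg_replace('/\bstackoverflow=""(?!\w)/', 'OK', $subject);

echo $dropped, "\n";
echo $lookahead, "\n";
```

Dropping the \b also replaces the occurrence that runs straight into "not"; the look-ahead leaves it alone.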
