According to this post, if the position is not at the beginning or end of the string, only a word character ([0-9A-Za-z_]) defines the word boundary.

However, the following code returns something I didn't expect (the example is derived from this book, in the section Dynamically creating RegExp objects):

let name = "dea+hl[]rd";
let text = "dea+hl[]rd is a suspicious character.";
let regexp = new RegExp("\\b(" + name + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → dea+hl[]rd is a suspicious character.

Shouldn't the first matched group be dea because + is not a word character? I expect the replaced string to be _dea_+hl[]rd is a suspicious character.

In addition, when I replace it with let name = "";, the output becomes __dea__+__hl__[]__rd__ __is__ __a__ __suspicious__ __character__. Where do the underscores come from?

The corrected code shown in the book is cryptic to me as well:

let name = "dea+hl[]rd";
let text = "This dea+hl[]rd guy is super annoying.";
let escaped = name.replace(/[\\[.+*?(){|^$]/g, "\\$&");
let regexp = new RegExp("\\b" + escaped + "\\b", "gi");
console.log(text.replace(regexp, "_$&_"));
// → This _dea+hl[]rd_ guy is super annoying.

How does adding backslashes before the special characters affect the word boundary?


To answer my own question (because my question was closed as a duplicate of something irrelevant), I tested my regex on regex101. I hadn't tried that earlier because the RegExp constructor syntax is not accepted there. Anyway, the reason nothing matches in the first example is that /\b(dea+hl[]rd)\b/ can never match. [] is special syntax denoting a set of characters, and in JavaScript an empty set matches no character at all (some other regex flavors treat the pattern as an error instead). Since the regex cannot match anything, text.replace(regexp, "_$1_") just returns text.
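
A quick check in JavaScript itself may help here. The sketch below (the variable names are only illustrative) suggests that the RegExp constructor accepts the unescaped pattern, and that the empty class [] simply never matches, which is why replace leaves the text untouched:

```javascript
// Build the same pattern as in the question, without escaping the name.
const rawName = "dea+hl[]rd";
const unescaped = new RegExp("\\b(" + rawName + ")\\b", "gi");

// The constructor does not throw; the pattern is kept as written.
console.log(unescaped.source); // \b(dea+hl[]rd)\b

// An empty character class cannot match any character.
console.log(/[]/.test("anything")); // false

// So the whole pattern fails and replace returns the input unchanged.
const sample = "dea+hl[]rd is a suspicious character.";
console.log(sample.replace(unescaped, "_$1_") === sample); // true
```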

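When name = "", the regex becomes /\b()\b/gi, which matches the empty string at every word boundary. The group $1 is empty there, so each boundary is replaced by "_" + "" + "_", i.e. two underscores. If we mark word boundaries by |, the word boundaries in text are |dea|+|hl|[]|rd| |is| |a| |suspicious| |character|.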
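
To see this concretely, here is a small sketch (variable names are my own) that runs the empty-name pattern and also marks every word boundary with | for comparison:

```javascript
// With an empty name the pattern is just two boundaries around an empty group.
const emptyName = "";
const boundaryOnly = new RegExp("\\b(" + emptyName + ")\\b", "gi");
const sentence = "dea+hl[]rd is a suspicious character.";

// Every word boundary is a zero-length match, and $1 is the empty string.
const underscored = sentence.replace(boundaryOnly, "_$1_");
console.log(underscored);
// → __dea__+__hl__[]__rd__ __is__ __a__ __suspicious__ __character__.

// Marking the same positions with | shows where the boundaries are.
const marked = sentence.replace(/\b/g, "|");
console.log(marked);
// → |dea|+|hl|[]|rd| |is| |a| |suspicious| |character|.
```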

Finally, escaping special characters does not change the behavior of word boundary. It is just there to make the regex valid for the reasons mentioned above.
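
As a rough illustration (reusing the escaping expression from the book's example, with variable names of my own), the escaped name only turns the special characters into literals; \b still requires a word/non-word transition on each side:

```javascript
// Escape the regex-special characters, as in the book's corrected code.
const nameToEscape = "dea+hl[]rd";
const escapedName = nameToEscape.replace(/[\\[.+*?(){|^$]/g, "\\$&");
console.log(escapedName); // dea\+hl\[]rd

const literal = new RegExp("\\b" + escapedName + "\\b", "gi");

// Surrounded by spaces: there is a boundary on both sides, so it matches.
const annoyed = "This dea+hl[]rd guy is super annoying.".replace(literal, "_$&_");
console.log(annoyed); // This _dea+hl[]rd_ guy is super annoying.

// Preceded by a word character: no boundary before "d", so no match.
const glued = "xdea+hl[]rd".replace(literal, "_$&_");
console.log(glued); // xdea+hl[]rd
```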

1 Answer

  1. In your first example you are not escaping the + character, so it means "match the preceding expression one or more times". In your case that means the letter a must appear one or more times after de (dea, deaa, etc.). Then there are the unescaped square brackets. Usually [xyz] means "any one of x, y, or z", but in your case the brackets are empty, so it is like "one character out of none". A regular expression containing empty square brackets will not match anything. So, as shown in your last example, you should escape special characters first.

  2. About the word boundary. The post you referred to says that "A word boundary ... is a position between \w and \W (non-word char)". In your case a matches \w (a is a word character) and + matches \W (+ is a non-word character), so there is a word boundary between a and +. Similarly, there is a word boundary between + and h, between l and [, and so on.
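
Both points can be checked with a short sketch (the names below are made up for the illustration): an unescaped + acts as a quantifier on a, and \b does sit between a and +, so a pattern that stops at dea gives the result you expected:

```javascript
// Point 1: unescaped, "a+" means "one or more a", not a literal plus sign.
const onlyDea = /^dea+$/;
console.log(onlyDea.test("dea"));   // true
console.log(onlyDea.test("deaaa")); // true
console.log(onlyDea.test("dea+"));  // false, "+" is not matched literally

// Point 2: there is a word boundary between "a" and "+",
// so \bdea\b matches the "dea" in front of the plus sign.
const wrapped = "dea+hl[]rd is a suspicious character.".replace(/\b(dea)\b/gi, "_$1_");
console.log(wrapped); // _dea_+hl[]rd is a suspicious character.
```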

Alex Gessen