
I'm trying to remove stop words from a string with a single .replace() call, because that turned out to be the fastest approach in this performance test. However, it fails when two stop words follow each other, as in the snippet below:

var stopWordsRE = /((?:^|\s+?)(foo|bar)(?:$|\s+?))/gi;
var text = "foo bar baz bar foobar";
var filtered = text.replace(stopWordsRE, " ");
console.log(filtered); // " bar baz foobar"

But it's supposed to return:

baz foobar

The problem is that the regular expression matches foo together with the whitespace that follows it, so no preceding whitespace is left for bar to match. I thought the non-capturing groups would be enough to keep the whitespace from being consumed, but apparently they are not. How can I fix the regex so that it also matches stop words that follow each other?
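
One way to see the behavior described above is to list what each match consumes. This is only a minimal sketch using the same pattern and text as the snippet; the variable names are made up for illustration:

```javascript
// Same pattern and input as in the question.
var originalRE = /((?:^|\s+?)(foo|bar)(?:$|\s+?))/gi;
var sample = "foo bar baz bar foobar";

// With the g flag, match() returns every full match as a string.
// The first match includes the space after "foo", so the "bar" right
// behind it has no leading whitespace left and is skipped.
var consumedByOriginal = sample.match(originalRE);
console.log(consumedByOriginal); // ["foo ", " bar "]
```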

rob

2 Answers


Try matching with word boundaries (\b):

var stopWordsRE = /(\b(foo|bar)\b\s*)/gi;

That matches multiple times on the line (g flag), case-insensitive (i flag), as you already had.

It matches any foo or bar that is a full word. That is, both ends of the match are bounded by word boundaries, which are zero-length anchors that correspond to the beginning or end of a word.

Finally, the \s* grabs any (or no) whitespace after the word, so you don't end up with multiple spaces between the remaining words.
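
As a minimal sketch of how this behaves on the sample text from the question: the pattern is the one given above, while the empty replacement string and the final trim() are assumptions, since only the pattern itself is specified here.

```javascript
// Word-boundary version: \b anchors both ends of the stop word,
// and \s* also consumes any whitespace that follows it.
var stopWordsRE = /(\b(foo|bar)\b\s*)/gi;
var text = "foo bar baz bar foobar";

// Assumption: replace with "" and trim, so a stop word at the very end
// of the text does not leave a trailing space behind.
var filtered = text.replace(stopWordsRE, "").trim();
console.log(filtered); // "baz foobar"
```

Because \b is zero-length, consecutive stop words do not compete for the same whitespace, and "foobar" is left alone because there is no boundary inside it.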

OtherDevOpsGene
  • This does indeed solve the problem from the question and is also an improvement. But it seems to have trouble with German umlauts: it matches the `bar` in `häbar`. At least in Chrome, but this might also be a bug in V8. Will test it in other browsers as well. – rob Oct 13 '14 at 19:27
  • @rob Good point. See http://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters – OtherDevOpsGene Oct 13 '14 at 19:29
  • I like this solution better, but with the current mess with JS and Unicode it's not an option. – rob Oct 13 '14 at 19:38
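
The umlaut behavior described in these comments can be reproduced with a short sketch. The sample string "häbar baz" is made up for illustration, and the second pattern is a whitespace-delimited alternative included only for comparison:

```javascript
// Without the u flag, \b is defined in terms of the ASCII word characters
// [A-Za-z0-9_]. "ä" is therefore a non-word character, and a word
// boundary appears between "ä" and "b".
var boundaryRE = /\b(?:foo|bar)\b\s*/gi;
var umlautSample = "häbar baz";

var withBoundary = umlautSample.replace(boundaryRE, "");
console.log(withBoundary); // "häbaz"

// A pattern that delimits stop words by whitespace (or string start/end)
// does not depend on \b and leaves "häbar" untouched.
var whitespaceRE = /(?:^|\s+)(?:foo|bar)(?=\s+|$)/gi;
var withWhitespace = umlautSample.replace(whitespaceRE, "");
console.log(withWhitespace); // "häbar baz"
```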

Instead of consuming the whitespace after foo or bar, use a positive lookahead:

var stopWordsRE = /(?:^|\s+)(?:foo|bar)(?=\s+|$)/gi;
var filtered = text.replace(stopWordsRE, "").trim();
//=> "baz foobar"
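
For reference, here is a self-contained sketch of the same idea that also lists what each match consumes. It reuses the text from the question; the variable name `consumed` is just illustrative:

```javascript
var text = "foo bar baz bar foobar";
var stopWordsRE = /(?:^|\s+)(?:foo|bar)(?=\s+|$)/gi;

// The lookahead (?=\s+|$) only checks for trailing whitespace without
// consuming it, so that whitespace is still available as the leading
// whitespace of the next match.
var consumed = text.match(stopWordsRE);
console.log(consumed); // ["foo", " bar", " bar"]

// Each stop word is removed together with its leading whitespace;
// trim() drops the space left at the start when the text begins with one.
var filtered = text.replace(stopWordsRE, "").trim();
console.log(filtered); // "baz foobar"
```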
anubhava