Remove stop words

Question

I'm trying to remove stop words from a string with a single .replace() call, because it performed best in this performance test. But I run into problems when two stop words follow each other, as in the snippet below:

var stopWordsRE = /((?:^|\s+?)(foo|bar)(?:$|\s+?))/gi;
var text = "foo bar baz bar foobar";
var filtered = text.replace(stopWordsRE, " ");
console.log(filtered); // " bar baz foobar"

But it's supposed to return:

baz foobar

The problem is that the regular expression matches foo together with the whitespace that follows it, so there is no preceding whitespace left for bar to match. I thought the non-capturing groups would suffice, so that the whitespace is not remembered, but apparently they don't. Can you tell me how to fix the regex so that it also matches stop words that follow each other?
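To make the behaviour described above visible, here is a small sketch that lists every match the original pattern produces on the sample input. The `matches` array is just a name chosen for this illustration.

```javascript
// Same pattern and input as in the question.
const stopWordsRE = /((?:^|\s+?)(foo|bar)(?:$|\s+?))/gi;
const text = "foo bar baz bar foobar";

// Collect every full match together with its start index.
const matches = [];
let m;
while ((m = stopWordsRE.exec(text)) !== null) {
  matches.push({ index: m.index, match: m[0] });
}

console.log(matches);
// "foo " at index 0 and " bar " at index 11.
// The first bar (index 4) is skipped: the space before it was already
// consumed by the "foo " match, and ^ only matches at the start of the string.
```

Non-capturing groups only control what is stored in a capture group. The text they match is still consumed, which is why the second stop word has nothing left to anchor on.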

I can assure you, it is not. But if the solution is simple or somehow obvious, I apologise... — rob, Oct 13 '14 at 19:02
@rob It doesn't seem like a homework problem. And even if it did, you did a lot more than most people that come here with a homework problem asking for a solution — Ian, Oct 13 '14 at 19:07

OtherDevOpsGene · Answer 1 · 2014-10-13T19:17:28.140

2

Try to match using word boundaries: \b

var stopWordsRE = /(\b(foo|bar)\b\s*)/gi;

That matches multiple times on the line (g flag), case-insensitive (i flag), as you already had.

And it matches any foo or bar that is a full word. That is, both ends of the word must sit on word boundaries, which are zero-length anchors that correspond to the beginning or end of a word.

Finally, the \s* grabs any whitespace (or none) that follows the word, so you don't end up with multiple spaces between the remaining words.
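For completeness, a minimal sketch of the pattern applied to the sample string from the question. The empty replacement string and the final .trim() are assumptions on my part. Because \s* already swallows the whitespace after each stop word, replacing with " " as in the question would leave extra spaces.

```javascript
// Word-boundary pattern from this answer.
const stopWordsRE = /(\b(foo|bar)\b\s*)/gi;
const text = "foo bar baz bar foobar";

// Replace with the empty string; trim() covers a stop word at the very end,
// where the space before it would otherwise be left behind.
const filtered = text.replace(stopWordsRE, "").trim();

console.log(filtered); // "baz foobar"
```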

edited Oct 13 '14 at 19:17

answered Oct 13 '14 at 19:10

OtherDevOpsGene


This does indeed solve the problem from the question and is also an improvement. But it seems to have trouble with German umlauts. It matches the `bar` in `häbar`. At least in Chrome, but this might also be a bug in V8. Will test it in other browsers as well. – rob Oct 13 '14 at 19:27
@rob Good point. See http://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters – OtherDevOpsGene Oct 13 '14 at 19:29

I like this solution better but with the current mess with JS and unicode it's not an option. – rob Oct 13 '14 at 19:38
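A sketch of the umlaut issue raised in the comments. In JavaScript, \b is defined in terms of \w, which does not include ä, so a word boundary is found between ä and b. One possible workaround is to replace \b with lookarounds built on Unicode property escapes. This assumes an engine with ES2018 lookbehind and \p{...} support, which was not available when this thread was written. The names `asciiRE`, `unicodeRE` and `removeStopWordsUnicode` are made up for this example.

```javascript
// \b treats ä as a non-word character, so "bar" inside "häbar" is matched.
const asciiRE = /\b(?:foo|bar)\b\s*/gi;
console.log("häbar baz".replace(asciiRE, "")); // "häbaz"

// Assumption: ES2018+ engine (lookbehind and Unicode property escapes).
// A stop word only counts if it is not adjacent to a letter, mark, digit or "_".
const unicodeRE = /(?<![\p{L}\p{M}\p{N}_])(?:foo|bar)(?![\p{L}\p{M}\p{N}_])\s*/giu;

// Hypothetical helper wrapping the Unicode-aware pattern.
function removeStopWordsUnicode(text) {
  return text.replace(unicodeRE, "").trim();
}

console.log(removeStopWordsUnicode("häbar foo baz")); // "häbar baz"
```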

score 2 · Accepted Answer · answered Oct 13 '14 at 19:20

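Instead of consuming the whitespace after foo or bar, use a positive lookahead. It only asserts that whitespace (or the end of the string) follows, so that whitespace is still available as the leading whitespace of the next stop word: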

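// Input string from the question.
var text = "foo bar baz bar foobar";
// The lookahead (?=\s+|$) checks for trailing whitespace without consuming it.
var stopWordsRE = /(?:^|\s+)(?:foo|bar)(?=\s+|$)/gi;
var filtered = text.replace(stopWordsRE, "").trim();
//=> "baz foobar"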
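If the stop words come from a list rather than being hard-coded, the same lookahead pattern can be assembled at runtime. This is only a sketch: `buildStopWordsRE` and `removeStopWords` are hypothetical helper names, and the code assumes a non-empty array of plain strings.

```javascript
// Hypothetical helper: builds the lookahead-based pattern from a word list.
// Assumes `words` is a non-empty array of strings.
function buildStopWordsRE(words) {
  // Escape regex metacharacters so every word is matched literally.
  const escaped = words.map((w) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  return new RegExp("(?:^|\\s+)(?:" + escaped.join("|") + ")(?=\\s+|$)", "gi");
}

// Hypothetical helper: removes the given stop words and trims the result.
function removeStopWords(text, words) {
  return text.replace(buildStopWordsRE(words), "").trim();
}

console.log(removeStopWords("foo bar baz bar foobar", ["foo", "bar"]));
// "baz foobar"
```

Building the regex once and reusing it keeps the single .replace() call from the question.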

anubhava
