In the JavaScript code below, I need to find exact words in a text, while excluding words that are between quotes. This is my attempt; what's wrong with the regex? It should find all the words except word22 and "word3". If I use only \b in the regex, it matches exact words but doesn't exclude the words between quotes.

var text = 'word1, word2, word22, "word3" and word4';
var words = [ 'word1', 'word2', 'word3' , 'word4' ];
words.forEach(function(word){
    var re = new RegExp('\\b^"' + word + '^"\\b', 'i');
    var  pos = text.search(re); 
    if (pos > -1)
        alert(word + " found in position " + pos);
});
ps0604

2 Answers


First, we'll use a function to escape the word's characters, in case any of them have a special meaning in a regexp.

// from https://stackoverflow.com/a/30851002/240443
function regExpEscape(literal_string) {
    return literal_string.replace(/[-[\]{}()*+!<=:?.\/\\^$|#\s,]/g, '\\$&');
}

Then we construct a regular expression as an alternation of the individual word regexps. For each word, we assert that it starts at a word boundary, ends at a word boundary, and has an even number of quote characters between its end and the end of the string. (Note that between the end of word3 and the end of the string there is only one quote, which is odd.)

let text = 'word1, word2, word22, "word3" and word4';
let words = [ 'word1', 'word2', 'word3' , 'word4' ];
let regexp = new RegExp(words.map(word =>
'\\b' + regExpEscape(word) + '\\b(?=(?:[^"]*"[^"]*")*[^"]*$)').join('|'), 'g')

text.match(regexp)
// => ['word1', 'word2', 'word4']

let m; // declare the match variable so it isn't an implicit global
while ((m = regexp.exec(text))) {
  console.log(m[0], m.index);
}
// word1 0
// word2 7
// word4 34
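
To see why the lookahead works, the same parity check can be written by hand. This is only an illustrative sketch: `isOutsideQuotes` is a hypothetical helper, not part of the answer's solution, and it assumes quotes in the text are balanced and never escaped.

```javascript
// Hypothetical helper: a position is outside quotes when an even number
// of '"' characters remains between it and the end of the string.
function isOutsideQuotes(text, endIndex) {
  const remaining = text.slice(endIndex);
  const quoteCount = (remaining.match(/"/g) || []).length;
  return quoteCount % 2 === 0;
}

const sample = 'word1, word2, word22, "word3" and word4';
console.log(isOutsideQuotes(sample, 5));  // end of word1: two quotes remain
console.log(isOutsideQuotes(sample, 28)); // end of word3: one quote remains
```

The regexp lookahead `(?=(?:[^"]*"[^"]*")*[^"]*$)` performs the same count declaratively: it consumes quotes in pairs and then requires that no quote is left over.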

EDIT: Actually, we can speed the regexp up a bit if we factor out the surrounding conditions:

let regexp = new RegExp(
  '\\b(?:' + 
  words.map(regExpEscape).join('|') + 
  ')\\b(?=(?:[^"]*"[^"]*")*[^"]*$)', 'g')
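
For reuse, the factored pattern could be wrapped in a function. The following is a minimal sketch under a few assumptions: `findUnquotedWords` and `escapeRegExp` are hypothetical names, the escape helper is a common shorter variant rather than the one quoted above, and `String.prototype.matchAll` (ES2020) is available.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return every whole-word match that is followed by an even number of quotes.
function findUnquotedWords(text, words) {
  const pattern =
    '\\b(?:' + words.map(escapeRegExp).join('|') +
    ')\\b(?=(?:[^"]*"[^"]*")*[^"]*$)';
  const regexp = new RegExp(pattern, 'g');
  return Array.from(text.matchAll(regexp), m => ({ word: m[0], index: m.index }));
}

const found = findUnquotedWords(
  'word1, word2, word22, "word3" and word4',
  ['word1', 'word2', 'word3', 'word4']
);
console.log(found);
// [ { word: 'word1', index: 0 }, { word: 'word2', index: 7 }, { word: 'word4', index: 34 } ]
```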
Amadan
  • This is probably the better solution because it balances the quotes around something. Mine wouldn't match something like `"word2` or `word2"` where it starts or ends with a quotation mark, but isn't surrounded by one. – Matti Price Nov 28 '18 at 04:14

Your exclusion of the quote character is wrong: outside a character class, ^ is an anchor, so ^" actually matches the beginning of the string followed by a quote. Try this instead:

var re = new RegExp('\\b[^"]' + word + '[^"]\\b', 'i');
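
A quick check of this pattern against the question's sample text, as a sketch (`firstAttempt` is a hypothetical wrapper name): each `[^"]` has to consume a real character, so the pattern finds none of the words here and needs further tweaking.

```javascript
const text = 'word1, word2, word22, "word3" and word4';
const words = ['word1', 'word2', 'word3', 'word4'];

// Build the pattern from this answer's first attempt for a single word.
function firstAttempt(word) {
  return new RegExp('\\b[^"]' + word + '[^"]\\b', 'i');
}

const positions = words.map(word => text.search(firstAttempt(word)));
console.log(positions); // [ -1, -1, -1, -1 ]

// It does match when the word is wrapped in extra word characters.
console.log('bword2e'.search(firstAttempt('word2'))); // 0
```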

Also, this site is great for debugging regexes: https://regexpal.com

Edit: Because \b will match next to quotation marks, this needs to be tweaked further. Lookbehinds were only added to JavaScript in ES2018 and aren't available in older engines, so we have to get a little tricky.

var re = new RegExp('(?:^|[^"\\w])' + word + '(?:$|[^"\\w])','i')

Here is what each part does:

(?:          start a non-capturing group
^|[^"\w])    match either the start of the string, or a non-word character (not alphanumeric or underscore) that isn't a quote
word         match your word here
(?:          start another non-capturing group
$|[^"\w])    match either the end of the string, or a non-word character that isn't a quote
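
Because the leading group consumes a character, the index reported by `search` points at that separator rather than at the word. The sketch below shows two possible ways around this, neither of which is from the original answer: capturing the prefix and adding its length, or, in engines with ES2018 lookbehind support, using zero-width lookarounds. The function names and the `escapeRegExp` helper are hypothetical.

```javascript
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Variant 1: capture the consumed prefix and skip over it.
function findWithCapture(text, word) {
  const re = new RegExp('(^|[^"\\w])' + escapeRegExp(word) + '(?:$|[^"\\w])', 'i');
  const m = re.exec(text);
  return m ? m.index + m[1].length : -1;
}

// Variant 2: zero-width lookbehind and lookahead (requires ES2018 support).
function findWithLookaround(text, word) {
  const re = new RegExp('(?<!["\\w])' + escapeRegExp(word) + '(?!["\\w])', 'i');
  return text.search(re);
}

const sampleText = 'word1, word2, word22, "word3" and word4';
const targets = ['word1', 'word2', 'word3', 'word4'];
console.log(targets.map(w => findWithCapture(sampleText, w)));    // [ 0, 7, -1, 34 ]
console.log(targets.map(w => findWithLookaround(sampleText, w))); // [ 0, 7, -1, 34 ]
```

Both variants only inspect the characters immediately next to the word, so a word in the middle of a longer quoted phrase would still be reported.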
Matti Price
  • If seeking `word2`, you'd only find it if the string contained `bword2e` or similar, as your "not a quote" assertions are not null-width, and will have to consume a character each. – Amadan Nov 28 '18 at 03:37
  • since javascript doesn't support lookbehinds, this is a little more annoying, but see if that update works for you @ps0604 – Matti Price Nov 28 '18 at 04:03