
I have to build a RegExp object that searches for words from an array and matches whole words only.

For example, given the word array ['יל','ילד'], I want the RegExp to find 'יל' or 'ילד', but not 'ילדד'.

This is my code:

var text = 'ילד ילדדד יל';
var matchWords = ['יל','ילד'];
text = text.replace(/\n$/g, '\n\n').replace(new RegExp('\\b(' + matchWords.join('|') + ')\\b','g'), '<mark>$&</mark>');
console.log(text);

What I have tried:

I tried this code:

new RegExp('(יל|ילד)','g');

It works, but it also finds longer words like "ילדדדד"; I need to match whole words only.

I tried also this code:

new RegExp('\\b(יל|ילד)\\b','g');

but this regular expression doesn't find any word!

How should I build my RegExp?

2 Answers


The word boundary \b is not Unicode aware. Use XRegExp to build a Unicode word boundary:

var text = 'ילד ילדדד יל';
var matchWords = ['יל','ילד'];
var re = XRegExp('(^|[^_0-9\\pL])(' + matchWords.join('|') + ')(?![_0-9\\pL])','ig');
text = XRegExp.replace(text.replace(/\n$/g, '\n\n'), re, '$1<mark>$2</mark>');
console.log(text);
<script src="http://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>

Here, (^|[^_0-9\\pL]) is capturing group 1; it matches either the start of the string or any character other than a Unicode letter, an ASCII digit or _ (a leading word boundary). The lookahead (?![_0-9\\pL]) fails the match if the word is followed by _, an ASCII digit or a Unicode letter. Because group 1 consumes the preceding character, the replacement puts it back with $1.
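To see why the plain \b version from the question finds nothing, the following sketch compares it with the group-plus-lookahead boundary described above. It uses the native \p{L} syntax (ECMAScript 2018+) instead of XRegExp so that it runs without any library; the variable names are only illustrative.

```javascript
// \b is defined in terms of \w, which covers only [A-Za-z0-9_],
// so a Hebrew letter never forms a "word" side for \b to anchor on.
const asciiBoundary = /\bיל\b/;
const asciiBoundaryWithU = /\bיל\b/u; // the u flag alone does not widen \w
const unicodeBoundary = /(^|[^_0-9\p{L}])יל(?![_0-9\p{L}])/u;

const lone = asciiBoundary.test('יל');
const loneU = asciiBoundaryWithU.test('יל');
const wholeWord = unicodeBoundary.test('יל');
const insideLongerWord = unicodeBoundary.test('ילדדד');

console.log(lone, loneU, wholeWord, insideLongerWord);
// prints: false false true false
```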

With ECMAScript 2018+ support, the same pattern works natively using \p{L} and the u flag:

let text = 'ילד ילדדד יל';
const matchWords = ['יל','ילד'];
const re = new RegExp('(^|[^_0-9\\p{L}])(' + matchWords.join('|') + ')(?![_0-9\\p{L}])','igu');
text = text.replace(re, '$1<mark>$2</mark>');
console.log(text);
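One thing the snippet above leaves open is what happens when a word contains regex metacharacters such as + or ., since the words are joined into the pattern unescaped. A minimal sketch of a reusable wrapper follows; escapeRegExp and markWholeWords are hypothetical helper names, not part of any library, and the function assumes a non-empty word list.

```javascript
// Hypothetical helper: escape characters that are special inside a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap whole-word occurrences of `words` in <mark> tags.
// Assumes `words` is a non-empty array of non-empty strings.
function markWholeWords(text, words) {
  const alternation = words.map(escapeRegExp).join('|');
  const re = new RegExp(
    '(^|[^_0-9\\p{L}])(' + alternation + ')(?![_0-9\\p{L}])', 'giu');
  // A replacer function sidesteps any special meaning of "$" in the output.
  return text.replace(re, (match, before, word) => before + '<mark>' + word + '</mark>');
}

const marked = markWholeWords('ילד ילדדד יל', ['יל', 'ילד']);
const markedSpecial = markWholeWords('use c++ not c', ['c++']);
console.log(marked);
console.log(markedSpecial); // prints: use <mark>c++</mark> not c
```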

Another ECMAScript 2018+ compliant solution, which fully emulates a Unicode-aware \b construct, is explained at Replace certain arabic words in text string using Javascript.
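As a rough idea of what a non-consuming boundary can look like, here is a sketch that uses a lookbehind in place of the leading capturing group, so the replacement can use $& directly. This is an assumption about one reasonable form of such a pattern, not a copy of the linked answer; the choice of letters, combining marks, digits and underscore as "word" characters is also an assumption.

```javascript
const words = ['יל', 'ילד'];
// Characters treated as "word" characters here (an assumption):
// Unicode letters, combining marks, digits, and underscore.
const wordChar = '[\\p{L}\\p{M}\\p{N}_]';

// Neither lookaround consumes text, so no group has to be restored.
const wholeWordRe = new RegExp(
  '(?<!' + wordChar + ')(?:' + words.join('|') + ')(?!' + wordChar + ')', 'gu');

const highlighted = 'ילד ילדדד יל'.replace(wholeWordRe, '<mark>$&</mark>');
console.log(highlighted);
```

Lookbehind assertions are themselves an ECMAScript 2018 feature, so this needs the same engine support as \p{L}.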

Wiktor Stribiżew

//Words to join
var words = ['apes', 'cats', 'bazooka'];
//String to search
var str = 'it\'s good that cats and dogs dont wear bazookas';
//End at start of line, end of line or whitespace
var end = '(^|$|\\s)';
//Regular expression string
var regex = end + "(" + words.join('|') + ")" + end;
//Build RegExp
var re = new RegExp(regex, "gi");
//Log results
console.log(str.match(re));

Or as a function:

var findWholeWordInString = (function() {
  //End at start of line, end of line or whitespace
  var end = '(^|$|\\s)';
  //The actual function
  return function(str, words) {
    //Regular expression string
    var regex = end + "(" + words.join('|') + ")" + end;
    //Build RegExp
    var re = new RegExp(regex, "gi");
    //Return results
    return str.match(re);
  };
})();
//Run test
console.log(findWholeWordInString('it\'s good that cats and dogs dont wear bazookas', ['apes', 'cats', 'bazooka']));
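One caveat with whitespace groups that consume their delimiter: the matched strings include the surrounding whitespace, and two listed words separated by a single space cannot both match, because the space is used up by the first match. The sketch below contrasts that with a lookaround variant (ECMAScript 2018+); the variable names are illustrative only.

```javascript
const wordList = ['apes', 'cats', 'bazooka'];
const sample = 'cats apes';

// Consuming delimiters: the space after "cats" becomes part of the first
// match, so it is no longer available as the delimiter in front of "apes".
const consuming = new RegExp('(^|$|\\s)(' + wordList.join('|') + ')(^|$|\\s)', 'gi');
const consumingMatches = sample.match(consuming);

// Lookarounds only assert that no non-whitespace character is adjacent;
// they consume nothing, so neighbouring words can both match.
const lookaround = new RegExp('(?<!\\S)(?:' + wordList.join('|') + ')(?!\\S)', 'gi');
const lookaroundMatches = sample.match(lookaround);

console.log(consumingMatches);  // prints: [ 'cats ' ]
console.log(lookaroundMatches); // prints: [ 'cats', 'apes' ]
```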
Emil S. Jørgensen