0

I'm working on a Chrome extension that involves replacing words. There are around 6500 words I want to look for in any given page.

I'm using regex expressions to find the words and replace them, but it's way too slow: around 10 seconds for an average news article.

Is there a more efficient way to find a lot of specific words in a large string?

Kalnode
  • 9,386
  • 3
  • 34
  • 62
nadiacw
  • 21
  • 6
  • [Trie](https://en.wikipedia.org/wiki/Trie) – Amit Dec 23 '17 at 22:14
  • 1
    Just out of curiosity, can you post the regular expressions you created? – clearlight Dec 23 '17 at 22:27
  • @clearlight for example https://regex101.com/r/y4RcUM/1 – nadiacw Dec 23 '17 at 22:37
  • @wp78de javascript – nadiacw Dec 23 '17 at 22:38
  • How about `\w+` with a function that checks if the word is the right word (in the set)? Many `|`s involve a lot of backtracking. Something like [this](https://jsfiddle.net/oozyodr9/). The words list is rather primitive, which is an array even, and still rather fast. – Caramiriel Dec 23 '17 at 23:20
  • The regexp itself may not be the bottleneck here. Use [Devtools profiler](https://developers.google.com/web/tools/chrome-devtools/rendering-tools/js-execution) first to see what's causing the delays: maybe your node traversal code needs to be reworked. – wOxxOm Dec 24 '17 at 10:04

3 Answers3

0

I think you try to achieve too much at once. If you have a large pattern with a LOT of alternations with the vertical bar/pipe | your pattern gets slow because the regex engine has to backtrack a lot.

Therefore, I suggest to chain-replace instead.

Here are two ReplaceAll candidates to play with:

//Regular Expression Based Implementation
String.prototype.replaceAll = function(search, replacement) {
    var target = this;
    return target.replace(new RegExp(search, 'g'), replacement);
};
//Split and Join (Functional) Implementation
String.prototype.replaceAll2 = function(search, replacement) {
    var target = this;
    return target.split(search).join(replacement);
};
var t0 = performance.now();
//your Approach
var t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")

var t0 = performance.now();
str.replaceAll('Erica', 'Example');
//...
var t1 = performance.now();
console.log("Call to doSomething took " + (t1 - t0) + " milliseconds.")

var t0 = performance.now();
str.replaceAll2('Erica', 'Example');
//...
var t1 = performance.now();
wp78de
  • 18,207
  • 7
  • 43
  • 71
  • Extra tip: Sort your `words` descending order by length (long to short). This helps to avoid erroneous replacements. – wp78de Dec 23 '17 at 23:37
0

A fastest way to find a word match is to use both regular expressions and string parsing. For example, we have to find a list of email addresses within a large text. With regular expression the system try to match the local part of the address, that is before the @ character, and then the domain name. This is hard to find. Instead, you can iterate through all the text, finding for @ characters, and then, check if the syntax is right using regexp.

JulianSoto
  • 182
  • 1
  • 4
  • 15
  • Sounds good, but the words I'm searching for don't share any special characters, it's a long list of first names, like [this](https://regex101.com/r/y4RcUM/1). – nadiacw Dec 24 '17 at 08:40
0

A simple way you can increase the efficiency of your regular expression is to prevent it from checking words that start with lowercase letters right off the bat. First search for a word boundary and a capital letter, and then in your list of names, remove the first letter from each of them. Using your example, this is what your regex would look like:

\b[A-Z](rica|ary|essica)\b

(Also notice the change made with the word boundaries for a faster and more condensed regular expression.)

Josh Withee
  • 9,922
  • 3
  • 44
  • 62
  • Yes, in my example I'm using names, so you could specify that the first letter will always be capital, but this alternative would match words I'm not looking for (for example "Lary", "Gary") and I want to make sure that won't happen. – nadiacw Dec 24 '17 at 08:06
  • You can use negative look-ahead: `\b(?![^A-Z])(Erica|Mary|Jessica)\b`. You can also optimize your long regexp using a special utility e.g. perl's [Regexp::Optimizer](http://search.cpan.org/~dankogai/Regexp-Optimizer-0.23/lib/Regexp/Optimizer.pm) – wOxxOm Dec 24 '17 at 09:32