0

What's the best method to find any of a list of substrings in a specific string?

This works, but can't be right.

var searchEngines = [
    new RegExp("www.google."),
    new RegExp("www.yahoo."),
    new RegExp("search.yahoo."),
    new RegExp("www.bing.")
  ];

function isSearchEngine(url){
  for (let i=0,len=searchEngines.length; i < len; i++){
    if (searchEngines[i].exec(url)) {
      return true;
    }
  }
  return false;
}

Anything to speed this up, really...

[Edit:] After rooting around I found this:

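// In a string literal "\." is just ".", so the backslash must be doubled
// for the RegExp constructor to receive an escaped (literal) dot.
var searchEngines = [
      "www\\.google\\.",
      "www\\.yahoo\\.",
      "search\\.yahoo\\.",
      "www\\.bing\\.",
      "duckduckgo\\."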
    ].join('|');

    if (excludeSearch) {
      read = !(new RegExp(searchEngines, 'i')).test(keyword); // the 'g' flag is not needed for a single test
    }

// After the Map object arrived with ES2015 (ES6), I had this at my disposal as well
const imageExtensions = new Map();
  ['jpeg', 'jpg', 'jif', 'jfif', 'gif', 'tif', 'tiff', 'png', 'pdf', 'jp2', 'jpx', 'j2k', 'j2c', 'fpx', 'pcd'].forEach(function(e) {
    imageExtensions.set(e,true);
  });
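
For illustration, a lookup against such a Map might look like the sketch below. The names `knownImageExtensions` and `hasImageExtension` are hypothetical, the extension list is shortened, and the sketch assumes the extension is whatever follows the last dot in the file name.

```javascript
// Shortened, hypothetical version of the extension Map from the snippet above.
const knownImageExtensions = new Map();
['jpeg', 'jpg', 'gif', 'png'].forEach(function (e) {
  knownImageExtensions.set(e, true);
});

// Returns true when the text after the last "." is a known extension.
function hasImageExtension(filename) {
  const dot = filename.lastIndexOf('.');
  if (dot === -1) {
    return false; // no extension at all
  }
  return knownImageExtensions.has(filename.slice(dot + 1).toLowerCase());
}

console.log(hasImageExtension('photo.JPG')); // true
console.log(hasImageExtension('notes.txt')); // false
```

Since only membership is tested, a `Set` would probably serve equally well here.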
  
  • How slow can that possibly be? There aren't that many search engines. – Ken White May 24 '17 at 22:37
  • You could combine all of those partial urls into a single regex and just do one regex.exec() – Ken May 24 '17 at 22:38
  • How do I turn "www.google.", "www.yahoo.", etc. into versions with the backslashes without wasting too much time? /www\.google\.|www\.yahoo\.|search\.yahoo\.|www\.bing\./i.test(url) – Kevin Crosby May 24 '17 at 23:07
  • If the list of substrings remains constant over some time, go with e.g. https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm – le_m May 26 '17 at 01:33

3 Answers

2

Try a single regex using the | character for alternative values. Now instead of looping through an array, you can simply return a single regex test.

function isSearchEngine(url){
  return /www\.google\.|www\.yahoo\.|search\.yahoo\.|www\.bing\./i.test(url);
}

If your match strings are in an array, try something like this:

    function isSearchEngine2(url, array){
      var fullRegString = array.join("|");//add regex escape characters here if necessary
      return new RegExp(fullRegString).test(url);
    }

    //array of strings we want to match -- ideally add escape characters to these if necessary
    var searchEngines = [
      "www.google.",
      "www.yahoo.",
      "search.yahoo.",
      "www.bing."
    ];

    console.log(isSearchEngine2('www.google.com', searchEngines));//true -correct
    console.log(isSearchEngine2('abcdefg', searchEngines));//false - correct
    console.log(isSearchEngine2('wwwAgoogleAcom', searchEngines));//true -incorrect mis-match because of '.' matching all
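
One way to avoid that mis-match is to escape each string before joining. The following is a minimal sketch, not a definitive implementation: `escapeRegExp`, `buildMatcher`, and `isSearchEngine3` are hypothetical names, and the regex is built once up front on the assumption that the list rarely changes.

```javascript
// Put a backslash in front of every character that is special in a regex,
// so that "." is treated as a literal dot rather than "any character".
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: one case-insensitive regex from a list of plain substrings.
function buildMatcher(substrings) {
  return new RegExp(substrings.map(escapeRegExp).join("|"), "i");
}

// Built once, outside the function, so it is not recompiled on every call.
const searchEngineRegex = buildMatcher([
  "www.google.",
  "www.yahoo.",
  "search.yahoo.",
  "www.bing."
]);

function isSearchEngine3(url) {
  return searchEngineRegex.test(url);
}

console.log(isSearchEngine3("www.google.com")); // true
console.log(isSearchEngine3("wwwAgoogleAcom")); // false
console.log(isSearchEngine3("WWW.Bing.com"));   // true
```

Note that an empty list would produce an empty pattern, which matches every string, so the caller may want to guard against that case.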
Will P.
  • I thought I tried that, but maybe I got my slashes all topsy turvy. But the real question is that those aren't the only choices which can be edited such that I need to OR | that list dynamically. Nevermind. Got it figured out I'm sure. Just new to /regex/ stuff. – Kevin Crosby May 24 '17 at 22:44
  • But now I'm seeing the \. and don't know how to make such dynamic replacements fast. – Kevin Crosby May 24 '17 at 22:47
  • The `.` character in a regular expression matches any character (except line terminators), so technically the pattern still works without the escaping \ character in front. In that case, though, it would also match the value `wwwCgoogleC` (or any other character in place of C). That may not be a problem, but it's an extra possible mis-match you might encounter in the future. To escape all special characters, [this SO answer](https://stackoverflow.com/a/770533/711674) would work but it's probably not all that efficient – Will P. May 25 '17 at 16:51
0

Here is something a little more generic. This will return the string you pass in if it is found in the string you are searching against.

function findIn (str, here) {
    // indexOf returns -1 when str does not occur in here
    let location = here.indexOf(str);
    if (location === -1) {
        return `Sorry but I cannot find ${str}`;
    }
    return here.slice(location, location + str.length);
}

/** examples
console.log(findIn('hoo', "www.yahoo.com/news/some-archive/2103547001450"));

console.log(findIn('www', "www.yahoo.com/news/some-archive/2103547001450"));

console.log(findIn('news', "www.yahoo.com/news/some-archive/2103547001450"));

console.log(findIn('arch', "www.yahoo.com/news/some-archive/2103547001450"));
*/
colecmc
0

Are you expecting a simple true/false from the url, or are you expecting to find multiple searchEngines in one string? I assume it's the former, as urls don't really contain multiple addresses....

Generally, String.indexOf() has the best performance for matching characters. Here's a benchmark I did a while back on various string parsing methods. The benchmark itself is set up to test whether multiple words are all present rather than a single one, so RegExp.test() takes the cake there, but its performance suffers HARD when the result is false. String.indexOf() was by far the most reliable for true/false matches and easily the most performant when testing one string for a single value (I don't have the benchmark for that, sorry).

However, you're doing this in a loop to test for multiple things. As you can see on the benchmark, RegExp.test() is the most performant on successes. If we can assume most of the urls you're passing to the function contain one of those urls, I would recommend using that:

var searchEngines = [
    "www.google.",
    "www.yahoo.",
    "search.yahoo.",
    "www.bing."
  ];

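function isSearchEngine(url){
  // The unescaped "." matches any character; escape the dots if that matters.
  // The 'g' flag is not needed for a single true/false test.
  let regex = new RegExp(searchEngines.join('|'), 'i');
  return regex.test(url); // returns true/false
}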
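
If most urls are not expected to match, a plain substring check avoids regular expressions (and escaping) altogether. This is only a sketch: `searchEngineHosts` and `isSearchEngineByIncludes` are hypothetical names, and no claim is made here about how it compares to the regex in a benchmark.

```javascript
// Hypothetical list of lowercase substrings to look for.
const searchEngineHosts = [
  "www.google.",
  "www.yahoo.",
  "search.yahoo.",
  "www.bing."
];

// Returns true as soon as any substring is found; some() stops at the first hit.
function isSearchEngineByIncludes(url) {
  const lower = url.toLowerCase(); // makes the check case-insensitive
  return searchEngineHosts.some(function (host) {
    return lower.includes(host);
  });
}

console.log(isSearchEngineByIncludes("https://www.google.com/search?q=x")); // true
console.log(isSearchEngineByIncludes("wwwAgoogleAcom"));                    // false
```

Because the substrings are compared literally, the `.` needs no special treatment here.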
joh04667