-2

I have this regex:

"(WORD1.*WORD2.*WORD3)|(WORD1.*WORD3.*WORD2)|(WORD2.*WORD1.*WORD3)|(WORD2.*WORD3.*WORD1)|(WORD3.*WORD1.*WORD2)|(WORD3.*WORD2.*WORD1)"

It matches to these words:

WORD1WORD2WORD3
WORD1AWORD2BWORD3C
WORD3WORD1WORD2
WORD1WORD2WORD3WORD1

But not these words:

WORD1WORD1WORD2
WORD1AWORD1BWORD2C

This regex matches when it finds a string where there are the 3 words (WORD1, WORD2, WORD3) in any order.

I would like to do the same thing with more words but the problem is that the size of the regex increases exponentially with the number of words. Is it possible to simplify the way this regex is constructed to solve this problem (that the size does not increase exponentially) ?

InSync
  • 4,851
  • 4
  • 8
  • 30
ipStack
  • 381
  • 1
  • 12
  • 3
    Maybe regex isn't the right tool for this? – evolutionxbox May 04 '23 at 15:55
  • Could you tell us exactly what you are trying to accompish here? It might seem obvious to you - but take a step back and explain :) – Thomas Frank May 04 '23 at 16:00
  • Beacuse WORD is not a magic part of regex.. do you actually intend to match when somebody writes "WORDX" and why? – Thomas Frank May 04 '23 at 16:01
  • The aim is for the user to find strings according to the keywords they want to search for (not necessarily in order) – ipStack May 04 '23 at 16:04
  • First begin with your intention - what is your input and what is your expected output. Give examples. – Steve Tomlin May 04 '23 at 16:06
  • Is it valid if the same word occurs twice in the string? Like `WORD1WORD2WORD3WORD1` – The fourth bird May 04 '23 at 20:49
  • Yes, '''WORD1WORD2WORD3WORD1''' is valid. User input is a list a keywords (separate with space caracter) and the script transforms it into a regex to execute the search... perahps it isn't the good way to do it. – ipStack May 05 '23 at 07:22
  • Not much shorter, but certainly more "fun" (and useful if you need to extract the matches): `(WORD1|WORD2|WORD3).*(?!\1)(WORD1|WORD2|WORD3).*(?!\1|\2)(WORD1|WORD2|WORD3)` https://regex101.com/r/0jdsxX/2 – Jay May 05 '23 at 10:40
  • @ipStack The detail about keywords being separated with a space character is actually an important distinction. It can make the difference of whether "ORDER" is present in, for example, the set { "EDGE", "BORDER" }. – phatfingers May 05 '23 at 17:15

2 Answers2

2

You could use positive lookahead for each of the words.

/(?=.*WORD1)(?=.*WORD2)(?=.*WORD3).*/

The more performant version below specifies a starting anchor and only matches a single character after verifying lookaheads. This technique is intended only for matching, as requested by the OP, and not intended for extraction.

/^(?=.*WORD1)(?=.*WORD2)(?=.*WORD3)./

A positive lookahead acts like a gate in that it proceeds only if the specified match within its parentheses exists, but it doesn't consume or capture what it matches-- it's always zero-length. If you "look ahead" to see if each word exists preceded by anything .*, then it doesn't matter what order those words are in. You proceed if true for each word without the match using up any content.

If you're only concerned with whether the content matches or not, then the only material difference between the two expressions is how long they take. Let's say you have only 2 of the 3 required words in your content. Unless the software that interprets the expression can recognize the futility of trying, it may look for the three words at the first position, fail, then try at the second position, fail, etc. until it reaches the last position before it gives up. By specifying ^, it will only check at the first position, saving the time of the other unnecessary checks. Removing the * from the end prevents some unnecessary capture when you're just looking for a true/false answer of whether all your words exist in the content.

phatfingers
  • 9,770
  • 3
  • 30
  • 44
  • 1
    Note that this would match the whole string (e.g. `A WORD1 B WORD3 WORD2 C`), whereas the original would only match as needed (e.g. (`WORD1 B WORD 3 WORD 2`). That is, if the match's content is important. – InSync May 04 '23 at 16:18
  • You could add an anchor to the pattern like `^(?=.*WORD1)(?=.*WORD2)(?=.*WORD3).*` to prevent the lookahead from firing on multiple positions when there is no match. – The fourth bird May 04 '23 at 20:51
  • 1
    @Thefourthbird Nice optimization. I'll add it to the answer. – phatfingers May 04 '23 at 23:55
  • I thank you, the 2 regex expression work... but I don't understand how it works and what is the difference. Can you explain with more details ? – ipStack May 05 '23 at 07:42
  • Although it would fail for your given examples, which don't separate words by spaces or other delimiters, if your words actually are separated, then I would recommend using '`\bWORD\b` for each of the words in the expression. – phatfingers May 05 '23 at 17:07
2

Simply iterate over all strings and filter out all those which does not include all keywords:

(a terser version can be found in the snippet below)

function findMatch(strings, keywords) {
  const result = [];
  
  for (const string of strings) {
    if (keywords.every(keyword => string.includes(keyword))) {
      result.push(string);
    }
  }
  
  return result;
}

Try it:

console.config({ maximize: true });

function findMatch(strings, keywords) {
  return strings.filter(
    string => keywords.every(keyword => string.includes(keyword))
  );
}

const testcases = [
  'WORD1WORD2WORD3',
  'WORD1AWORD2BWORD3C',
  'WORD3WORD1WORD2',
  'WORD1WORD2WORD3WORD1',
  'WORD1WORD1WORD2',
  'WORD1AWORD1BWORD2C'
];

const keywords = [
  'WORD1', 'WORD2', 'WORD3'
];

console.log(findMatch(testcases, keywords));
<script src="https://gh-canon.github.io/stack-snippet-console/console.min.js"></script>
InSync
  • 4,851
  • 4
  • 8
  • 30
  • "Note that this would match the whole string (e.g. A WORD1 B WORD3 WORD2 C), whereas the original would only match as needed (e.g. (WORD1 B WORD 3 WORD 2). That is, if the match's content is important. – InSync" – phatfingers May 04 '23 at 16:43
  • @phatfingers That was a note for readers, not for you. Do you seriously need to copy and paste it verbatim? – InSync May 04 '23 at 16:47
  • Just teasing. In all probability, the OP is just looking for true/false of whether it matches and your solution would satisfy that just fine. – phatfingers May 04 '23 at 16:48
  • Is this solution better than the one with positive lookahead regexes? – ipStack May 05 '23 at 07:47
  • @ipStack This operates directly on strings, so it's potentially faster than using regex. I haven't done a benchmark, though. – InSync May 05 '23 at 23:57