1

I am trying to write a regex that matches a very long list of words (4000 words) if the word is at the start of the string or at the end of the string or preceded and followed by a special character, the current regex I am using is this:

((?:[^a-zA-Z0-9]|^)FIND(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)ANY(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)MATCHING(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)WORD(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)BY(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)THIS(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)VERY(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)LONG(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)REGEX(?:[^a-zA-Z0-9]|$))|((?:[^a-zA-Z0-9]|^)PATTERN(?:[^a-zA-Z0-9]|$))

This regex goes on for around 4000 words and I use python re module / ripgrep to check some strings for matches and I want to know the matching word for every string.

I am using non capturing groups because I don't really mind what comes before or after the word, only the word matched itself.

However, this takes around 3-4 seconds per iteration on a raspberry pi for some generic string I tested with, and I want to know if I can somehow generate a faster pattern for this usage.

Thanks.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    Maybe it will be enough to build a pattern like `(?<![^\W_])(?:FIND|ANY|etc.)(?![^\W_])`? If `FIND`, `ANY`, etc. are too many words, you may even create a regex trie from them for faster whole word search. – Wiktor Stribiżew Sep 17 '21 at 15:07
  • @WiktorStribiżew Thanks, it is indeed faster! – user3731180 Sep 17 '21 at 15:51

1 Answers1

1

First of all, the (?:[^a-zA-Z0-9]|^) and (?:[^a-zA-Z0-9]|$) patterns are used here as word boundaries with _ excluded. It makes sense to streamline them and use (?<![^\W_]) and (?![^\W_]) respectively.

Next, the words you have can be processed to create a regex trie out of them for efficient search.

Here is an example code:

from trieregex import TrieRegEx
keywords = ['FIND', 'ANY', 'MATCHING', 'WORD', 'BY', 'THIS', 'VERY', 'LONG', 'REGEX', 'PATTERN', 'PARROT', 'FIGHT']
pattern = fr'(?<![^\W_])({TrieRegEx(*keywords).regex()})(?![^\W_])'
# => (?<![^\W_])((?:PA(?:TTERN|RROT)|FI(?:GHT|ND)|MATCHING|REGEX|LONG|THIS|VERY|WORD|ANY|BY))(?![^\W_])

Just make sure you install trieregex beforehand.

See the resulting regex pattern.

See also another demo based on this regex trie solution:

import re

class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())


text = r'FIND ANY MATCHING WORD BY THIS VERY LONG REGEX PATTERN FIGHT FIGHTER PARROT PARROT_ING'
keywords = ['FIND', 'ANY', 'MATCHING', 'WORD', 'BY', 'THIS', 'VERY', 'LONG', 'REGEX', 'PATTERN', 'PARROT', 'FIGHT']
trie = Trie()
for word in keywords:
    trie.add(word)
pattern = fr'(?<![^\W_])({trie.pattern()})(?![^\W_])'
print(re.findall(pattern, text))

Output:

['FIND', 'ANY', 'MATCHING', 'WORD', 'BY', 'THIS', 'VERY', 'LONG', 'REGEX', 'PATTERN', 'FIGHT', 'PARROT', 'PARROT']

Note the two occurrences of PARROT, the last one comes from the PARROT_ING string part.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    Thanks, I have never heard of trieregex before. my script takes 30 minutes instead of 3 days now :) – user3731180 Sep 17 '21 at 18:18
  • just mentioning, while it is an interesting project, sometimes creating two TrieRegEx objects from different keywords resulted in the same regex (although different keywords were used) like there is some sort of cache it uses / object creation is wrong.. Also, sometimes the object regex was empty.. I am using the regex pattern you suggested but without trieregex now.. – user3731180 Sep 17 '21 at 18:59
  • 1
    @user3731180 You do not have to rely on that library. You may use [this solution](https://stackoverflow.com/a/42789508/3832970) to create regex tries. – Wiktor Stribiżew Sep 17 '21 at 19:14