
I have a large dictionary whose keys are regex patterns and whose values are numeric. Given a corpus (broken down into a list of individual word tokens), I would like to find the regex that best matches each word in order to obtain its respective value.

The dictionary contains many regex patterns that are ambiguous, in the sense that a word may have multiple regex matches, and therefore you would want to find the longest regex, i.e. the 'best match' (e.g. the dictionary contains affect+, as well as affected and affection).

My issue is that when running a large text sample through the dictionary and finding the regex match of each word token, it takes a long time (0.1 s per word), which obviously adds up over thousands of words. This is because it goes through the whole dictionary for every word to find the 'best match'.

Is there a faster way to achieve this? Please see the problematic part of my code below.

import re

matchedWords = []
for word in textTokens:
    for reg, value in dictionary.items():
        if re.match(reg, word):
            matchedWords.append(reg)
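As written, the inner loop appends every pattern that matches, not the single best one. A minimal sketch of one way to pick the longest matching pattern while compiling each regex only once (the dictionary contents below are made-up placeholders, and `best_match` is an illustrative name, not from the original code):

```python
import re

# Hypothetical stand-in for the real dictionary of regex keys.
dictionary = {"affect+": 1.0, "affected": 2.0, "affection": 3.0, "go+": 0.5}

# Compile each pattern once, sorted longest-first, so the first hit is
# the longest pattern that matches (a reasonable proxy for "best match"
# with prefix-style keys like these).
compiled = sorted(
    ((re.compile(reg), reg, value) for reg, value in dictionary.items()),
    key=lambda item: len(item[1]),
    reverse=True,
)

def best_match(word):
    """Return (pattern, value) for the longest regex matching `word`."""
    for pattern, reg, value in compiled:
        if pattern.match(word):
            return reg, value
    return None

print(best_match("affection"))  # prints: ('affection', 3.0)
```

Note this still scans the pattern list for each word in the worst case; it only removes the redundant recompilation and the ambiguity about which match wins.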
  • Do these regexps have a special structure, or is everything that is a valid regexp possible? – MrSmith42 Sep 03 '21 at 16:06
  • There is only one type of regexp used throughout the entire dictionary, and it is the '+' quantifier after a word (fault+, go+, etc.). – CSStudent19583 Sep 03 '21 at 16:45
  • Sooooo what you're saying is that python's `re` module is not suited for this task at all, because those are basically not regexps. – Stef Sep 03 '21 at 16:50
  • The first thing that comes to mind when I read your question is a [prefix tree](https://en.wikipedia.org/wiki/Trie), which you can use to find the longest prefix that matches a word. – Stef Sep 03 '21 at 16:51
  • Haha great, I'm glad I've spent so much time fussing over the wrong technique! I'll try and implement a prefix tree now. Thank you – CSStudent19583 Sep 03 '21 at 16:56
  • *"Haha great, I'm glad I've spent so much time fussing over the wrong technique!"* Welcome to the wonderful world of research ;) – Stef Sep 03 '21 at 17:01
  • Helpful resources: [How to create a trie in Python?](https://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python) although maybe you would prefer not to look at spoilers :p – Stef Sep 03 '21 at 17:12
  • No no, I'll take all the spoilers I can get!! thanks again Stef. – CSStudent19583 Sep 03 '21 at 17:14
  • If you do end up with a working solution, you can post an answer to your own question. – Stef Sep 05 '21 at 13:03
  • Will do. I've stuck with my slow method for now, because I have to hand my project in next week and need to focus on completing other areas of the project before I can devote time to optimising this function more. I will definitely post here when I'm done to give everyone some closure, and describe the solution used. – CSStudent19583 Sep 06 '21 at 12:32
  • Hi Stef. I know this is very late and after the fact, but this is a quick note to say that I have finally been able to implement the Trie structure you recommended and it worked perfectly! Thank you very much for your help. I've linked my solution below so please feel free to check it out. – CSStudent19583 Jan 09 '22 at 00:31
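Since the keys are effectively plain prefixes once the trailing `+` is stripped, the prefix-tree idea from the comments can be sketched roughly as below. This is a sketch, not the poster's actual solution; `TrieNode`, `insert`, and `longest_prefix_value` are illustrative names, and the sample dictionary is made up:

```python
class TrieNode:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}
        self.value = None  # numeric value if a dictionary key ends here

def insert(root, key, value):
    """Insert a dictionary key (trailing '+' stripped) into the trie."""
    node = root
    for ch in key.rstrip("+"):
        node = node.children.setdefault(ch, TrieNode())
    node.value = value

def longest_prefix_value(root, word):
    """Walk the trie along `word`, remembering the deepest stored value."""
    node, best = root, None
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            break
        if node.value is not None:
            best = node.value
    return best

root = TrieNode()
for key, value in {"affect+": 1.0, "affected": 2.0, "affection": 3.0}.items():
    insert(root, key, value)

print(longest_prefix_value(root, "affection"))  # prints: 3.0
```

Each lookup now costs O(len(word)) regardless of dictionary size, which is exactly what removes the per-word scan over the whole dictionary.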

1 Answer


Because you mentioned the input regexes have the structure word+ (a plain word followed by a '+' quantifier), you can use a modified version of the Aho-Corasick algorithm. In this algorithm you build a finite state machine from the search patterns, and it can be modified to accept some regex constructs. In your case a very simplistic solution would be to pad your keys to the length of the longest word in the list and accept anything that comes after the padding. The wildcards '.' and '?' are easy to implement; for '*' one has to either go to the end of the word or backtrack and follow the other path through the list, which can represent an exponential number of choices in constant memory (as with any deterministic finite automaton).

A finite state machine for a list of your regex keys can be built in linear time, meaning it takes time proportional to the sum of the lengths of your dictionary keys. As explained here, there is also support for finding the longest matched word in the dictionary.
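For literal patterns (trailing '+' stripped), the linear-time automaton described above can be sketched as a plain Aho-Corasick build: a trie plus BFS-computed failure links. This is a compact illustration under that assumption, not the answerer's code; `build_automaton` and `longest_match` are made-up names:

```python
from collections import deque

def build_automaton(patterns):
    """Aho-Corasick tables: goto (trie edges), fail links, outputs."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:                      # build the trie
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    queue = deque(goto[0].values())           # depth-1 states keep fail = root
    while queue:                              # BFS to set failure links
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] += out[fail[t]]            # inherit matches from fail state
    return goto, fail, out

def longest_match(tables, word):
    """Scan `word` once; return the longest pattern found anywhere in it."""
    goto, fail, out = tables
    state, best = 0, None
    for ch in word:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            if best is None or len(pat) > len(best):
                best = pat
    return best

tables = build_automaton(["affect", "affected", "affection"])
print(longest_match(tables, "affection"))  # prints: affection
```

Unlike a bare prefix tree, the failure links let this find patterns anywhere inside a word (e.g. "affected" inside "unaffected") in a single pass, which is what the extra edges discussed in the comments below buy you.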

FazeL
  • Given that the "patterns" are all prefixes, wouldn't the finite state machine naturally end up in the shape of a prefix tree? – Stef Sep 04 '21 at 08:28
  • @Stef Important point, kudos. Even when the patterns aren't prefixes, the FSM ends up in the form of a prefix tree; it just has a bunch of extra edges that prevent restarting the match from scratch and re-matching the already-consumed characters, moving instead to a path consistent with the previous characters plus the one just consumed. So it essentially is a prefix tree with some extra auxiliary edges that help in some queries. – FazeL Sep 05 '21 at 11:57