I have a large dictionary whose keys are regex patterns and whose values are numeric, and given a corpus (broken down into a list of individual word tokens) I would like to find the regex pattern that best matches each word so I can obtain its associated value.
Many of the patterns are ambiguous, in the sense that a single word may match more than one of them, so I want to select the longest regex, i.e. the 'best match' (for example, the dictionary contains affect+ as well as affected and affection).
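To make the selection rule concrete, this is roughly what I mean by 'best match' (a minimal sketch; candidates here is a hypothetical list of patterns that all potentially match a given word):

import re

def best_match(word, candidates):
    # Among the patterns that actually match the word, prefer the longest pattern string
    matching = [reg for reg in candidates if re.match(reg, word)]
    return max(matching, key=len) if matching else None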
My issue is that when I run a large text sample through the dictionary and look up the regex match for each word token, it takes a long time (roughly 0.1 s per word), which quickly adds up over thousands of words. This is because the code goes through the whole dictionary for every word to find the 'best match'.
Is there a faster way to achieve this? Please see the problematic part of my code below.
import re

matchedWords = []
for word in textTokens:
    # Compare every pattern in the dictionary against the current token
    for reg, value in dictionary.items():
        if re.match(reg, word):
            matchedWords.append(reg)
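One thing I have considered is compiling every pattern once up front instead of passing raw strings to re.match inside the loop, roughly like the sketch below (same dictionary and textTokens as above), but I assume this still scans the whole dictionary for every word, so I'm not sure it addresses the real cost:

import re

# Compile each pattern once so the per-word work is only the match call
compiled = {re.compile(reg): value for reg, value in dictionary.items()}

matchedWords = []
for word in textTokens:
    for pattern, value in compiled.items():
        if pattern.match(word):
            matchedWords.append(pattern.pattern)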