2

I have a rather complex regex search in place, and I recently had to extend it to a "reversed" pattern. This was easy to implement, but performance dropped roughly by a factor of 10.

I would appreciate any tips on how to improve this problem, but I'm not particular about how. It's ok, for example, to use two or more steps if this is faster than a single, very involved one.

Original pattern

import regex

expression = regex.compile(
    fr"""
(?:{CUE})  # cue before targets. non capture (?:)
(?:{PADDING})  # text before the match.
(?:
(?:{ITEM_SEPARATION})?  # none or once
(?P<targets>{targets_as_regex})
)+
""",
    regex.VERBOSE,
)

Need to know: CUE is a rather short list of options, something like:

CUE = r"""word
|two\swords
|simple\soptions?
"""

This is about 5-15 options, quite simple (a bit simpler for the reversed scenario).

PADDING is literally anything, but no longer than 150. (PADDING = r".{0,150}?"). In the reversed scenario it's a bit more restrictive, only [,A-Za-z\d\s]

ITEM_SEPARATION is simple: optional spaces followed by either comma, bounded " y " and more optional spaces:

ITEM_SEPARATION = r"""
\s*          # optional spaces
(?:,|\by\b)  # non-capture, either comma, or 'y' (with bounds) 
\s*          # optional spaces
"""

Finally, the problem is likely in targets_as_regex, which is a big list of bounded names. This can easily be hundreds or thousands of options, although each call will pass a unique list of targets.

The idea is to match something like:

This is a text, with a cue, then a bit more text and a match, other match y final match. And after the "." we stop matching.

And it performs sufficiently well. But then we have the reversed situation:

Reversed pattern

The idea is to match this:

This not a match, because of the stop. This is a match, another match y final match, because now we find a cue. Etc.

I naturally recycled my pattern:

expression = regex.compile(
    fr"""
(
(?P<targets>{targets_as_regex})
(?:{ITEM_SEPARATION})?  
)+
(?:{SIMPLE_PADDING})  # text before the match
(?:{CUE_AFTER})  # cue after targets
""",
    regex.VERBOSE,
)

It works as expected, but performance is very poor when targets (that are many) come before cues (which are few).

What I tried

I added a basic check where we omit the whole thing if the cue is not in the text at all, but it did not help very much. Another idea that I have tested is to search all matches for the targets (regardless of cues and all that) and then pass a shortened list (targets_as_regex) with only these matches to the slow pattern. This actually helps, but not by much (still close to that x10 drop in performance).

Any other ideas that could help?

Pablo
  • 1,373
  • 16
  • 36
  • I would assume that the performance drop comes from a problem of type `.*.*` in regex that has to check a lot of possibilities, more when its reversed. Like `test.*` would be more efficient than `.*test` (I know the result is not the same either). Yet I cannot really find out right now. Would have to give a look later, but that's my 2 cents – jkoestinger May 18 '22 at 10:03
  • `((?P{targets_as_regex})(?:{ITEM_SEPARATION})?)+` is a well-known catastrophic backtracking prone pattern (analogous to `(a+b?)+` which is reduced to `(a+)+`), and the more to the left of the pattern, the more dangerous. – Wiktor Stribiżew May 18 '22 at 10:04
  • 1
    It should be written as `fr"""(?:{targets_as_regex})(?:{ITEM_SEPARATION}(?:{targets_as_regex}))*(?:{SIMPLE_PADDING})(?:{CUE_AFTER})"""`. Make sure `targets_as_regex` are built into a regex trie. – Wiktor Stribiżew May 18 '22 at 10:12
  • The backtracking problem did not really matter after all (your alternative works, at about the same speed). But the trie thing blew my mind. I'll have to do some final checks, but it looks like it is about twice FASTER than the "fast" pattern in my example, which also benefits from the trie itself without changing a thing! If you want to write this as answer I'll gladly accept it (otherwise I will write it after a few days to help others who find it) – Pablo May 18 '22 at 14:14

1 Answers1

2

The ((?P<targets>{targets_as_regex})(?:{ITEM_SEPARATION})?)+ is a well-known catastrophic backtracking prone pattern (analogous to (a+b?)+ which is reduced to (a+)+), and the more to the left of the pattern, the more dangerous.

You need to "unroll" patterns like (a+b?)+ into a+(?:ba+)* making the subsequent subpatterns match only at different positions inside the string.

The pattern you need is

fr"""(?:{targets_as_regex})(?:{ITEM_SEPARATION}(?:{targets_as_regex}))*(?:{SIMPLE_PADDING})(?:{CUE_AFTER})"""

To speed it up, you need a regex trie built from the targets_as_regex, see Speed up millions of regex replacements in Python 3.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563