I'm using Python as my scripting language, and this is the regex in question: `[\(\[](\w+ ?)+[\)\]]`

It is basically meant to match words within a set of parentheses or brackets, e.g., (this would match); [this would match]; both groups in (this) and (this) would also match.

The expression works fine for one-off string matches; however, when I use it as a pattern in a broader text processing pipeline, it slows the process down tremendously. If I remove that one pattern, a dataframe of 77k+ rows processes almost instantly. With the pattern included, the estimated run time is about 2 hours.

What's going on here? I've tried removing the brackets and just looking for parens, which seems to have sped things up a tad, but this just doesn't make any intuitive sense.

NOTE: this similar expression, `[\(\[].+[\)\]]`, runs as fast as expected, but is too aggressive in what it removes. On the above example of (this) and (this), it would remove everything between the first and last bracket, resulting in an empty string.

EDIT: A detailed explanation was shared at this duplicate question (Fixing Catastrophic Backtracking in Regular Expression); however, the responders below helped address the specifics of my question.

Victor Vulovic
  • Maybe because string operations in pandas aren't enhanced with Cython… and vectorization would make no difference when looping over individual values. – adir abargil Dec 08 '21 at 18:00
  • Unfortunately, `[this would match too)`. The problem with this regexp is that it has no real separators marking where to start or end the matching, so it has far too many options to check. If you could add some separators, it would improve the parsing speed immensely. – Florin C. Dec 08 '21 at 18:01
  • Repeating something that is itself repeated (the two `+` in your regex) generally results in match times that go up exponentially with length. Try `[\w ]+` instead of `(\w+ ?)+` for the inner part of the expression. – jasonharper Dec 08 '21 at 18:06
  • thanks @FlorinC, that is a helpful explanation – Victor Vulovic Dec 08 '21 at 18:23
  • @jasonharper, your suggestion helped me solve the problem! Thanks! – Victor Vulovic Dec 08 '21 at 18:23
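
The effect described in the comments above can be reproduced with a short timing sketch. It assumes the slow rows are ones with an opening bracket followed by a run of word characters and no closing bracket; the question does not show the actual data, so that is a guess. The names `SLOW`, `FAST`, and `time_search` are made up for illustration.

```python
import re
import time

# Pattern from the question: a quantified group that is itself repeated.
SLOW = re.compile(r"[\(\[](\w+ ?)+[\)\]]")
# Inner part replaced with a single character class, as suggested in the comments.
FAST = re.compile(r"[\(\[][\w ]+[\)\]]")


def time_search(pattern, text):
    """Return the seconds taken by one pattern.search(text) call."""
    start = time.perf_counter()
    pattern.search(text)
    return time.perf_counter() - start


# No closing bracket, so the match must fail. Before giving up, the nested
# quantifier tries every way of splitting the run of letters into groups.
for n in (12, 16, 20):
    text = "(" + "a" * n
    print(n, time_search(SLOW, text), time_search(FAST, text))

# On well-formed input, both patterns remove the same spans.
sample = "keep (this) and (this) too [and this]"
cleaned = FAST.sub("", sample)
print(repr(cleaned))
```

The two patterns are not strictly equivalent: `[\w ]+` also accepts leading or repeated spaces inside the brackets, which `(\w+ ?)+` does not. If that difference matters, it would need to be checked against the real data.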

1 Answer

Have you tried it with a non-greedy qualifier?

[\(\[].+?[\)\]]
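
As a rough check of how the lazy quantifier changes the result, here is a small sketch using the `(this) and (this)` example from the question. The names `GREEDY` and `LAZY` are only illustrative.

```python
import re

GREEDY = re.compile(r"[\(\[].+[\)\]]")   # the "too aggressive" pattern from the question
LAZY = re.compile(r"[\(\[].+?[\)\]]")    # same pattern with a non-greedy quantifier

text = "(this) and (this)"

# The greedy version runs from the first opening bracket to the last closing one.
print(repr(GREEDY.sub("", text)))  # ''

# The lazy version stops at the first closing bracket it reaches.
print(repr(LAZY.sub("", text)))    # ' and '
```

Note that `.+?` accepts any characters between the brackets, not only words and spaces, and like the original pattern it does not require the opening and closing bracket types to match.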
vaizki