2

I'm having an issue when using re.search() or re.findall(). But let me explain first. I'm searching through C-Code, which contains definitions like this:

#define SOME_FUNCTION( someArgument, someMoreArgument)

Sometimes, when they have a lots of arguments, they are wrapped, so they look like this:

#define ANOTHER_FUNCTION( moreArgument, evenMoreArguments,
                          anAwfulLotOfArguments, youGetIt)

Now I'm using python to scan through these files to find every single one of these definitions. My regex looks like follows:

#define [A-Z,_]*\((?:.|\s)+?\"

When testing this with single- and multiline definitions in testing engines like regex101.com it works flawlessly.

However, when I use this with python (3.4.1), it only works for single-lined definitions. When it tries to scan multiline definitions, it just stops executing (although I can interrupt it with Ctrl + C). I tried using :

regexFullMacro = re.compile("#define [A-Z,_]*\((?:.|\s)+?\)")
match = regexFullMacro.search(searchString)

as well as

regexFullMacro = re.compile("#define [A-Z,_]*\((?:.|\s)+?\)")
match = regexFullMacro.findall(searchString)

where both tries just stop responding when it comes to multiple lines.

Has anyone had this issue before or am I just utterly stupid and missing something obvious?

John Kugelman
  • 349,597
  • 67
  • 533
  • 578
Pixel-Pi
  • 85
  • 6
  • When you test on regex101 are you using the full contents of a source file as sample? You probably have an issue with backtracking but this would not be noticeable if you only test with short strings... – Giacomo Alzetta Sep 19 '18 at 12:55
  • 3
    An obvious thing is to NEVER use `(?:.|\s)+?` inside a regex pattern. It causes so heavy backtracking that the engine gets stuck. You always want `.` with a DOTALL modifier or `[\s\S]` when it is required to only apply this behavior for a part of the pattern. In this case, `"#define [A-Z,_]*\([^()]+\)"` should be enough since negated character class `[^()]` matches any char but `(` and `)` including newlines. – Wiktor Stribiżew Sep 19 '18 at 12:55
  • @WiktorStribiżew Right . Use `[^)]+` instead in that place, this wont do any backtracking. – Giacomo Alzetta Sep 19 '18 at 12:56
  • Thanks for the quick response. I used the solution provided by @WiktorStribiżew and it worked flawlessly. If I could ask for a favor, could you please describe a bit, what [^()]+ does better than (?:.|\s)+? and why that solution does not create any backtracking? That would be awesome – Pixel-Pi Sep 19 '18 at 13:29

1 Answers1

1

Solution

r"#define [A-Z,_]*\([^()]+\)"

The negated character class [^()] matches any char but ( and ) (including newlines) and the + quantifier makes the regex engine match 1 or more consecutive occurrences of them.

Root cause

The (?:.|\s)+? inside a regex pattern (when there are other subpatterns to follow) causes too much expansion operations (it is a lazy pattern, and we can only talk about backtracking with greedy modifiers) that freezes the regex engine. The main problem with this alternation group is that . can match what \s matches. When alternations in a group can match at the same location, it is always a bottleneck as once one alternative fails, the other is tried and goes on trying many more times if the group is quantified (as here).

You always want . with a DOTALL modifier or [\s\S] instead when it is required to only apply this behavior for a part of the pattern (in Python re, the [\S\s] workaround is very helpful as re does not support modifier groups).

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563