4

I'm trying to write a recursive regular expression to capture code blocks, but for some reason it seems to not be capturing them properly. I would expect the code below to capture the full body of the function, but instead it only captures the contents of the first if statement.

It's almost like the .+? is somehow gobbling up the first {, but it's supposed to be non-greedy so I don't understand why it would.

What is causing it to act this way?

Script:

use strict;
use warnings;

my $text = << "END";
int max(int x, int y)
{
    if (x > y)
    {
        return x;
    }
    else
    {
        return y;
    }
}
END

# Regular expression to capture balanced "{}" groups
my $regex = qr/
    \{              # Match opening brace
        (?:         # Start non-capturing group
            [^{}]++ #     Match non-brace characters without backtracking
            |       #     or
            (?R)    #     Recursively match the entire expression
        )*          # Match 0 or more times
    \}              # Match closing brace
/x;

# is ".+?" gobbling up the first "{"?
# What would cause it to do this?
if ($text =~ m/int\s.+?($regex)/s){
    print $1;
}

Output:

{
        return x;
    }

Expected Output:

{
    if (x > y)
    {
        return x;
    }
    else
    {
        return y;
    }
}

I know that there is a Text::Balanced module for this purpose, but I am attempting to do this by hand in order to learn more about regular expressions.

tjwrona1992
  • 8,614
  • 8
  • 35
  • 98

2 Answers2

6

(?R) recurses into the whole pattern – but what is the whole pattern? When you embed the quoted $regex into /int\s.+?($regex)/, the pattern is recompiled and (?R) refers to the new pattern. That's not what you intended.

I'd recommend you use named captures instead so that you can recurse by name. Change the $regex like

/(?<nestedbrace> ... (?&nestedbrace) ...)/

If you want to avoid extra captures, you can use the (?(DEFINE) ...) syntax to declare named regex patterns that can be called later:

my $define_nestedbrace_re = qr/(?(DEFINE)
  (?<nestedbrace ... (?&nestedbrace) ...)
)/x;

Then: /int\s.+?((?&nestedbrace))$define_nestedbrace_re/

That won't create additional captures. However, it is not generally possible to write encapsulated regex fragments. Techniques like preferring named captures instead of numbered captures can help here.

amon
  • 57,091
  • 2
  • 89
  • 149
  • that works, but adds another problem. if I change my `if` statement to print `$1` and `$2` it will now print something for `$2`, but the way I'm going to use this pattern having it capture a group internally can make the code that uses it confusing. – tjwrona1992 Sep 08 '17 at 14:12
  • I basically want to capture the codeblock and leave `$2` undefined so it only captures the groups that I explicitly ask for. – tjwrona1992 Sep 08 '17 at 14:13
  • @tjwrona1992 see my update – by using `(?(DEFINE) ...)` blocks you can structure your regex to avoid extra captures. – amon Sep 08 '17 at 14:22
  • Can you explain the `((?&nestedbrace))`? Why is that needed before using `$define_nested_brace_re? – tjwrona1992 Sep 08 '17 at 14:26
  • @tjwrona1992 The `$define_nested_brace_re` pattern only defines a named subpattern. It does not match it by itself and does not capture the results. The `(?(DEFINE) …)` is very much like a `sub` declaration. You could rewrite the pattern to define the pattern and also run it, but I find that defining and invoking subpatterns separately is more maintainable for large regexes. – amon Sep 08 '17 at 14:36
  • For this scenario I think defining and running it in one would be simpler as far as readability, but trying that I end up having another problem. I added an extra line to the text saying `more stuff` or something else random, then changed my expression to `if ($text =~ m/int\s(.+?)$regex(.+)/s){ print $1; print $2; }` but after executing it I get an error saying `$2` is undefined. – tjwrona1992 Sep 08 '17 at 15:14
  • There is nothing obviously wrong with that regex. This might be an unrelated problem, and these comments are not a suitable place to debug it. – amon Sep 08 '17 at 15:34
  • Okay I think I may need to post a new question for this new problem, because it is really too much to describe here. – tjwrona1992 Sep 08 '17 at 15:47
1

You can change your recursive pattern to this one:

/int\s+.*?  (
    \{              # Match opening brace
        (?:         # Start non-capturing group
            [^{}]++ # Match non-brace chars without backtracking
            |       # OR
            (?-1)   # Recursively match the previous group
        )*          # Match 0 or more times
    \}
)/sx
  • Note use of (?-1) instead of (?R) that recurses whole matched pattern.
  • (?-1) is back-reference of previous capturing group.

Updated RegEx Demo

anubhava
  • 761,203
  • 64
  • 569
  • 643
  • This also works, but just like the other solution it causes the recursive pattern to capture an extra group that is visible from the outer pattern. so `if ($text =~ m/int\s.+?($regex)/s)` will now capture something for `$1` as expected but also captures a value for `$2`. – tjwrona1992 Sep 08 '17 at 14:15
  • The way this will be used I may want to capture a group before it, and after it like so `if ($text =~ m/int\s(.+?)$regex(.+?)/s) { print "$1, $2"; }` but now the second capture group will really be in `$3` which is really confusing. – tjwrona1992 Sep 08 '17 at 14:16
  • Sorry I didn't get. There is only one capturing group here: https://regex101.com/r/lJ4bXk/3/ – anubhava Sep 08 '17 at 14:16
  • 2
    I think what I'm trying to ask in these comments should probably be a whole new question. These answers do what I asked in this question. – tjwrona1992 Sep 08 '17 at 14:19