14

I have been looking for a way to match balanced parenthesis in a regex and found a way in Perl, that uses a recursive regular expression:

my $re;
$re = qr{
           \(
              (?:
                 (?> [^()]+ )       # Non-parens without backtracking
                 |
                 (??{ $re })        # Group with matching parens
              )*
           \)
         }x;

from the perl regular expression site .

Is there a way to do this in Ruby or a similar language?

UPDATE:

For those interested here are some interesting links:

Oniguruma manual - from Sawa's answer.

Pragmatic Programmers' Ruby 1.9 Regular Expressions Sample Chapter

Community
  • 1
  • 1
Yet Another Geek
  • 4,251
  • 1
  • 28
  • 40
  • Oniguruma documentation link is not available anymore. Better source is probably: https://github.com/kkos/oniguruma/blob/master/doc/RE – Gabriel Mazetto Apr 29 '20 at 15:18
  • I'm constantly amazed by the fact that language designers fail to learn from Perl - an amazing language with a somewhat weird syntax. – John Deighan Jun 28 '22 at 18:07

2 Answers2

20

Yes. With oniguruma regex engine, which is built in in Ruby 1.9, and is installable on Ruby 1.8, you can do that. You name a subregex with (?<name>...) or (?'name'...). Then you call a subregex with \g<name> or \g'name' within the same regex. So your regex translated to oniguruma regex will be:

re = %r{
  (?<re>
    \(
      (?:
        (?> [^()]+ )
        |
        \g<re>
      )*
    \)
  )
}x

Also note that multi-byte string module in PHP >=5 uses oniguruma regex engine, so you will be able to do the same.

The manual for oniguruma is here.

sawa
  • 165,429
  • 45
  • 277
  • 381
  • 4
    Then somebody better update [the Wikipedia page](http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines). Still, I much prefer [this simpler answer](http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes/4843579#4843579) of using `\((?:[^()]*+|(?0))*\)`. – tchrist Jun 13 '11 at 14:31
  • @tchrist I just changed recursion, lookbehind, atomic groups, named capture, and unicode property support to yes for Ruby in the feature comparison tables, and removed the note that it was for Ruby 1.8. – sawa Jun 13 '11 at 14:56
  • I should have probably read this page which I bookmarked http://media.pragprog.com/titles/ruby3/ruby3_extract_regular_expressions.pdf – Yet Another Geek Jun 13 '11 at 15:38
  • @Yet Another Geek Good link. Probably around pages 127, 128. – sawa Jun 13 '11 at 16:30
  • 2
    @sawa: Good deed! But you’d best change Uɴɪᴄᴏᴅᴇ prop support to `Some`: Oniguruma supports only Gᴇɴᴇʀᴀʟ_Cᴀᴛᴇɢᴏʀʏ + a few Sᴄʀɪᴘᴛs. Uɴɪᴄᴏᴅᴇ prop support has 4⁺ levels: ⑴ [all 11 props required by RL1.2](http://unicode.org/reports/tr18/#Categories); ⑵ compat props like on `\w` &ᶜ [per RL1.2A](http://unicode.org/reports/tr18/#Compatibility_Properties); ⑶ named chars like `\N{POUND SIGN}` [per RL2.5](http://unicode.org/reports/tr18/#Name_Properties); and ⑷ full support of *all props* [per RL2.7](http://unicode.org/reports/tr18/proposed.html#Full_Properties). **Perl&ICU meet all ④; Ruby meets ⓪.** – tchrist Jun 13 '11 at 16:59
  • @tchrist Thanks for the information. I actually don't know to that extent. Could you fix it for me? – sawa Jun 13 '11 at 17:03
  • 3
    @sawa Gladly. Oniguruma has a many interesting features, but it also has major blunders, like dealing with physical serializations (encodings) at the regex level instead of always in virtual code points. This is a major headache, and a major violation: [Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for useful Unicode support.](http://unicode.org/reports/tr18/) See? – tchrist Jun 13 '11 at 17:09
  • 1
    @sawa eventhough your answer was very helpful you should probably do this small correction: change `(??\g)` to `(\g)` as the `(?? {})` is the perl dynamic recursion operator (or something like that). – Yet Another Geek Jun 14 '11 at 06:28
  • @Yet Another Geek Sorry about it. Thanks for pointing it out. – sawa Jun 14 '11 at 07:57
1

I like the above solution but frequently one wishes to ignore escaped characters. Assuming that \ escapes the following character the following regex handles escaped characters as well.

ESC= /(?<![\\])(?>[\\](?:[\\][\\])*)/
UNESC= /(?:\A|(?<=[^\\]))(?:[\\][\\])*/
BALANCED_PARENS = /#{UNESC}(
                   (?<bal>\(
                    (?>
                      (?>  (?:#{ESC}\(|#{ESC}\)|[^()])+     )
                      |\g<bal>
                    )*
                    \))    ) /xm

Given the limitations of negative lookbehind the part delimited by matching parens will be the first capture not the whole match (the whole match may contain leading escaped backslashes).

The reason for the complexity of ESC and UNESC is the assumption that a \\ is an escaped backslash. We only use the UNESC sequence before the initial paren match since any other escaped parenthesis will be matched inside the atomic group and never backtracked. Indeed, if we tried to use the UNESC prefix for either an interior or final paren match it would fail when [^()] inside the atomic group matched the leading \'s and refused to backtrack.

This regex will scan for the first paren that delimits a validly balanced parenthetical. Thus, given the string " ( ( stuff )" it will match "( stuff )". Frequently, the desired behavior is to locate the first (unescaped) parenthesis and either match the interior (if balanced) or fail to match. Unfortunately, atomic grouping won't stop the entire regex from being backed out of and a match attempted at a later point so we must anchor at the start of the string and only look at the 1st capture. The following regex makes this change:

BALANCED_PARENS = /\A(?:#{ESC}\(|#{ESC}\)|[^()])*+
                  (?<match>\(
                   (?<bal>
                    (?>
                      (?>  (?:#{ESC}\(|#{ESC}\)|[^()])+     )
                      |\(\g<bal>
                    )*
                    \))    ) /xm
Peter Gerdes
  • 2,288
  • 1
  • 20
  • 28