1

What would be the Regex to grab the innermost set of parentheses containing a specific character; '|' in this case?

Some examples and a (c#) test method:

string[] tests = {
    "x () y", "",
    "x (a) y", "",
    "x (a.b()) y", "",
    "x ((a).b() | (b).c()) y", "(a).b() | (b).c()",
    "x (a|b) y", "a|b",
    "x ((a|b) | c)", "a|b",
    "x (a|b|c) y", "a|b|c",
    "x (a|a.b()|c) y", "a|a.b()|c",
    "x (a.b()|b.c()) y", "a.b()|b.c()",
    "x (a.b()|b.c()|c) y", "a.b()|b.c()|c",
    "x (a|b.c()|c.d()) y", "a|b.c()|c.d()",
    "x (a|(b.c()|d)) y", "b.c()|d",
    "x (a|a.b(a)|c) y", "a|a.b(a)|c"
};

for (int i = 0; i < tests.Length; i+=2)
{
    var match = re.Match(tests[i]);
    var result = match.Groups[1].Value;
    Assert.That(result, Is.EqualTo(tests[i + 1]));
}
Konamiman
  • 49,681
  • 17
  • 108
  • 138
rednoyz
  • 1,318
  • 10
  • 24

2 Answers2

3

The "very simple" regex that solves all the tests:

var re = new Regex(@"
(?:\()
(
    (?>
        (?:  
            (?<p>\()  |  
            (?<-p>\))  |  
            [^()|]+  |  
            (?(p)(?!))(?<pipe>\|)  
        )*  
    )    
)
(?:\))
(?(p)(?!))
(?(pipe)|(?!))", RegexOptions.IgnorePatternWhitespace);

string result = match.Groups[1].Value;

Note the use of RegexOptions.IgnorePatternWhitespace. The regex is based on balancing groups. Based on the fact that you shouldn't try to use a regex you don't fully comprehend, I'll leave out the exact explanation about how it works. I'll only say that the check (?(pipe)|(?!)) checks if at least a | was captured in the capture, while (?(p)(?!)) means "there are no still open parentheses that were captured by the (?<p>\() expression".

My opinion on this regex is that it s a futile and dangerous exercise in regexes! (if it isn't clear I'm of the Some people, when confronted with a problem, think "I know, I'll use regular expressions" Now they have two problems. school of thought). You shouldn't use it. It is an undebuggable code horror.

Additional thing: this regex backtracks heavily... added (?> ... ) to disable backtracking.

Additional tests for backtracking (the first one has unbalanced parenthesis):

"((((amusemen).emoadj().cap()(, (are we |arent we|I gather)|)?)", "are we |arent we|I gather",
"((amusemen).emoadj().cap()(, (are we |arent we|I gather)|)?)", "are we |arent we|I gather",
xanatos
  • 109,618
  • 12
  • 197
  • 280
  • This regex does indeed work for the set of test inputs. However, with other inputs (see [this case](https://gist.github.com/dhowe/eaec94b214c88e0e5a79ae017ba460a6) for example) it cause my system to hang for up to 30 seconds. Can someone verify they get the same behavior? Seems very strange – rednoyz May 17 '18 at 10:21
  • @rednoyz Unbalanced parenthesis (there are 8x ( and 6x ) ) – xanatos May 17 '18 at 10:24
  • @rednoyz Added `(> ... )` to disable backtracking. Now it is instant both with the "wrong" string you proposed and a corrected version of that string. – xanatos May 17 '18 at 10:31
0

The following method is probably overcomplicated and has room for optimization, but you can use it as a starting point it in case you can't find a better alternative. I have added some basic parenthesis balancing checking.

The method passes all your use cases, but there are a couple of interesting ones that you haven't considered; this method solves them as follows:

  • a|b (character found outside parenthesis): ignored, so this returns an empty string
  • (a|b) c|d: returns a|b (same as above)
  • (a|b) (c|d) (more than one set matching): returns c|d (the last one found)
string FindInnermostSet(string source, char toSearch = '|')
{
    var candidateOpeningParenthesisPosition = -1;
    var candidateClosingParenthesisPosition = -1;
    var candidateOpeningParenthesNestingLevel = -1;
    var openingParenthesisPositions = new Stack<int>();

    for(int i=0; i<source.Length; i++)
    {
        var currentChar = source[i];
        if(currentChar == '(') {
            openingParenthesisPositions.Push(i);
        }
        else if (currentChar == ')')
        {
            if(!openingParenthesisPositions.Any())
                throw new Exception("Syntax error: too many ')'");
            if(openingParenthesisPositions.Count() == candidateOpeningParenthesNestingLevel)
                candidateClosingParenthesisPosition = i;
            openingParenthesisPositions.Pop();              
        }
        else if(currentChar == toSearch && 
                openingParenthesisPositions.Any() &&
                openingParenthesisPositions.Count() >= candidateOpeningParenthesNestingLevel)
        {
            candidateOpeningParenthesNestingLevel = openingParenthesisPositions.Count();
            candidateOpeningParenthesisPosition = openingParenthesisPositions.Peek();
        }
    }

    if(openingParenthesisPositions.Any())
        throw new Exception("Syntax error: missing ')'");

    if(candidateOpeningParenthesisPosition == -1)
        return "";

    return source.Substring(
        candidateOpeningParenthesisPosition+1,
        candidateClosingParenthesisPosition-candidateOpeningParenthesisPosition-1);
}
Konamiman
  • 49,681
  • 17
  • 108
  • 138