Superpower parser for balanced nested parentheses

Question

I'm struggling to come up with a superpower parser for the set of partial inputs below (nested, balanced parentheses with '|' separator).

Arbitrary text can go inside the parens, including whitespace, other tokens, and "()". Only '|', '(', ')', should have special meaning here (a newline would also end the sequence). To be valid, each balanced, parenthesized group must have a '|' and at least one character that is not '(' or ')'.

Ideally the parser would split each input into a list, with elements either a (terminal) string, or an array of strings, as follows:

Valid:

(a|)                    ->      { "a", "" }
(a | b)                 ->      { "a", "b" }
(a | b.c())             ->      { "a", "b.c()" }
(aa | bb cc )           ->      { "aa" "bb cc" }
(a | b | c #dd)         ->      { "a", "b", "c #dd"}
((a | b) | $c)          ->      { { "a", "b" }, "$c" }
((a | b) | (c | d))     ->      { { "a", "b" }, { "c", "d" } }
(((a | b) | c) | d)     ->      { { { "a", "b" }, "c" }, "d" }
...

Invalid/ignored:

()
())
(()
(|)
(|())
(.)
(())
(()|())
(abc)
(a bc)
(a.bc())
...

My tokens (for the purposes here) are as follows:

public enum Tokens
{        
    [Token(Example = "(")]
    LParen,

    [Token(Example = ")")]
    RParen,

    [Token(Example = "|")]
    Pipe,

    [Token(Description = "everything-else")]
    String
}

Yes (and you can see the tokens in the source above...) thanks — rednoyz, Oct 10 '18 at 04:05
what kind of output are you looking for? I'm assuming you need each paren group to be parsed into an object that contains info about what's inside, but could you show the specific classes that you want the data parsed into? that would help — jtate, Oct 10 '18 at 12:08
So you want the original string? Could you update the question with some additional details about this with some examples of ouptut that you're looking for? — jtate, Oct 11 '18 at 12:27
I've added outputs above per your suggestion @jtate. Ideally the parser would split each input into a list, the element of which being either a (terminal) string, or an array of child strings (recursively) — rednoyz, Oct 11 '18 at 15:40
problem is I don't know how to count parens in the tokenizer to verify they are balanced — rednoyz, Oct 16 '18 at 08:28
Given the superpower description page at github, have a look at `static readonly TokenListParser Factor = [...]`, it shows how to express the expectation of a left and right parenthesis with a sub-expression in between. Point is: you shouldn't count parens, you should expect them to appear and error out if your expectation is not met by reality. However, your grammar has some irregular requirements that will provide you with additional challenges, like the invalid `(|)` expression, when `(a|)` and `(|a)` are allowed... — grek40, Oct 16 '18 at 21:32
edge-cases like `(|)` can be handled as an error condition later -- are there other 'irregular requirements' in the list above I should be aware of? — rednoyz, Oct 17 '18 at 02:52

jtate · Accepted Answer · 2018-10-21T14:21:10.117

This was tricky, mostly because of the whitespace you need to preserve, but I was able to come up with a parser that meets your needs. First, I had to alter your Tokens enum slightly:

public enum Tokens
{
    None,
    String,
    Number,

    [Token(Example = "()")]
    OpenCloseParen,

    [Token(Example = "(")]
    LParen,

    [Token(Example = ")")]
    RParen,

    [Token(Example = "#")]
    Hash,

    [Token(Example = "$")]
    Dollar,

    [Token(Example = "|")]
    Pipe,

    [Token(Example = ".")]
    Dot,

    [Token(Example = " ")]
    Whitespace,
}

Next, we can build the following Tokenizer:

var tokenizer = new TokenizerBuilder<Tokens>()
    .Match(Span.EqualTo("()"), Tokens.OpenCloseParen)
    .Match(Character.EqualTo('('), Tokens.LParen)
    .Match(Character.EqualTo(')'), Tokens.RParen)
    .Match(Character.EqualTo('#'), Tokens.Hash)
    .Match(Character.EqualTo('$'), Tokens.Dollar)
    .Match(Character.EqualTo('.'), Tokens.Dot)
    .Match(Character.EqualTo('|'), Tokens.Pipe)
    .Match(Character.EqualTo(' '), Tokens.Whitespace)
    .Match(Span.MatchedBy(Character.AnyChar), Tokens.String)
    .Match(Numerics.Natural, Tokens.Number)
    .Build();

Next, create model classes to hold the output (you can probably think of better names for these, as I'm not sure really what this data is that you are parsing):

public abstract class Node
{
}

public class TextNode : Node
{
    public string Value { get; set; }
}

public class Expression : Node
{
    public Node[] Nodes { get; set; }
}

Then we create the parsers:

public static class MyParsers
{
    /// <summary>
    /// Parses any whitespace (if any) and returns a resulting string
    /// </summary>
    public readonly static TokenListParser<Tokens, string> OptionalWhitespace =
        from chars in Token.EqualTo(Tokens.Whitespace).Many().OptionalOrDefault()
        select chars == null ? "" : new string(' ', chars.Length);

    /// <summary>
    /// Parses a valid text expression
    /// e.g. "abc", "a.c()", "$c", etc.
    /// </summary>
    public readonly static TokenListParser<Tokens, Node> TextExpression =
        from tokens in
            Token.EqualTo(Tokens.OpenCloseParen)
            .Or(Token.EqualTo(Tokens.Hash))
            .Or(Token.EqualTo(Tokens.Dollar))
            .Or(Token.EqualTo(Tokens.Dot))
            .Or(Token.EqualTo(Tokens.Number))
            .Or(Token.EqualTo(Tokens.String))
            .Or(Token.EqualTo(Tokens.Whitespace))
            .Many()
        // if this side of the pipe is all whitespace, return null
        select (Node) (
            tokens.All(x => x.ToStringValue() == " ") 
            ? null
            : new TextNode {
                Value = string.Join("", tokens.Select(t => t.ToStringValue())).Trim()
            }
        );

    /// <summary>
    /// Parses a full expression that may contain text expressions or nested sub-expressions
    /// e.g. "(a | b)", "( (a.c() | b) | (123 | c) )", etc.
    /// </summary>
    public readonly static TokenListParser<Tokens, Node> Expression =
        from leadWs in OptionalWhitespace
        from lp in Token.EqualTo(Tokens.LParen)
        from nodes in TextExpression
            .Or(Parse.Ref(() => Expression))
            .ManyDelimitedBy(Token.EqualTo(Tokens.Pipe))
            .OptionalOrDefault()
        from rp in Token.EqualTo(Tokens.RParen)
        from trailWs in OptionalWhitespace
        where nodes.Length > 1 && nodes.Any(node => node != null) // has to have at least two sides and one has to be non-null
        select (Node)new Expression {
            Nodes = nodes.Select(node => node ?? new TextNode { Value = "" }).ToArray()
        };
}

And finally we can use the tokenizer along with the parsers to parse your input:

string input = "(a b | c.())";
var tokens = tokenizer.Tokenize(input);

var result = MyParsers.Expression.TryParse(tokens);
if (result.HasValue)
{
    // input is valid
    var expression = (Expression)result.Value;

    // do what you need with it here, i.e. loop through the nodes, output the text, etc.
}
else
{
    // not valid
}

This works for pretty much all of your test cases except the ones like this (()|()) where the open/close paren is the value on either side of a pipe. There is also probably a better way to do some of the parsing, as I'm just getting used to Superpower myself, but I think this is a good base to start with so you can optimize it and/or integrate all your edge cases into it.

EDIT

It was the whitespace that was messing everything up. I had to add more whitespace checks within the Expression parser, and also had to add a condition to check for a non-empty TextExpression and then also check for one that may be empty. This was to handle cases where one side of the pipe is blank. Here is the working parser:

public readonly static TokenListParser<Tokens, Node> Expression =
    from _1 in OptionalWhitespace
    from lp in Token.EqualTo(Tokens.LParen)
    from _2 in OptionalWhitespace
    from nodes in 
        TextExpression.Where(node => node != null) // check for actual text node first
        .Or(Expression)
        .Or(TextExpression) // then check to see if it's empty
        .ManyDelimitedBy(Token.EqualTo(Tokens.Pipe))
    from _3 in OptionalWhitespace
    from rp in Token.EqualTo(Tokens.RParen)
    from _4 in OptionalWhitespace
    where nodes.Length > 1 && nodes.Any(node => node != null) // has to have at least two sides and one has to be non-null
    select (Node)new Expression {
        Nodes = nodes.Select(node => node ?? new TextNode { Value = "" }).ToArray()
    };

Thanks for this @jtate. It works for some of the inputs, but doesn't seem to parse nested parens, for example `((a | b) | c)`. You can find the full class [here](https://gist.github.com/dhowe/8dfc420fb8b59e613226b57c06627a0a) — rednoyz, Oct 18 '18 at 11:30
It definitely works for me with nested inputs, as that was one of the main things I tested. When you say it doesn't parse them, do you mean that `TryParse` fails or are you saying the output isn't correct? — jtate, Oct 18 '18 at 12:47
Odd, it's not working for me now either. I'll take a look at it and see. Although I don't have access to my source right now, I'll use the one you provided. Maybe something was lost in translation when I was creating my answer. — jtate, Oct 18 '18 at 12:54
This is great, thanks much (accepted/upvoted). If you have a moment, check [this one](https://stackoverflow.com/questions/52706549/superpower-match-a-string-with-tokenizer-or-parser-only-if-it-begins-a-line) as well... — rednoyz, Oct 22 '18 at 15:12

Superpower parser for balanced nested parentheses

1 Answers1

EDIT