1

I've just started writing a simple parser in C# that you can find here: https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L1078

The main feature being that I wanted to be able to write a JSON configuration for some grammar that I want to analyze, load up the configuration at runtime and evaluate some string input via the lexer/parser and generate a tree with information about each node for things like syntax highlighting and extracting variables from sections of a parsed text, rather than using a third party tool like a yacc/bison clone to create code to be compiled from a grammar, for some languages like SQL/CSS/HTML.

The simple question is, what are some other good methods of detecting recursion problems early from malformed input before a StackOverflowException? The details are below.

I don't have grammars fully built out yet for various languages, just simple tests of some basic grammar so far, but I ran across a problem when trying to implement EBNF like: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form when having to deal with how the right-hand side of grammar statements are evaluated like so:

rhs = identifier
     | terminal
     | "[" , rhs , "]"
     | "{" , rhs , "}"
     | "(" , rhs , ")"
     | rhs , "|" , rhs
     | rhs , "," , rhs ;

So in my JSON configuration it is set up like the following: https://github.com/JohnnyErnest/LexerParser/blob/main/Configuration/ParserEBNF.json#L178 accounting also for white-spaces.

"rhsBlock1": {
  "sequence": [
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "bracketOpen" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "bracketClose" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    }
  ]
},
"rhsBlock2": {
  "sequence": [
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "braceOpen" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "braceClose" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    }
  ]
},
"rhsBlock3": {
  "sequence": [
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "parenthesisOpen" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "parenthesisClose" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    }
  ]
},
"rhsBlockPipe": {
  "sequence": [
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "pipe" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    }
  ]
},
"rhsBlockComma": {
  "sequence": [
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "tokenList": "comma" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    },
    { "sequenceList": "rhs" },
    {
      "tokenList": "whitespaces",
      "isOptional": "true"
    }
  ]
},
"rhs": {
  "sequence": [
    { "sequenceList": "identifier,ebnfTerminal,rhsBlock1,rhsBlock2,rhsBlock3,rhsBlockPipe,rhsBlockComma" }
  ]
},

Given that definition, the following works and evaluates true:

        string inputEBNF = "a=\"5\";";
        var ebnfParserResult = ebnfParser.Parse(inputEBNF, "grammar");

However, if you changed the "5" to a literal though, which is invalid EBNF, you won't get an error, you'll get infinite recursion.

        string inputEBNF = "a=5;";
        var ebnfParserResult = ebnfParser.Parse(inputEBNF, "grammar");
        var exit = ebnfParser.CancelAllParsing;

The value of exit will be true, because the parser will go up to MaxLevel in recursion and I shut it off at that point, otherwise it would keep on going until a StackOverflowException.

It's when earlier blocks in the JSON config like rhsBlockComma call later blocks rhs that the problem starts.

In my code you'll see that Parser.Parse calls a method called Check: https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L1017

Check will then call CheckSequence: https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L954 which will check each section of a sequence, calling CheckSequenceSection https://github.com/JohnnyErnest/LexerParser/blob/main/Program.cs#L843

CheckSequenceSection looks through each lexer token in a JSON section's tokenList first, and then looks through each sequence in the section's sequenceList, and call CheckSequence on each sequence, which does a single recursion.

Normally that works fine if the input is valid. If not, a variable level is tracked and if it goes beyond MaxLevel then aborting occurs. I also needed to add a preemptive return before CheckSequence from CheckSequenceSection or else it would just keep doing recursions until a StackOverflowException.

What are some other methods to early detect recursion problems from malformed input?

John Ernest
  • 785
  • 1
  • 8
  • 20
  • 2
    **Any time** you are at risk from stack overflow, you should transform to iterative with a deterministic fixed size queue – Charlieface Jan 13 '21 at 03:09
  • @Charlieface see answer. I had a problem with trying to make the methods iterative and it has to do with a FIRST/FIRST conflict. – John Ernest Jan 13 '21 at 07:43

1 Answers1

0

Answer is explained better here: https://stackoverflow.com/a/20179940/5556724

The version of EBNF on Wikipedia is written like so:

rhs = identifier
     | terminal
     | "[" , rhs , "]"
     | "{" , rhs , "}"
     | "(" , rhs , ")"
     | rhs , "|" , rhs
     | rhs , "," , rhs ;

but should be written like so:

primary = identifier
        | terminal
        | "[" , rhs , "]"
        | "{" , rhs , "}"
        | "(" , rhs , ")"
        ;
factor  = primary , { "|" , primary } ;
rhs     = factor ,  { "," , factor } ;

otherwise there is a FIRST/FIRST conflict.

John Ernest
  • 785
  • 1
  • 8
  • 20