3

I have encounter a problem with parsers having two branches of recursion. To demonstrate the problem easier, I use a simple grammar of a lambda calculus from the article written by Luca Bolognese as the example:

<expression> ::= <name> | <function> | <application>  
<name> ::= non­blank character sequence  
<function> ::= \ <name> . <body>  
<body> ::= <expression>  
<application> ::= ( <function expression> <argument expression> )  
<function expression> ::= <expression>  
<argument expression> ::= <expression>

The parser in the article is quite concise:

let ws = " \t\n" 
let specialChars = ".)(\\\n" 

let pWs = spaces 
let pName = manyChars (noneOf (ws + specialChars)) |>> EName 

let pExpr, pExprRef = createParserForwardedToRef<Expression, Unit>() 

let curry2 f a b = f(a,b) 
let pFunction = pchar '\\' >>. pipe2 pName (pchar '.' >>. pExpr) (curry2 Function) 

let pApplication = pchar '(' >>. pipe2 pExpr (pWs >>. pExpr) (curry2 Application)
                            .>> pWs .>> pchar ')'

do pExprRef := pFunction <|> pApplication <|> pName 

let pExpressions = sepBy pExpr spaces1 

let fparseString text = 
    match run pExpressions text with 
    | Success(result, _, _)   -> result 
    | Failure(errorMsg, _, _) -> failwith (sprintf "Failure: %s" errorMsg) 

I'm interested in pApplication since they consist of two pExprs which in turn could be pApplications too. The parser runs out of stack space in the following benchmark:

let generateString level =
    let rec loop i =
        seq {
                if i < level then
                    yield "("
                    yield! loop level
                    yield " "
                    yield! loop (i+1)
                    yield ")"
                else 
                    yield "(x x)"
        }
    loop 0 |> String.concat ""

let N = 5000
let s = generateString N;; 
let _ = fparseString s;;

How can I rewrite the parser to be tail-recursive?

I recognized the problem when trying to write a parser for a Lisp-like language and test it with real benchmarks. I have Term and VarBinding which are mutually recursive types and a let parser which exhibits the same issue as pApplication above. My parser is on github in case the analysis is wrong regarding the not tail-recursive problem.

pad
  • 41,040
  • 7
  • 92
  • 166

1 Answers1

6

The built-in parser combinators of FParsec generally don't allow for a tail-recursive parser implementation, mainly because of the way the error handling is implemented.

This means that the recursion depth of an FParsec parser is limited by the stack size -- as is the case for most recursive descent parsers. Of course, this doesn't affect the parsing of sequences with many, sepBy, chainl etc., because these FParsec combinators all have implementations with constant stack space usage.

The default stack size in .NET is usually more than enough for parsing human-generated input in well-specified formats with FParsec (assuming you parse sequences with the built-in combinators). However, machine-generated input with a deeply nested structure (like your 5000 level deep s-expressions) can easily lead to a stack overflow.

If you know in advance that the nesting depth in your input is limited to a reasonable value, you could simply increase the stack size to an appropriate value. Hopefully this is true for your benchmark data.

Otherwise you may have to implement a special purpose parser function for the recursive element(s) of your grammar. You would need to implement this function using the low-level API of FParsec and you would obviously have to implement this function such that it uses the heap instead of the stack for keeping track of the nesting context and intermediate parsing results.

Community
  • 1
  • 1
Stephan Tolksdorf
  • 3,062
  • 21
  • 28
  • It's unfortunate that benchmarks I use are all machine-generated. Since the language intendeds to be machine friendly and easy to parse, I cannot predict the nesting depth. Can you shed some light how to implement heap-based parsers using low-level API? – pad Feb 12 '12 at 22:43
  • Benchmarks don't have to reflect real world usage and might just be artificially constructed to stress-test the parser. I do have the suspicion that nesting-depths of a few thousands levels are very rare in practice (I'd be very interested in hearing about any real world input that contains such nesting). Before rewriting my parser, I'd make sure that just setting the stack to 10 or even 50 MB really isn't enough. – Stephan Tolksdorf Feb 12 '12 at 23:21
  • Having said that, if you really want to rewrite your parser, just think about how you would implement the parser if you'd manually parse directly from a string. The low-level FParsec parser will look very similar, only with the string access replaced with `CharStream` method invocations and the final result or error returned in a `Reply` value. For the non-(deeply-)recursive parts of the grammar, you could of course still use normal FParsec parsers. Maybe I'll later have some time to implement an example. – Stephan Tolksdorf Feb 12 '12 at 23:22
  • +1, I will try with bigger stack space. A small example demonstrating tail-recursion would be really helpful. – pad Feb 12 '12 at 23:31
  • Good news, when I increase stack limit to 50MB, I'm able to parse successfully 40000 benchmarks some of which have quite deep nestings. Thank you for the great library. I'm still interested in tail-recursion example though. – pad Feb 14 '12 at 17:13