1

I'm writing my own programming language and have arrived at parsing tuples/lambda expressions after having gotten most of the other stuff out the way. The syntax is very similar to C# with some little differences. I was wondering does C# look-ahead to determine whether or not to produce a simple parenthesised expression, a tuple or a lambda, seems unlikely because the look-ahead (k) is indeterminate.

Right now I'm kind of hacking the solution together by peeking ahead for certain indicators that should give the parser a clue as to what it's trying to parse. i.e. check for an identifier @1 then check for an '=', ',' or ':' (colon is used to annotate identifiers with type information) at @2.

My lambda syntax is a little more flexible than what is available in C#, so guess in it you could just parse the list of identifiers up until there are no more commas and check the next 2 tokens for ')' and '=>'.

Is there some sort of trick I'm missing?

rtlayzell
  • 608
  • 4
  • 20
  • 1
    Basically, custom lookahead. Take a look at [this](https://github.com/dotnet/roslyn/pull/4024/commits/155c69098efddc4249800cd0641055bb8508cd32), for example. – Jeroen Mostert Jun 24 '19 at 19:47
  • Yeah, seems like the problems I'm having :p thanks, good find :) – rtlayzell Jun 24 '19 at 19:53
  • 1
    There's an LALR(1) grammar for what I believe is similar to C# lambda syntax in [this answer](https://stackoverflow.com/questions/16928725/is-cs-lambda-expression-grammar-lalr1/16931641#16931641). I'm not marking that as a duplicate because it doesn't handle tuples and because you say that your desired syntax is different (without saying what it is), and because you don't make it clear what sort of parser you want to build. But maybe the answer is useful anyway. – rici Jun 24 '19 at 19:56
  • The C# [Roslyn compiler](https://github.com/dotnet/roslyn) is open source. – NetMage Jun 24 '19 at 20:13
  • @rici I don't believe it can be done with LALR(1) grammar, you must know what comes after the identifier to determine whether or not it is a plain ol' expression, lambda parameter list, etc.. so at best probably LALR(2). – rtlayzell Jun 28 '19 at 13:12
  • @rtlayzell: if there's an lalr(2) grammar for a language, then an lalr(1) grammar also exists. That might be counter-intuitive but the mechanical construction of the lalr(1) grammar is not complicated. (It's very bulky, though, so it's often better to do it by hand.) The linked answer shows an example for what might be a similar syntax. Without knowing more details about your proposed syntax, I can't offer a more precise suggestion. – rici Jun 28 '19 at 14:23
  • I guess the difference in your language is the syntax for tuples, which might look like a list of lambda parameters. That's not an intrinsic problem, though, and it can be solved in a similar fashion to the linked answer. You need an intermediate non-terminal which is a list of IDs, and you avoid committing to tuple or lambda until you see what follows. That might seem like hand-waving but as I said it's hard to be precise without details. – rici Jun 28 '19 at 14:48
  • @rici that's pretty much what I'm doing, essentially I'm building an intermediate list of "binding expressions" and then depending on the token after the ")" I'm building a tuple or lambda, if it's a lambda I visit the binding expression list and converting "typed_binding"'s and "named_binding"'s to parameters to construct a lambda expression. I'm not able to demonstrate this as a BNF but here are a couple examples of what tuples and lambdas may look like (x, y) (x = 10, y = 20) (x: int, y: int) (x: int = 10, y: int = 20) only the first and third are valid for lambda's and tuples. – rtlayzell Jun 28 '19 at 17:25
  • So, I guess you are trying to build a predictive parser by hand, which means that the fact that there is an KAKR(1) grammar is more or less irrelevant. IMHO, any hand-built predictive parser is bound to be a bit hacky somewhere, which is likely why you cannot extract the BNF from it in order to see what language you are parsing. If I'm right about my assumption, then I don't know exactly what you are looking for here. If, on the other hand, you want help with creating a grammar suitable for automatic codegen, a description (in the Q) would be useful. – rici Jun 28 '19 at 18:46
  • Actually, my question has already been answered for the most part, by yourself and others in the thread. I don't necessarily need a grammar suitable for codegen as you say, just trying to either reaffirm my own approach or find a better one. Two approaches have come up so far, the intermediate non-terminal you mentioned and back-tracking, while back-tracking seems like the easier way to parse the expressions I cannot figure out a way to produce useful syntax errors that way, so I'm going with the intermediate approach. I've left the question open to get more answers before accepting any. – rtlayzell Jun 28 '19 at 20:08
  • @rtlayzell: the third approach is to use a GLR/GLL parser, which effectively uses a clever data structure to pursue different possible parses in parallel. In order to do so, it must avoid executing semantic actions with side-effects (if there are any) until the algorithm resolves to a single parse (if it does; GLR also works with ambiguous grammars). These days many parser generators implement one of these algorithms, and it's my favorite solution in most cases because it doesn't require obscure modifications to the grammar. I might add an answer later in the day – rici Jun 29 '19 at 15:20

1 Answers1

1

Yes you need unlimited lookahead, since in some contexts (a, b, c, ...) could be either a tuple or a list of lambda parameters depending on what token follows. There is another ambiguity since (a.b.c.d... could be either the type expression in a cast or a regular parenthesized expression.

You can see how C# solves this in the parser source code: http://sourceroslyn.io/#Microsoft.CodeAnalysis.CSharp/Parser/LanguageParser.cs,bab23e84fb31bf9e,references

Basically it saves the current position and try to parse one of the possible productions. If the parse fails, it goes back to the saved position (a.k.a backtrack) and then tries with the next possible production. If all possible productions fails it reports a syntax error.

It seems to run the potential parses in two passes. In the first pass it only checks if a parse is possible, and returns a boolean. If this test succeeds, it runs a second pass where it actually constructs the parse tree.

In theory unlimited look-ahead will kill parser performance, but in practice it probably isn't a big issue since it only happens in certain specific contexts and tuples and lambda parameter lists typically are of limited size.

Language designers tend to avoid syntax requiring look-ahead due to the added complexity it imposes on the parser and the performance implications. But the designers of C# probably decided the cost was worth it for adding the nice tuple syntax.

Kirill Osenkov
  • 8,786
  • 2
  • 33
  • 37
JacquesB
  • 41,662
  • 13
  • 71
  • 86
  • That's interesting, I had thought of saving the parse state and back-tracking after a failed parse though it seems like it would be difficult to gauge the users intention and produce the correct syntax error. e.g. if parsing a lambda only fails because the body of the lambda has an error then we should push that error. But if the parsing the lambda fails because the expected '=>' token is missing then we should not push the error(?). – rtlayzell Jun 28 '19 at 12:57
  • Also, my language does not use parenthesised exressions as casts, I have a clearer /unambiguous syntax for that. – rtlayzell Jun 28 '19 at 12:59