
(I am spending the holiday time on some language theory. Excuse me if this is a naive question.)

According to Wikipedia:

> LL grammars, particularly LL(1) grammars, are of great practical interest, as parsers for these grammars are easy to construct, and many computer languages are designed to be LL(1) for this reason.

So, out of curiosity, which contemporary computer languages are LL(1)? Do C, Java, C#, or Python fall into this category?

smwikipedia
  • I did read a very interesting article on exactly this topic [and using grammars and regex], and had thought I'd bookmarked it, but unfortunately it seems not. I seem to think Perl is up there. (sorry for this all-round unhelpful comment) – Martin Jan 01 '17 at 12:00
  • @Martin Perl is definitely not LL(1). Parsing Perl is undecidable. – sepp2k Jan 01 '17 at 12:11
  • It's the inverse then; Perl is at the bottom (LL4?). Interestingly, it's the only language written by a linguist, so it's meant to be much easier to write as a human than to read as a machine (comparatively speaking). – Martin Jan 01 '17 at 12:21
  • A better place to ask this question would be [Lambda the Ultimate](http://lambda-the-ultimate.org/faq). See the related question [Good languages with simple grammar](http://lambda-the-ultimate.org/node/984). – Guy Coder Jan 01 '17 at 14:54
  • @martin: the `(k)` in `LL(k)` refers to the number of lookahead tokens you might need to examine in order to decide what to do with the current token. Many languages are not `LL` at all, and in practice it rarely helps to increase the value of `k`, although you can increase the power of `LL` parsing if you allow `k` to be unbounded (see ANTLR, for example). In that case, the parser is no longer linear time and you might want a more powerful algorithm, such as LR. – rici Jan 01 '17 at 15:42
  • Ahh, I completely misunderstood. I was reading previously about the grammar complexity of computer languages, and they were on a scale of 1-4 and I think referenced by `LL`; this was the interesting link I couldn't find. Sorry for the confusion. – Martin Jan 01 '17 at 15:54

2 Answers


I think I'd be tempted to flag that Wikipedia quote with [citation needed]; the assumptions are, at least, questionable. For example, there are a large number of compiler-construction tools based on yacc which make it extremely easy, in practice, to construct a parser using the more powerful (and equally fast) LALR algorithm, and some also implement the even more powerful (and almost as fast, for most practical grammars) GLR algorithm. Hand-writing parsers has not been necessary for decades. [Note 1]

By way of attempting an answer to the question:

  1. Most computer languages are "technically" not LL because they are not even context-free. That would include languages which require identifiers to be declared (C, C++, C#, Java, etc.), as well as languages with preprocessors and/or macro facilities (C and C++, among others), languages with ambiguities which can only be resolved using semantic information (Perl would be the worst offender here, but C and C++ are also right up there). And, just to spread the joy around some more, it also includes languages which have Python-like layout awareness (Python, of course, and also Haskell).

    I put scare quotes around "technically" because there are a lot of people who will say that these exceptions "don't count". That is their opinion, and they are entitled to it (and anyway I've given up arguing about it, although I don't share that opinion). You could, for example, eliminate the preprocessor from C/C++ by preprocessing the text before applying the grammar, or preprocess whitespace-aware languages by inserting INDENT, NEWLINE and DEDENT tokens instead of the whitespace, after which you could make some kind of claim about a mystical "core language". (That is much harder to apply to C++ templates, which can only be eliminated by parsing the text, so you can't claim that the expansion can be done prior to parsing.)
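    By way of illustration, here is a minimal sketch (in Python; the token names are hypothetical) of the standard stack-based technique for replacing significant whitespace with INDENT/DEDENT tokens before the grammar proper is applied:

    def layout_tokens(lines):
        """Yield layout pseudo-tokens using a stack of open indentation
        widths (a sketch; real tokenizers also reject inconsistent
        dedents)."""
        stack = [0]                          # currently open indent levels
        for line in lines:
            if not line.strip():
                continue                     # blank lines carry no layout info
            width = len(line) - len(line.lstrip(' '))
            if width > stack[-1]:
                stack.append(width)
                yield ('INDENT', width)
            while width < stack[-1]:
                stack.pop()
                yield ('DEDENT', width)
            yield ('LINE', line.strip())
            yield ('NEWLINE', None)
        while len(stack) > 1:                # close blocks still open at EOF
            stack.pop()
            yield ('DEDENT', 0)

    After this transformation the indentation is just another token, and the "core language" can be described without reference to whitespace.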

    The claim that a language is not context-free because it requires identifiers to be declared is perhaps a bit more controversial. In some languages (not C/C++, in which the semantic analysis is required to avoid ambiguity), you could argue that a parse tree could be constructed without validating the declaration rule, which makes that rule "not syntactic". But it remains the case that you can enforce the declaration rule syntactically; you just cannot do it with a context-free grammar (you could use a Van Wijngaarden grammar, for example).

    In any case, a common parsing strategy is to recognize a superset of a language, and then reject non-conforming programs through a subsequent analysis of the resulting parse tree; in that case, the question becomes whether or not the superset is LL. In my opinion, in order for that to be meaningful, it is necessary that every valid program can be parsed to the correct syntax tree, which eliminates the usage of semantic analysis to disambiguate.

  2. The goal of parsing is to produce a syntax tree, not merely to decide whether a text is valid. (You might miss this important fact if you skim formal language textbooks, which tend to focus on recognition, possibly because the construction of syntax trees is less interesting.) So it seems reasonable to insist that a proposed grammar actually represent the syntactic structure of the language.

    For example, you can recognize algebraic expressions using a simple regular language:

    (PREFIX* OPERAND POSTFIX*) (INFIX PREFIX* OPERAND POSTFIX*)*
    

    where PREFIX is the set of prefix operators as well as (, POSTFIX is the set of postfix operators (if any) as well as ), INFIX is the set of infix operators (addition, subtraction, multiplication, etc.), and OPERAND is an identifier or constant.

    That regular expression will not correctly reject expressions with unbalanced parentheses, but we already agreed that it was OK to recognize a superset of the language, right? :-)
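    Just to make that concrete, here is the same superset recognizer written with Python's re module (the token classes below are hypothetical stand-ins, and whitespace handling is omitted):

    import re

    # Hypothetical token classes for a toy expression language.
    PREFIX  = r"(?:[-+~]|\()"            # prefix operators, plus '('
    POSTFIX = r"(?:!|\))"                # postfix operators, plus ')'
    INFIX   = r"[-+*/]"                  # infix operators
    OPERAND = r"(?:[A-Za-z_]\w*|\d+)"    # identifier or integer constant

    UNIT = f"{PREFIX}*{OPERAND}{POSTFIX}*"
    EXPR = re.compile(f"{UNIT}(?:{INFIX}{UNIT})*\\Z")

    def looks_like_expression(text):
        """Recognize the regular superset; parenthesis balance is
        deliberately not checked."""
        return EXPR.match(text) is not None

    print(looks_like_expression("a+b*c"))    # True
    print(looks_like_expression("-(a)*b!"))  # True
    print(looks_like_expression("(a"))       # True: unbalanced, but accepted
    print(looks_like_expression("a++"))      # False: '+' needs an operand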

    If desired, we could remove the parentheses from the PREFIX and POSTFIX sets and make OPERAND a recursive production. The resulting grammar is trivially LL(1) [Note 2]:

    operand    => IDENTIFIER | CONSTANT | '(' expression ')'
    term       => operand | PREFIX operand | operand POSTFIX
    expression => term | term INFIX expression
    

    The problem, however, is that this grammar does not capture operator precedence. It does not even attempt to recognize the fact that a+b*c and a*b+c are both sums, so that the top-level operator is + in both cases, and not whatever operator happens to come first in the expression. (If you were parsing APL, this wouldn't be an issue. But most languages conform to the usual expectations about operator precedence.)
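    To see the failure concretely, here is a minimal recursive-descent rendering of the grammar above in Python, assuming single-character tokens (the shared operand prefix in term is resolved with one token of lookahead, which is the left-factoring mentioned in Note 2):

    # Hypothetical single-character operator sets.
    PREFIX, POSTFIX, INFIX = set("~"), set("!"), set("+-*/")

    def parse(text):
        toks = list(text) + ['$']           # '$' marks end of input
        pos = 0

        def peek():
            return toks[pos]

        def eat(expected=None):
            nonlocal pos
            tok = toks[pos]
            if expected is not None and tok != expected:
                raise SyntaxError(f"expected {expected!r}, got {tok!r}")
            pos += 1
            return tok

        def operand():                      # operand => ID | CONST | '(' expression ')'
            if peek() == '(':
                eat('(')
                node = expression()
                eat(')')
                return node
            return eat()

        def term():                         # term => operand | PREFIX operand | operand POSTFIX
            if peek() in PREFIX:
                return (eat(), operand())
            node = operand()
            if peek() in POSTFIX:
                return (eat(), node)
            return node

        def expression():                   # expression => term | term INFIX expression
            node = term()
            if peek() in INFIX:
                return (eat(), node, expression())
            return node

        tree = expression()
        eat('$')
        return tree

    print(parse("a*b+c"))   # ('*', 'a', ('+', 'b', 'c')) -- precedence is lost
    print(parse("a-b-c"))   # ('-', 'a', ('-', 'b', 'c')) -- right-associated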

    This point is important because an LL grammar cannot handle left-recursive productions, and you need a left-recursive production in order to correctly parse a left-associative operator. (That is, to correctly parse a-b-c as ((a-b)-c) rather than (a-(b-c)), which would have a different value.) Again, you could argue that this is a quibble, since you could post-process the parse tree in order to correct the associativity. This is certainly true, but the result is that the grammar you use to parse is different from the grammar you use to explain the syntax of the language. This might or might not bother you.
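    In a hand-written parser, the usual workaround is the EBNF repetition expression => term (INFIX term)*, implemented as a loop that folds the tree to the left (a sketch, reusing peek, eat and term from the previous example):

    def expression():
        node = term()
        while peek() in INFIX:              # iterate instead of recursing
            op = eat()
            node = (op, node, term())       # left fold: a-b-c => ((a-b)-c)
        return node

    The loop buys left associativity, but it is exactly the sort of gap between the grammar-as-written and the parser-as-implemented described above.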

    It's worth adding here that there are languages (Haskell, Prolog, probably many more) in which you can define operators and their precedence in the program text. That obviously makes it impossible to generate a correct syntax tree without semantic analysis, and the usual approach to parsing such languages is to use precisely the simplified "alternating operand and operator" language outlined above. Once the operator precedences are all known, presumably at the end of the parse, all expressions are then reanalyzed using something like the Shunting Yard Algorithm, in order to produce the correct parse.
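    A compact sketch of that reanalysis step, operating on the flat alternating operand/operator list and assuming the precedence table is known by the time it runs (the table below is hypothetical):

    PREC = {'+': 1, '-': 1, '*': 2, '/': 2}     # hypothetical precedences

    def reanalyze(flat):
        """Shunting-yard over a flat list such as ['a', '+', 'b', '*', 'c'].
        Binary, left-associative operators only."""
        operands, operators = [], []

        def reduce_top():
            op = operators.pop()
            right = operands.pop()
            left = operands.pop()
            operands.append((op, left, right))

        for i, tok in enumerate(flat):
            if i % 2 == 0:
                operands.append(tok)            # operand position
            else:
                while operators and PREC[operators[-1]] >= PREC[tok]:
                    reduce_top()                # '>=' gives left associativity
                operators.append(tok)
        while operators:
            reduce_top()
        return operands[0]

    print(reanalyze(['a', '*', 'b', '+', 'c']))  # ('+', ('*', 'a', 'b'), 'c')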

  3. Let's suppose we discard the above objections, and just ask "for which commonly used programming languages might an LL parser be used?"

    In order to avoid cheating, though, we should really require that the LL parser have fixed lookahead and not require backtracking. If you allow arbitrary lookahead and backtracking, then you considerably expand the domain of parseable languages, but I contend that you are no longer in the realm of LL. That will eliminate, for example, the template- and preprocessor-free subset of C++, even though the common compiler implementations use recursive descent parsers, since backtracking is required in order to resolve the "Most Vexing Parse" ambiguity. (Anyway, you can't really remove templates from C++, and parsing with templates is really hard. [Note 3])

    Java and Python were both designed to be LL(1) "pseudo-parseable". I believe Haskell falls into this category as well. C cannot be correctly parsed without semantic information (classic ambiguity: is (x)*(y) a cast or a multiplication? -- it depends on whether x has been typedef'ed or declared as a variable) but I'm pretty sure it can be captured with a non-backtracking recursive-descent parser. I haven't looked at C# in depth, but this answer by Eric Lippert suggests that it requires backtracking in some cases.
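    As an aside on the C ambiguity, the standard engineering trick is the so-called "lexer hack": the parser records typedef names as it sees their declarations, and the lexer consults that table to classify each identifier. A toy illustration of the feedback loop (heavily simplified):

    # Toy sketch of the C "lexer hack": (x)*(y) is classified differently
    # depending on whether x has previously been recorded as a typedef name.
    typedef_names = set()

    def declare_typedef(name):
        typedef_names.add(name)        # the parser feeds semantic info back

    def classify_paren_star(x):
        if x in typedef_names:
            return "cast: (x)(*(y)) -- x is a type, '*' is a unary dereference"
        return "multiplication: (x) * (y) -- x is a variable"

    print(classify_paren_star('x'))    # multiplication
    declare_typedef('x')               # after: typedef int x;
    print(classify_paren_star('x'))    # cast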

Notes

  1. Of course, people still do it, and in many cases for good reasons (producing better error messages, for example). But "it's difficult to construct an LALR parser" is not a good reason, since it's not.

  2. That's not quite LL, because I didn't left-factor. But it's easy enough to do; I'll leave it as an exercise.

  3. See *Is C++ context-free or context-sensitive?* Also Todd Veldhuizen's classic *C++ Templates are Turing Complete*.

rici
  • "The claim that a language is not context-free because it requires identifiers to be declared is perhaps a bit more controversial" -- Why would something which is trivially true be controversial? – John Coleman Jan 01 '17 at 19:53
  • @JohnColeman: I'll refrain from the obvious political analogy and leave you to read the comment threads in my posts on the subject. I was also surprised that it was controversial. – rici Jan 01 '17 at 19:54
  • To me it is sort of like the fact that physical computers are not universal Turing machines because they have finite memory. Trivially true, but perfectly consistent with the fact that Turing machines provide a good way of thinking about physical computers. Similarly *real life* programming languages are almost never context-free, but this trivial truth is perfectly consistent with the fact that context-free grammars provide a good way to think about many such languages. – John Coleman Jan 01 '17 at 19:59
  • @johncoleman: sure, and a cfg parser is *part* of any c++ parser. But it's not the whole story, by a long shot. (Python, which does not require declarations, really is CF, modulo the layout sensitivity. But a close examination of the grammar used to parse, compared with the grammar used to explain, shows some of the limitations of LL parsing. Perhaps I should add a reference for that.) – rici Jan 01 '17 at 21:39
  • Your "technically" seems based on the failure to distinguish "the language" from "the syntax of the language". "Time flies like a banana" is a syntactically valid sentence that makes little sense semantically. That the syntax of a language is LL(1) doesn't require that all syntactically valid program texts are also valid programs. Also, the fact that some symbols of a language have a special lexical representation (indents in Python) does not cause the language to not be LL(1). You mention VW-grammars. VW-grammars distinguish the arbitrary representation and grammatical terminal symbols. – LHP Dec 24 '19 at 10:00
  • Further, VW-grammars are rather special, in that they are Turing-equivalent languages, and as such can express the full semantics of a programming language (as demonstrated by Cleaveland and Uzgalis in their book "Grammars for Programming Languages"). – LHP Dec 24 '19 at 10:08
  • @LHP That's what I wondered as well. I'll try to add a bit more explanation. Let's take a really simple language, involving defining integer variables and printing them, where each instruction ends with a ";", like { var myvar = 10; print myvar2; }. Since any instruction starts with either "var" or "print", it is LL(1): each type of expression can be deduced from its first token. When building the AST, we don't know if the var is declared when printing it: PROG[DCL["myvar", 10], PRINT["myvar2"]]. Finally, when walking the AST to build the program, we check for the existence of variables and return an error. – Felix Bertoni Aug 21 '20 at 13:48
  • This seems like an extended comment rather than an answer. You're correct that languages like C and C++ are context sensitive because they require semantic information to resolve syntactic ambiguities. But it's a reach to claim that all languages that require identifiers to be declared are context sensitive. For example, Pascal requires declarations, but that's a semantic requirement--those declarations are not necessary to resolve syntactic ambiguities nor to build a parse tree. – Adrian McCarthy Jun 29 '21 at 23:49
  • @adrian: I could add Pascal to the list in point 3, if it's parseable up to declaration checking with an LL grammar (I've only done it with an LR parser, but it's probably in the list). But I stand by my claim that it's not LL in a formal language sense, because inputs with undeclared variables are not well-formed. You can construct a parse tree for the pseudo-English "*All mimsy was the borogoves" but I think I'm not the only one who would say that's a grammatical and not a semantic error, without knowing what a borogove is or how to recognise its mimsiness... – rici Jun 30 '21 at 02:22
  • In part, that's the point of the rhyme, and is also what Chomsky was trying to express with the grammatically correct utterance about colorless green ideas. In the same way, I don't actually need to know anything about the semantics of a Pascal program to observe that an identifier is not in scope. To my mind that's a lot more similar to subject-verb disagreement (syntactic) than semantic violations like furious sleep. As I said, I accept that not everyone will approach syntax from this perspective but it's definitely the way I learned the terms as a mathematical linguist, and I think... – rici Jun 30 '21 at 02:28
  • it's a valid answer. But I did write point 3 with the objection you raise in mind. Nothing in the answer implies that LL or LR parsers aren't useful. Just that they cannot capture the entirety of "well-formed" (which is not the same as "could blow up at runtime with the wrong input"). – rici Jun 30 '21 at 02:30
  • Fwiw, I think the criticism of LL in point 2 should be sufficient independent of grammatical agreement. – rici Jun 30 '21 at 02:40
  • @rici: Is it even possible for a finite generative grammar alone to "capture the entirety of 'well-formed'" for a Turing complete programming language? The C++ Standard, for example, defines a well-formed program as a "C++ program constructed according to the syntax rules, diagnosable semantic rules, and the one-definition rule." Is it even possible to express something like the one-definition rule in a grammar? Or is it an inherently semantic constraint? (BTW, I consider the sentences about borogoves and green ideas to be grammatically correct nonsense.) – Adrian McCarthy Jun 30 '21 at 22:38
  • @adrian: sure, generative grammars are turing complete. If you restrict them to a single non-terminal on the left-hand side, then C++ is impossible but you could define well-formed awk without too much trouble. As for the borogoves, at least in my idiolect, the version I provided weren't grammatical, likes these sentence. I think that's a common view of English grammar, and it's what Chomsky hoped to achieve. – rici Jul 01 '21 at 00:30
  • If I understand you correctly, you are saying that "PROGRAM p; BEGIN q(a); x := f(y) END." is not a valid Pascal program, as it does not declare any of its identifiers except the program name, and _therefore_ also not a grammatically correct Pascal program? If so, I believe your view is rather uncommon, or at least that it used to be so. Also you do need to know Pascal semantics to assess this: "PROGRAM p(output); BEGIN writeln END." writeln is an identifier, it is not declared, has a well-defined meaning here, but could have been redeclared as something else like any other identifier. – LHP Nov 01 '21 at 08:24
  • @lhp: your example is certainly not a valid Pascal program, right? It is not necessary (or possible) to start to execute it to determine that it's not valid; the error is purely textual, or "static" in the words of the Pascal report. A language, in formal language theory, is just the set of texts which satisfy the language's definition, and that set cannot be generated with a CFG. There is a context-free superset, but that's always the case, since Σ* is regular and a superset of any language over Σ. So the existence of a CF superset is not mathematically interesting. – rici Nov 02 '21 at 04:49
  • Also, I don't require the semantics to validate a program which uses writeln. I just need the implicit prologue, which can be written as a text string containing a sequence of Pascal declarations. It's still textually verifiable, and thus static rather than dynamic. – rici Nov 02 '21 at 04:54
  • My examples certainly are grammatically valid Pascal programs. Both of them. The first isn't executable, because the identifiers are not defined, but saying it isn't valid would be like saying "Santa brings presents for Christmas" isn't an English sentence because Santa does not exist. And that "implicit prologue" you mention can not "be written as a text string containing a sequence of Pascal declarations", as the declaration for writeln cannot be written in Pascal. And if we can assume such a prologue, it could just as well include definitions of q, a,x, f and y... – LHP Nov 05 '21 at 05:44

If I take "computer languages" more broadly than "programming languages," I believe there are several declarative languages you could consider, though it may also depend on what you consider contemporary.

Possible candidates:

  • XML
  • LLVM IR
  • many configuration languages, like INI files
  • some network protocols
  • some data formats, like CSV files

Most (all?) flavors of the notation used to write regular expressions are not themselves regular languages, but they are LL(1).
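As a sanity check on that claim, a minimal regex dialect (alternation, concatenation, the usual postfix repetition operators, and grouping) can be written down almost directly as an LL(1) grammar, since each alternative is committed by its first token. This sketch is deliberately simplified, and needs the same left-factoring exercise as the grammar in the other answer:

    regex  => branch | branch '|' regex
    branch => piece | piece branch
    piece  => atom | atom '*' | atom '+' | atom '?'
    atom   => CHARACTER | '(' regex ')'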

For actual programming languages, these are all probably too old to be considered contemporary, and many have common extensions that probably violate LL(k).

  • assembly languages
  • BASIC
  • LISP
  • The Wirth languages: Pascal, Modula, Oberon
  • Prolog

I don't know all of the above languages well enough to be conclusive, which is why I suggest them only as possible candidates. If you know of a syntactic ambiguity in any of these, please educate me in the comments.

Adrian McCarthy