What is the class of languages that can be parsed with yacc?

Question

I recently learned that C does not have a context-free grammar. I also recently learned that gcc used to use yacc to parse C. The manual for the the yacc utility states "The class of specifications accepted [by yacc] is a very general one: LALR(1) grammars with disambiguating rules", while Wikipedia states that LALR grammars are a subset of deterministic context-free grammars, which are a subset of the context-free grammars. If C is not even context-free (much less a deterministic context-free language), and yet yacc can parse C, then what class of languages can yacc parse, if not the subset of context-free languages that have LALR(1) grammars?

C is usually compiled with Yacc with context-aware feedback between the lexer and the parser. This is so that things like `typedef` names can be handled correctly. — Jonathan Leffler, Jun 13 '19 at 01:08
Yes, this means that C is not context-free. My question is, if C is not context-free, then it is not a LALR(1) language, and if C can be parsed by yacc (which claims to accept specifically LALR(1) grammars), then what class of languages does yacc actually parse? It's obviously not languages with LALR(1) grammars, or it wouldn't be able to parse C. — user9889635, Jun 13 '19 at 01:13
FYI, the first question you linked to is about C++, not C, although it's true for C as well. — Barmar, Jun 13 '19 at 01:23
Ah, you're right, I meant to link another thing, thanks! It should be fixed now. — user9889635, Jun 13 '19 at 01:24
As @JonathanLeffler said, it uses feedback between the parser and lexer. My guess: when a name is defined, it gets added to a token table used by the lexer, and future uses of the name get classified correctly. This makes it context-sensitive, even though the grammar is not. — Barmar, Jun 13 '19 at 01:28
The quotation from the *yacc* manual answers your question compeletey. and the Wikipedia entry doesn't contradict it. — user207421, Jun 13 '19 at 02:24
The contradiction is that C is not a context-free language, yacc supposedly accepts a subset of context-free grammars, and yacc can parse C. Either C is context-free, or yacc can parse things that aren't in the LALR subset of context-free grammars, in which case my question still stands of what exactly DOES yacc parse? — user9889635, Jun 13 '19 at 02:39
That's not what the *man* page says, and it's not what Wikipedia says either. Neither of them states that "*yacc* supposedly accepts a subset of context-free languages'. — user207421, Jun 13 '19 at 03:12
I did not say it was from the man page, I said it was from the manual, and the quote I gave literally links to the yacc manual where it says that yacc accepts a LALR(1) grammar as input. The Wikipedia page says that LALR grammars are a subset of the context-free grammars. Ergo, yacc accepts a specification of a subset of the context-free grammars as input. — user9889635, Jun 13 '19 at 03:19
It does indeed say that, and it also says that 'the class of specifications accepted by *yacc* is a very general one', and these two statements are not in conflict. It accepts LALR(1) specifications *and* ditto 'with disambiguatiing rules'. If it said it *only* accepted LALR(1) you would be correct, but it doesn't. — user207421, Jun 13 '19 at 04:42
@user207421 LALR(1) grammars with disambiguating rules are still not more powerful than context-free grammars. Disambiguating rules are about how to deal with ambiguous CFGs, they do not allow YACC to suddenly accept context-sensitive grammars. As others have said, YACC can parse C by making the parser feed information back to the lexer - that has nothing to do with disambiguating rules. — sepp2k, Jun 13 '19 at 13:18
@sepp2k I didn't say anything about which is more powerful. Don't put words into my mouth, and don't teach granny to suck eggs.I studied all this in 1979 with Frank de Remer. I am merely pointed out that the sentence the OP quoted answers his own question. — user207421, Jun 13 '19 at 23:11
@user207421 But it doesn't. The question was how YACC can be used to parse a context-sensitive language like C when it only accepts LALR grammars and you suggested that this is because of the disambiguating rules (at least that's the only way I see of interpreting your previous comment). I pointed out that it has nothing to do with disambiguating rules (or the generality of the class of grammars that yacc accepts), but about interactions between the lexer and the parser, which the quoted paragraph says nothing about. So no, it doesn't answer OP's question. — sepp2k, Jun 14 '19 at 09:54

score 5 · Accepted Answer · answered Jun 13 '19 at 03:09

Yacc generates computer programs, which are pretty well Turing complete. The yacc framework uses an LALR(1) framework to trigger actions, but the actions are arbitrary code.

Moreover, the input to yacc is a stream of tokens, not a direct input. The stream of tokens is produced by another computer program written in a Turing complete language, which can also manipulate its input in ways not restricted to context-free transcoding.

Finally, nothing stops a yacc-generated parser from initially accepting a superset of the intended language and then later analysing the context-free parse tree and rejecting certain constructs based on arbitrary computations, such as insisting that variables be declared before use (a context-sensitive computation).

In short, real-world parsers are pragmatically written programs, not theoretical academic exercises. Languages parsed by bison/yacc are generally "mostly" LALR(1), and their lexical analysis is generally "mostly" regular, but don't be surprised when the computer programs exploit their full power to transcend these limits.

That's what makes programming the interesting activity it is.

None of that makes the academic theory less useful. Bison/yacc and other parser generators take a lot of the gruntwork out of building parsers, because they can handle "most of" the analysis. And the closer a language comes to the analysable context-free model, the easier it is to generate ("most of") other useful tools: linters, syntax highlighters, reformatters, indexers, static analysers, documentation extractors, etc., etc. To say nothing of functioning as documentation for the syntax of the language itself.

I guess if the lexer is Turing-complete, then it could literally just parse the entire thing and then emit a serialized representation of the complete parse tree, which would mean that theoretically it could parse up to the natural languages. This answers my question pretty well. Thanks! — user9889635, Jun 13 '19 at 03:22
@user: the most obvious deviation of C from context-freeness is the preprocessor, which is a fundamental part of the language regardless of what anyone says. C macros are not quite Turing complete, but the consequence of preprocessing does not correspond to any clean theoretic model. And yet, there it is. Yacc is cheerfully ignorant both of what transformations are done before it's handed the tokens, and of what happens after each reduction is identified. So, yeah, the sky's the limit. But some languages are a better match for yacc than others. — rici, Jun 13 '19 at 03:31

score 1 · Answer 2 · answered Jul 04 '19 at 18:22

C doesn't have a context free grammar only for the trivial reason that the semantic classification of an identifier token (its lexical category) sometimes requires understanding how it is declared. C is designed for one pass compiling, so that at any point in the parse, everything that is relevant to parsing is known from prior declarations. The declarations in scope can be used to assign a lexical category to a token.

So for instance if the parser is facing (A)(B) in the middle of a statement block, this could be:

expression (B) being cast to type A.
argument list (B) being applied to function expression (A).

But this ambiguity doesn't have to arise in the parser because the lexical analyzer can peek into the scope, and classify A differently based on whether it is a typedef name, or something else, and these differently classified identifiers can then be targeted by separate grammar rules. It's like having a magic oracle that tags tokens with semantic info, so then the context-free technique can be applied.

Another problem, in the first place, in C is that it has a preprocessor. The grammar of C is specified in separate pieces: there is a preprocessor grammar, and there is a grammar for the preprocessed token stream. There can't be a context-free grammar for C as such which captures the nuances of its phrase structure, because preprocessing can redefine the syntax, and macros can be called anywhere, except within comments and string literals.

What is the class of languages that can be parsed with yacc?

2 Answers2