
Is there any reason why there are no parser generators that consume straight BNF?

I'm familiar with JavaCC and Antlr, and recently came across Parse2. It seems that each has its own notation. BNF is really easy to read and the other notations aren't. BNF is unambiguous. Is there some inherent reason why I can't feed BNF to a compiler compiler and get a parse tree out?

ccleve
  • I used a parser generator written by DeRemer and Pennello that used BNF as its grammar language, but it isn't common. One issue is that BNF isn't self-describing. – user207421 Oct 23 '14 at 21:49
  • The difference between BNF and ANTLR/Flex notations is that the former uses `::=` between the left- and right-hand sides, while the latter uses `:`, and that the latter allow (or encourage) you to end productions with a semicolon. How does that make one of those notations "really easy to read" and the other one not? – rici Oct 23 '14 at 22:04
  • BNF also has the angle brackets around the variables instead of quotes around the constants, which is why it isn't self-describing. I can't agree that it's easier to read than what's commonly used now. – user207421 Oct 23 '14 at 22:12
  • @close-voters This question is not 'primarily opinion-based'. There were good reasons for the move away from BNF and I have listed a couple above. – user207421 Oct 23 '14 at 23:08
  • @EJP: true, another superficial difference. (Didn't the original BNF use **bold** to represent constants? You could think of that as separating the described alphabet from the describing alphabet, although describing that two-font structure would require a third font, which leads to infinite descent. On the whole, quotes are simpler.) – rici Oct 23 '14 at 23:10
  • @rici No. See the [Algol-60 report](http://web.eecs.umich.edu/~bchandra/courses/papers/Naure_Algol60.pdf). – user207421 Oct 23 '14 at 23:11
  • @EJP: I was five years old in 1960; I started programming a little before Algol-68, which was my first exposure to Algol. I'm pretty sure that was typeset; I can't vouch for Algol-60. But I do remember when manpages did bolding by overtyping twice, so typewriters and bold are not incompatible. Italics, on the other hand, require a replaceable typeball. Or a pair of pliers. – rici Oct 23 '14 at 23:14
  • @rici You asked about 'original BNF'. BNF originally appeared in the Algol-60 report. Algol-60 preceded Algol-68. – user207421 Oct 23 '14 at 23:28
  • @EJP, if the PDF you provide is the actual document, it certainly uses typographical variation, but not as I suggested: "underlining [in typewritten copy, boldface type in printed copy] is used for defining independent basic symbols." So I stand corrected. The `|` mark does not appear to be part of the target alphabet at all; the printed -- typographically enhanced :) -- report does not use < and > to delimit non-terminals, but rather 〈 and 〉 which avoids ambiguity but does not allow for self-description. Really, none of this justifies any claim about readability, though. IMHO. – rici Oct 23 '14 at 23:29
  • You're quibbling. I didn't say anything about one kind of angle bracket versus another. The issue here is that BNF quotes the variables, and has done since 1960, as I have shown, when the specific terms you're now using didn't even exist, and that this reduces readability, among other things. – user207421 Oct 24 '14 at 04:58
  • Afaik Coco/R also accepts (E)BNF – Marco van de Voort Oct 27 '14 at 14:51

3 Answers


Is there any reason why there are no parser generators that consume straight BNF?

No, but most parser generators started out trying to do everything. It took the community a long time to figure out that sometimes less is more.

Tools that consume some variation of BNF have been around for a long time.

A paper describing MetaII was published back in 1964. MetaII's baseline BNF is what you would expect (a left-hand side plus a right-hand side), plus a number of now relatively standard EBNF extensions (Kleene star and plus, grouping, alternatives) and non-standard embedded actions (used to generate literally syntax-directed translation output). It assumes built-in token lexers. MetaII can meta-compile its own grammar to reproduce itself exactly [the paper shows this, and it is a mind-blowing moment when you grok why this works], thus enabling bootstrapping to more sophisticated versions and to compiling other simple languages. A common, easy extension is defining grammar rules for lexical tokens. I built a variety of implementations, and tools that used this, back in the 1970s because it was so cool and so useful.
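To give the flavor, here is a sketch of a couple of rules written from memory in MetaII's notation (not verbatim from the paper): $( ... ) is zero-or-more repetition, / separates alternatives, .ID and .NUMBER are the built-in token recognizers, .OUT emits translation output, and each rule ends with ., :

EXP = TERM $( '+' TERM .OUT('ADD') / '-' TERM .OUT('SUB') ) .,
TERM = .ID / .NUMBER / '(' EXP ')' .,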

Tree Meta added actions for building trees, and an auxiliary grammar for pattern matching trees to generate output.

When you add in all the extra EBNF and generation machinery, the resulting "BNF" for both MetaII and Tree Meta can be fairly hard for the uninitiated to read, mostly because it is dense and you have to be familiar with the context. Otherwise it looks like the output of a broken chain printer (for those of you old enough to know what that is).

Most modern compiler-compilers aren't much different. YACC has extended BNF with embedded code (in your favorite language) to do actions, usually used to build trees. JavaCC uses an extended BNF; with JJTree you can build ASTs. ANTLR 1/2/3 also has an extended BNF, with embedded actions for building trees. That makes them just as, er, ugly as MetaII... no progress in 40 years, IMHO.
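For instance, a yacc-style rule interleaves the grammar with action code; mknode here is a hypothetical tree-node constructor, not part of yacc itself:

expr : expr '+' term   { $$ = mknode('+', $1, $3); }
     | term            { $$ = $1; }
     ;

Readable at this size, but multiply it by a few hundred rules with multi-line actions and the grammar disappears into the code.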

Our DMS Software Reengineering Toolkit (see my bio) uses the barest BNF you can imagine, for truly complex grammars such as IBM Enterprise COBOL, Fortran 2005 and C++14. It looks like:

  LHS = RHS1 RHS2 ...  RHSn ;

for various tokens LHS, RHS1 ... RHSn. Lists are easy: you use two rules, one for the base case, one for list extension. Alternatives are easy: code another rule. It is technically straightforward to code grammar rules for tokens simply as grammar rules whose terminals are actual characters. (DMS offers, and we typically use, a real lexer for parsing speed reasons).
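For instance, the two-rule list pattern looks like this (the rule names are made up for illustration):

argument_list = argument ;
argument_list = argument_list ',' argument ;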

Here is a DMS BNF for high school algebra (and a bit of calculus; some liberties in notation):

equation = sum ;
sum = product ;
sum = sum '+' product ;
sum = sum '-' product ;
product = term ;
product = product '*' term ;
product = product '/' term ;
term = primary ;
term = primary '^' term ;
primary = NUMBER ;
primary = NAME ;
primary = '(' sum ')' ;
primary = '-' primary ;
primary = NAME '(' sum ')' ;
primary = 'D' NAME ':' primary ; -- differentiate primary
primary = 'I' sum 'D' NAME ; -- indefinite integral
primary = 'I' sum ',' sum ':' sum 'D' NAME ; -- definite integral

A real DMS grammar file has other things in it for describing prettyprinting, etc. Here you can see a bigger example of a DMS grammar, for M6809 assembly code.

What is interesting is that DMS builds ASTs using only the grammar as a guide; no additional tree-building actions. By avoiding the need to specify actions-while-parsing (to build tree nodes), the resulting grammars are pretty simple to read.
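For example, parsing a+b*c with the algebra grammar above yields a concrete tree shaped like this (a hand-drawn sketch; chains of single-child nodes are shown collapsed with arrows):

equation → sum
├── sum → product → term → primary → NAME "a"
├── '+'
└── product
    ├── product → term → primary → NAME "b"
    ├── '*'
    └── term → primary → NAME "c"

The abstract version can drop the single-child chain nodes entirely, leaving essentially '+'(a, '*'(b, c)).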

DMS has been doing this since about 1998; it is the first tool I know of that took this approach. The ANTLR guys finally figured out this was a good idea, and ANTLR4, as of 2013, will build a tree without any explicit embedded actions, although it still has actions for other purposes. DMS's trees can be either concrete (following the grammar directly) or "abstract" (many tree nodes are dropped, as they can be reconstructed by DMS on demand given the abstract tree and the grammar). The abstract trees are actually pretty decent. I don't know what ANTLR4 does here.
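For comparison, here is a sketch of my own (abbreviated; the function-call and calculus rules are omitted) of the algebra grammar in ANTLR4 notation, which likewise needs no tree-building actions and tolerates the direct left recursion:

grammar Algebra;
equation : sum EOF ;
sum      : sum ('+'|'-') product | product ;
product  : product ('*'|'/') term | term ;
term     : primary ('^' term)? ;
primary  : NUMBER | NAME | '(' sum ')' | '-' primary ;
NUMBER   : [0-9]+ ;
NAME     : [a-zA-Z]+ ;
WS       : [ \t\r\n]+ -> skip ;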

The really nice thing about this approach is that one can write and revise really complicated, big grammars by simply revising the rules; the construction of ASTs is "free", and it is always "right" with respect to the grammar. That allows DMS to provide a variety of other related tools (prettyprinting, source-to-source transformation) using that as a baseline. (DMS is technically a "meta" compiler in the sense that it can parse its own grammars using its own grammar; we use this capability of DMS to generate the parsers implied by those grammars.)

You can see a full example of this at "algebra as a dms domain", also via my bio. (The BNF above was taken from there.) I'd provide links, but many SO people dislike that.

Ira Baxter

Marpa::R2, the Perl interface to Marpa, a general BNF parser, accepts straight BNF as the grammar description and generates a parser for it in Perl. Here is an example taken almost literally from a BNF grammar tutorial.

<tree>  ::= '(' <list> ')'
<list>  ::= <thing> | <list> ',' <thing>
<thing> ::= <tree> | <name>             
<name>  ::= 'ant' | 'bat' | 'cow' | 'dog' | 'cat'

The full code example.
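From memory (so treat this as a sketch that may differ in detail from the full example), adapting that BNF to Marpa::R2's scanless (SLIF) grammar source takes only a few declaration lines around the rules themselves:

:default ::= action => [name,values]
:start ::= <tree>
<tree>  ::= '(' <list> ')'
<list>  ::= <thing> | <list> ',' <thing>
<thing> ::= <tree> | <name>
<name>  ::= 'ant' | 'bat' | 'cow' | 'dog' | 'cat'
:discard ~ whitespace
whitespace ~ [\s]+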

rns

Check this one out: https://bnfc.digitalgrammars.com/

What is the BNF Converter?

The BNF Converter is a compiler construction tool generating a compiler front-end from a Labelled BNF grammar. It is currently able to generate C, C++, C#, Haskell, Java, and OCaml, as well as XML representations.

Given a Labelled BNF grammar the tool produces:

• an abstract syntax implementation
• a case skeleton for the abstract syntax in the same language
• an Alex, JLex, or Flex lexer generator file
• a Happy, CUP, or Bison parser generator file
• a pretty-printer as a Haskell/Java/C++/C module
• a LaTeX file containing a readable specification of the language
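To give a flavor of Labelled BNF, here is a sketch based on BNFC's calculator tutorial: each rule carries a label naming the AST constructor it builds, and the coercions pragma generates the precedence boilerplate:

EAdd. Exp  ::= Exp  "+" Exp1 ;
ESub. Exp  ::= Exp  "-" Exp1 ;
EMul. Exp1 ::= Exp1 "*" Exp2 ;
EDiv. Exp1 ::= Exp1 "/" Exp2 ;
EInt. Exp2 ::= Integer ;
coercions Exp 2 ;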

This open source one also seems promising: https://github.com/navstev0/nodebnf

BNF is a framework for an interpreter, a BNF compiler, and a language parser. It can use BNF, ABNF, or a blend of the two. The prior version used a custom JavaScript mark-up, molded after BNF, to interpret script files; this feature was removed in favor of inline compiling.

And here is a C++ one, which is also open source: https://github.com/chsu16/BNF-Compiler#bnf-compiler

BNF-Compiler is a context-free grammar compiler written in C++ with flex/bison. It implements a lexical analyzer, an LALR parser, an abstract syntax tree, a symbol table with type checking, and code generation. The project was built for the Compiler Design class at UCSC.

Eng. M.Hamdy