c++ parser and formatter using a single grammar declaration

Question

I have this idea to be able to 'declare' a grammar and use the same declaration for generating the format function.

A parser generator (e.g. antlr) is able to generate the parser from a bnf grammar. But is there a way to use the same grammar to generate the formatting code?

I just want to avoid manually having to sync the parsing code (generated) with a manually written formatting code, since the grammar is the same.

could I use the abstract syntax tree? boost::spirit? metaprogramming? anyone tried this?

thanks

score 3 · Accepted Answer · answered Sep 18 '15 at 17:11

It's not clear to me whether this question is looking for an existing product or library (in which case, the question would be out-of-scope for Stack Overflow), or in algorithms for automatically generating a pretty printer from (some formalism for) a grammar. Here, I've tried to provide some pointers for the second possibility.

There is a long history of research into syntax-directed pretty printing, and a Google or Citeseer search on that phrase will probably give you lots of reading material. I'd recommend trying to find a copy of Derek Oppen's 1979 paper, Prettyprinting, which describes a linear-time algorithm based on the insertion of a few pretty-printing operators into the tokenized source code.

Oppen's basic operators are fairly simple: they consist of indications about how code segments are to be (recursively) grouped, about where newlines must and might be inserted, and about where in a group to increase indentation depth. With the set of proposed operators, it is possible to create an on-line algorithm which prefers to break lines higher up in the parse tree, avoiding the tendency to over-indent deeply-nested code, which is a classic failing of naïve indentation algorithms.

In essence, the algorithm uses a two-finger solution, where the leading finger consumes new tokens and notices when the line must be wrapped, at which point it signals the trailing finger. The trailing finger then finds the earliest point at which a newline could be inserted and all the additional newlines and indents which must be inserted to conform with the operators, advancing until there is no newline between the fingers.

The on-line algorithm might not produced optimal indentation/reflowing (and it is not immediately obvious what the definition of "optimal" might be); for certain aspects of the pretty-printing, it might be useful to think about the ideas in Donald Knuth's optimal line-wrapping algorithm, as described in his 1999 text, Digital Typography. (More references in the Wikipedia article on line wrapping.)

Oppen's algorithm is not perfect (as indicated in the paper) but it may be "good enough" for many practical purposes. (I note some limitations below.) Tracing the citation history of this paper will give you a number of implementations, improvements, and alternate algorithms.

It's clear that a parser generator could easily be modified to simply insert pretty-printing annotations into a token stream, and I believe that there have been various attempts to create yacc-like pretty-printer generators. (And possibly ANTLR derivatives, too.) The basic idea is to embed the pretty printing annotations in the grammar description, which allows the automatic generation of a reduction action which outputs an annotated token stream.

Syntax-directed pretty printing was added to the ASF+SDF Meta-Environment using a similar annotation system; the basic algorithm and formalism is described by M.G.J. van der Brand in Pretty Printing in the ASF+SDF Meta-environment Past, Present and Future (1995), which also makes for interesting reading. (ASF+SDF has since been superseded by the Rascal Metaprogramming Language, which includes visualization tools.)

One important issue with syntax-directed pretty printing algorithms is that they are based on the parse of a tokenized stream, which means that comments have already been removed. Clearly it is desirable that comments be retained in a pretty-printed version of a program, but correctly attaching comments to the relevant code is not trivial, particularly when the comment is on the same line as some code. Consider, for example, the case of a commented-out operation embedded into code:

// This is the simplified form of actual code
int needed_ = (current_ /* + adjustment_ */ ) * 2;

Or the common case of trailing comments used to document variables:

   /* Tracking the current allocation */
   int   needed_;      // Bytes required.
   int   current_;     // Bytes currently allocated.
// int adjustment_;    // (TODO) Why is this needed?
   /* Either points to the current allocation, or is 0 */
   char* buffer_;

In the above example, note the importance of whitespace: the comments may apply to the previous declaration (even though they appear after the semicolon which terminates it) or to the following declaration(s), mostly depending on whether they are suffix comments or full-line comments, but the commented-out code is an exception. Also, the programmer has attempted to line up the names of the member variables.

Another problem with automated syntax-directed pretty-printing is handling incorrect (or incomplete) programs, as would need to be done if the pretty-printing is part of a Development Environment. Error-handling (and error recovery) is by far the most difficult part of automatically-generated parsers; maintaining useful pretty printing in this context is even more complicated. It's precisely for this reason that most IDEs use a form of peephole pretty-printing (another possible search phrase), or even adaptive pretty-printing where user indentation is used as a guide to the location of as-yet-unwritten code.

thanks, but I was looking for something else: I have used a parser generator to create the parser for a given grammar, I felt that I could use something(?) to 'generate' the formatting function for the same grammar as well — chris, Sep 18 '15 at 18:57
e.g. I have a grammar for a message which has two fields and some delimiters, I declare the grammar and I use a parser generator to create the parser code for my grammar. The opposite is: I have the values for my two fields, how do I get the message? (which includes everything that is already specified in my grammar). I want a similar approach to the parsing code (_generate_ the formatting code) - and one way is to ask the parser generator tool to build the AST, and embed the formatting instructions in that AST. — chris, Sep 18 '15 at 18:59
@tcris: OK, I didn't understand what you meant by "formatting function". There are some interesting papers on reversing a parse, but most of the recent ones are based on parsing combinators and Haskell, which may not be what you're interested in. Depending on your grammar and your AST, you may find some tricky issues. For example, in an expression grammar, how do you know whether to "reverse" the production: `expr: '(' expr ')'`? (It's doable, but if the AST doesn't include the parentheses -- which it normally doesn't because they're not semantic -- then it requires some work.) — rici, Sep 18 '15 at 19:06
@tcris... of course, you can embed the phantom parentheses into the AST in various ways, but then you have to ask whether unnecessary or redundant parentheses should be inserted by the formatter. (This is just one of a large class of such issues.) — rici, Sep 18 '15 at 19:08
@tcris... You might find some of the answers here useful: http://stackoverflow.com/questions/662041/theory-examples-of-reversible-parsers If you're interested in the functional approach, some papers I've glanced at are http://www.cse.chalmers.se/~nad/publications/danielsson-correct-pretty.pdf http://www.cs.mcgill.ca/~mboes/papers/cassette.pdf http://www.informatik.uni-marburg.de/~rendel/unparse/rendel10invertible.pdf . Good luck — rici, Sep 18 '15 at 19:17
ps: all this is not for a pretty-printer, but for allowing the fields of a message to be both 1) extracted from the input message text and 2) generate the output message text (using my own values for those fields) - is the plain inverse of the parser. — chris, Sep 18 '15 at 21:00
@tcris: I get that you're not looking for a pp. And probably your xchange format is simple enough that a naive AST walk will work. But the general inverse parser problem is more complicated. The AST for an expression might be (*, a, (+, b, c)). The naive walk produces `a*b+c`, which is wrong. It's easy to produce `(a*(b+c))`, but the parentheses weren't in the productions; they had to be synthesized. And you might prefer `a*(b+c)`, but how to know when to elide the parens? — rici, Sep 18 '15 at 21:27
I guess I should use all tokens in the AST, no token should be left out of my tree. Looking at [antlr](https://stackoverflow.com/questions/4931346/how-to-output-the-ast-built-using-antlr) docs now, this can be easily done — chris, Sep 19 '15 at 23:07
@tcris: DMS leaves tokens out of tree when it is possible to regenerate them, using just the tree and an explicit model of the grammar. Thus "if" and "("...")" tokens don't appear in the tree. *Nodes* for the rule containing these tokens *do* appear in the tree. That's enough to regenerate the missing tokens. I don't think ANTLR will do this for you. — Ira Baxter, Sep 21 '15 at 09:09

Ira Baxter · Answer 2 · 2018-08-08T16:10:15.067

OP asks, Has anyone tried this?

Yes. Our DMS Software Reengineering Toolkit can do this: you give it just a grammar, you get a parser that builds ASTs and you get a prettyprinter. We've used this on lots of parse/changeAST/unparse tasks for many language over the last 20 years, preserving the meaning of the source program exactly.

The process is to parse according to the grammar, build an AST, and then walk the AST to carry out prettyprinting operations.

However, you don't get a good prettyprinter. Nice layout of the reformatted source code requires that language cues for block nesting (e.g., matching '{' ... '}', 'BEGIN' ... 'END' pairs, special keywords 'if', 'for', etc.) be used to drive the formatting and indentation. While one can guess what these elements are (as I just did), that's just a guess and in practice a human being needs to inspect the grammar to determine which things are cues and how to format each construct. (The default prettyprinter derived from the grammar makes such guesses).

DMS provides support for that problem in the form of prettyprinter declarations woven into the grammar to provide the formatter engineer quite a lot of control over the layout. (See this SO answer for detailed discussion: https://stackoverflow.com/a/5834775/120163) This produces (our opinion) pretty good prettyprinters. And DMS does have an explicit grammar/formatter for full C++14. [EDIT Aug 2018: full C++17 in MS and GCC dialects)]

EDIT: rici's answer suggests that comments are difficult to handle. He's right, in the sense that you must handle them, and yes, it is hard to handle them if they are removed as whitespace while parsing. The essense of the problem is "removed as whitespace"; and goes away if you don't do that. DMS provides means to capture the comments (rather than ignoring them as whitespace) and attach them (automatically) to AST nodes. The decision as to which AST node captures the comments is handled in the lexer by declaring comments as "pre" (happening before a token) or "post"; this decision is heuristic on the part of the grammer/lexer engineer, but works actually pretty well. The token with comments is passed to the parser, which builds an AST node from it. With comments attached to AST nodes, the prettyprinter can re-generate them, too.

@tcris Boost Spirit comes close, but you can't _actually_ share the rules. Hth — sehe, Sep 18 '15 at 20:21
I always love the downvoters on answers that address the OP's question directly, especially when they have decided the answer is based on some commercial therefore necessarily evil product. — Ira Baxter, Aug 08 '18 at 16:06
Fwiw the upvote is mine. I don't think the comment will attract much positive attention though. — sehe, Aug 08 '18 at 16:09
@sehe: Somehow it doesn't feel right that the existence of a valid answer that meets all of SO's rules should draw negative attention. What I really object to is that this kind of reaction seems to be widespread. I'm always stunned that many programmers, that AFAICT mostly make a living doing commercial things, *object* to commercial software. [PS: Thanks for the upvote]. — Ira Baxter, Aug 10 '18 at 15:43

c++ parser and formatter using a single grammar declaration

2 Answers2