2

I have a data file format which includes

  • /* comments */
  • /* nested /* comments */ too */ and
  • // c++ style single-line comments..

As usual, these comments can occur everywhere in the input file where normal white space is allowed.

Hence, rather than pollute the grammar proper with pervasive comment-handling, I have made a skipper parser which will handle white space and the various comments.

So far so good, and i am able to parse all my test cases.

In my use case, however, any of the parsed values (double, string, variable, list, ...) must carry the comments preceding it as an attribute, if one or more comments are present. That is, my AST node for double should be

struct Double {
   double value;
   std::string comment;
};

and so forth for all the values I have in the grammar.

Hence I wonder if it is possible somehow to "store" the collected comments in the skipper parser, and then have them available for building the AST nodes in the normal grammar?

The skipper which processes comments:

template<typename Iterator>
struct SkipperRules : qi::grammar<Iterator> {
    SkipperRules() : SkipperRules::base_type(skipper) {
        single_line_comment = lit("//") >> *(char_ - eol) >> (eol | eoi);
        block_comment = ((string("/*") >> *(block_comment | char_ - "*/")) >> string("*/"));
        skipper = space | single_line_comment | block_comment;
    }
    qi::rule<Iterator> skipper;
    qi::rule<Iterator, std::string()> block_comment;
    qi::rule<Iterator, std::string()> single_line_comment;
};

I can store the commments using a global variable and semantic actions in the skipper rule, but that seems wrong and probably won't play well in general with parser backtracking. What's a good way to store the comments so they are later retrievable in the main grammar?

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214

1 Answers1

2

I can store the commments using a global variable and semantic actions in the skipper rule, but that seems wrong and probably won't play well in general with parser backtracking.

Good thinking. See Boost Spirit: "Semantic actions are evil"?. Also, in your case it would unnecessarily complicate the correlation of source location with the comment.

can I collect attributes from my skipper parser?

You cannot. Skippers are implicitly qi::omit[] (like the separator in the Kleene-% list, by the way).

In my use case, however, any of the parsed values (double, string, variable, list, ...) must carry the comments preceding it as an attribute, if one or more comments are present. That is, my AST node for double should be

struct Double {
   double value;
   std::string comment;
};

There you have it: your comments are not comments. You need them in your AST, so you need them in the grammar.

Ideas

I have several ideas here.

  1. You could simply not use the skipper to soup up the comments, which, like you mention, is going to be cumbersome/noisy in the grammar.

  2. You could temporarily override the skipper to just be qi::space at the point where the comments are required. Something like

    value_ = qi::skip(qi::space) [ comment_ >> (string_|qi::double_|qi::int_)  ];
    

    Or given your AST, maybe a bit more verbose

    value_ = qi::skip(qi::space) [ comment_ >> (string_|double_|int_) ];
    string_ = comment_ >> lexeme['"' >> *('\\' >> qi::char_ | ~qi::char_('"')) >> '"'];
    double_ = comment_ >> qi::real_parser<double, qi::strict_real_policies<double> >{};
    int_    = comment_ >> qi::int_;
    

    Notes:

    • in this case make sure the double_, string_ and int_ are declared with qi::space_type as the skipper (see Boost spirit skipper issues)
    • the comment_ rule is assumed to expose a std::string() attribute. This is fine if used in the skipper context as well, because the actual attribute will be bound to qi::unused_type which compiles down to no-ops for attribute propagation.
    • As a subtler side note I made sure to use strict real policies in the second snippet so that the double-branch won't eat integers as well.
  3. A fancy solution might be to store the souped up comment(s) into a "parser state" (e.g. member variable) and then using on_success handlers to transfer that value into the rule attribute on demand (and optionally flush comments on certain rule completions).

    I have some examples of what can be achieved using on_success for inspiration: https://stackoverflow.com/search?q=user%3A85371+on_success+qi. (Specifically look at the way position information is being added to AST nodes. There's a subtle play with fusion-adapted struct vs. members that are being set outside the control of autmatic attribute propagation. A particularly nice method is to use a base-class that can be generically "detected" so AST nodes deriving from that base magically get the contextual comments added without code duplication)

    Effectively this is a hybrid: yes you use semantic actions to "side-channel" the comment values. However, it's less unwieldy because now you can deterministically "harvest" those values in the on-success handler. If you don't prematurely reset the comments, it should even generically work well under backtracking.

    A gripe with this is that it will be slightly less transparent to reason about the mechanics of "magic comments". However, it does sit well for two reasons:

    - "magic comments" are a semantic hack whichever way you look at it, so it matches the grammar semantics in the code
    - it does succeed at removing comment noise from productions, which is effectively what the comments were from in the first place: they were embellishing the semantics without complicating the language grammar.
    

I think option 2. is the "straight-forward" approach that you might not have realized. Option 3. is the fancy approach, in case you want to enjoy the greater genericity/flexibility. E.g. what will you do with

  /*obsolete*/ /*deprecated*/ 5.12e7

Or, what about

  bla = /*this is*/ 42 /*also relevant*/;

These would be easier to deal with correctly in the 'fancy' case.

So, if you want to avoid complexity, I suggest option 2. If you need the flexibility, I suggest option 3.

sehe
  • 374,641
  • 47
  • 450
  • 633
  • Thank you for your thorough response. I am well aware of the odd examples you present, with multiple comments preceding a node, or comments following a node. Up front the way to handle those is pragmatic -- the existing hand-written parser simply pushes onto a comment stack comments as they are encountered, then merge them into one and sticks them onto whichever node is created next. I think approach 3 is the way to go to mimic this and have flexibility to improve it later. At least I have something to try out now. – Jacob Lorensen May 14 '21 at 15:06
  • Yup. That's sounds like a good analysis. I'm ready to help if you have some specific code to help with – sehe May 14 '21 at 17:05
  • it works --- to a degree. I am now diving into unit tests for all the odd examples you also mentioned, and frankly, the parser trying out/backtracking over "value = double|string|list|struct|table|..." makes my skipper's comment-collector collect the same comment multiple times. Ie. the fragment: "a = /*hi*/10" would push /*hi*/ on the comment stack 6 times as the rule alternatives are tried. My thinking is to keep track of the parser progress and only push comments if the comment is later in the input stream then what was processed so far. Was that what you were thinking? – Jacob Lorensen May 17 '21 at 08:02
  • I was thinking of deterministically gating the collection (think `qi::eps[some_action]`). And qi::hold in case you need it. – sehe May 17 '21 at 14:32
  • I have been trying for hours to get my head around your suggestion. and how I could employ eps/hold to achieve what I want, to no avail. I mean, any place where a white-space is allowed, a comment is allowed and should be collected... to be available for the next AST node. In essence, no gating: collection should conceptually always be on. – Jacob Lorensen May 17 '21 at 20:25
  • If you have code to share I'm happy to try to look at it. I can also come up with something unilaterally – sehe May 18 '21 at 10:17
  • I can make a smaller example, hopefully this evening, when I get back from work. In the mean time I have come up with an alternaive idea I may explore: Split parsing into 2 parsing layers. One that parses comments-and-tokens collecting comments and attaching them to the relevant comment-bearing tokens; and a second parsing layer, which will parse the comment-bearing token stream and create the real AST. This way I could have the best of both worlds without going through special eps/hold hoops, at the cost of 2 passes. – Jacob Lorensen May 18 '21 at 12:14
  • I have made a smallish example of my parser on https://github.com/jablo/spiritparser – Jacob Lorensen May 18 '21 at 20:20
  • @JacobLorensen I started out with a review/refactor/simplify as per usual: https://github.com/sehe/spiritparser/tree/sehe Will look at the actual problem later. – sehe May 19 '21 at 16:02
  • @JacobLorensen can you specify what nodes you expect to carry what comments? I'm a bit confused that you're assigning them at serveral levels in the AST hierarchy, and its not clear whether you wanted to have them repeat (not per node, but across nodes) – sehe May 19 '21 at 21:44
  • Comments should never be duplicated. They should be collected until the next AST node is created. Ie. the example: "/*c1*/ obj /*c2/* type /*c3*/ ( /*c4*/ attr /*c5*/: /*c6*/ 10 /*c7*/)", should result in: `ObjectDef{ Identifier{"myobj", CommentMixin{{"/*c1*/"}}}, Identifier{"sometype", CommentMixin{{"/*c2*/"}}}, {AttributeDef{ {Identifier{"attr", CommentMixin{{"/*c3*/", "/*c4*"}}}}, AttributeValue{DoubleVal{CommentMixin{{"/*c5*/", "/*c6*"}}, 10.0}}}}}; Comment /*c7*/ is lost (but would appear on the next Identifier whatever it were) – Jacob Lorensen May 20 '21 at 06:28
  • Hah. It took me a while to spot the typo in `/*c2/*` :) Other than that, this seems simpler than the stuff I came up with, so I pushed a new commit: https://github.com/sehe/spiritparser/commit/9dc57558932401355392133a4e44bd1c29d2ad00 Read the commit message for details. – sehe May 20 '21 at 21:47
  • Also, see the [code around `#ifdef TRUESTACK`](https://github.com/sehe/spiritparser/commit/9dc57558932401355392133a4e44bd1c29d2ad00#diff-608d8de3fba954c50110b6d7386988f27295de845e9d7174e40095ba5efcf1bbR157) (which was overcomplicating, apparently). The main difference in behaviour is that my take naturally include `/*c7*/` which I feel does make more sense because it does precede `')'` or `','`? – sehe May 20 '21 at 21:50
  • 1
    Thanks for the example. It looks very good. I agree with the /*c7*/ handling --- the existing parser doesn't do it that way, but it is much more logical and I will want to keep that. You definitely more than answered the question, and I have something to work with. Thanks a lot. – Jacob Lorensen Jun 02 '21 at 20:12