1

I'm working on a project to convert scripts written in Second Life's LSL language into Lua. I'm learning flex and btyacc for this project. Actually, it's a much larger project, this is just a part of it. http://www.sqlite.org/cgi/src As a first step, I want to ensure that the AST I produce is an accurate reflection of the input. So my idea is to generate a new file from that AST, then compare them. This means I need to include white space and comments in the AST, so that when I use it to gerenate the result file, it contains the exact same white space and comments.

I've been having problems with dealing with white space. Searching and experimenting for days and not getting anywhere. Every example I see just ignores the white space, not storing it for later use. I guess I'll have the exact same problem with comments, as they are basically just another form of space to be ignored.

I would have thought this was a standard thing to do, but I just can't find any examples. Anyone know of examples of similar things?

The source code is on github if someone wants to look over it and suggest an approach.

https://github.com/onefang/SledjHamr/blob/experimental/LuaSL/src LuaSL_yaccer.l, LuaSL_yaccer.y, LuaSL_LSL_tree.h, and LuaSL_LSL_tree.c

The flex line that recognises spaces has it's action commented out. If I un-comment, I get a parse error.

SOLUTION

I made use of bdonlan's solution, but I transferred the project over to using lemon instead of btyacc in the middle of implementing it. This is what I did. The source code below is simplified.

With lemon you create a loop that calls the lexer, then you take the result of the lexer call and pass it to the parser. My loop allocates a new yylval structure each time, this structure contains a char pointer for the white space or comment. I use this yylval as my AST node structure as well, since it already contains most of what I need, and saves having to spend time reallocating memory or copying stuff around.

struct AST_node
{
    struct AST_node  *left;
    struct AST_node  *right;
    char             *white_space;
    struct AST_token *token;    /* common() stashes a pointer to a token structure here.  */
    int               line, column;
    union
    {
        /* The various types of symbols are stored here as per usual, including this one - */
        int operationValue;
    } value;
}

struct AST_node *lval = calloc(1, sizeof(struct AST_node);

while((yv = yylex(lval)) != 0)
{
    Parse(pParser, yv, lval);
    lval = calloc(1, sizeof(struct AST_node));
}

In the lexer I have a common() function that is called from each regular expressions action section. Among other things, if it's a comment or white space, I save the text to a static accumulator. If it's not a comment or white space, then I store the accumulator (if it exists) in the yylval structure, and clear the accumulator. This accumulator gathers together white space AND comments, so any given one in the yylval can contain both.

The lexer does not return a symbol if it's just white space / comments, thus it accumulates them until it gets around to emitting an actual symbol.

In lemon all terminals and non-terminals are the type used for yylval, so I can pass this to the C code in action sections. For example -

expr(A) ::= expr(B) LSL_ADD(D) expr(C).  { A = addOperation(D, LSL_ADD, B, C); }

LSL_ADD is the symbol emitted from the lexer, and D is it's value, which is the yylval I created in the main loop to pass to the lexer. In this case, addOperation() adds the pointers to the left and right AST nodes (B and C) into the AST structure (D), and tucks the symbol in there to (so that I later know this particular operation is an addition).

struct AST_node *addOperation(struct AST_node *lval, int type, struct AST_node *left, struct AST_node *right)
{
    lval->left = left;
    lval->right = right;
    lval->value.operationValue = type;
    return lval;
}

Later, when I'm reconstructing the original source code from the AST, I just output the white space / comment before outputting the symbol in the same AST node.

Yes, I know there is some duplication in the above code, no need to stash the type in both the token member and the union. I'll clean that out of my code later. For now though, it servers to illustrate what's going on.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
David Seikel
  • 53
  • 1
  • 8
  • I'm new to this site, and not sure what the point of Sangeeth's edit was. I liked my spaces just the way they where, and the addition of a hyphen to "uncomment" just seems overly pedantic to me. shrugs – David Seikel Jan 09 '12 at 11:21
  • yeah, if you ask me everything but the filename markup was excessive. That's the downside of the 'suggested edits' feature - relatively new users put in overzealous edits, and as long as they don't look outright _wrong_ they tend to go through... :/ If you don't like an edit, always feel free to rollback your own post. – bdonlan Jan 10 '12 at 00:34
  • Check out my SO answer on how to regenerate source text from an AST. It discusses what you have to capture and provides some ideas as to how. See http://stackoverflow.com/a/5834775/120163 – Ira Baxter May 07 '16 at 20:48

1 Answers1

3

It's rare to have the AST be an exact, reversible transformation of the original source. For example, parenthesis might be lost in the parsing process (used only for precedence but elided in the final AST), or whitespace might be eliminated.

It is rather common to store line and/or character offsets for tokens so that error messages can cite their origin, but this is not enough for full reversibility.

I would recommend not having a fully reversible AST, but instead having a test suite of known AST results and inputs that produce them. But if you must, you could store all whitespace along with terminal tokens - for example, if you have code like:

1 + /* this is a comment */ 2

then the AST node corresponding to + would contain the single space before the +, and the node for 2 would contain /* this is a comment */ as additional whitespace data. Then, when you reverse the transformation, you can restore this whitespace as you go. Naturally, this will require explicit encoding of syntax features such as parenthesis as well.

With lex/yacc, you might implement this by maintaining a separate accumulator of whitespace/comments (or indexes into an input buffer); when you see whitespace or comments, update this accumulator but do not emit a token. When you hit any other token, including EOF, move this data to a different accumulator and reset the main accumulator. Once you return to yacc, your yacc terminals can then examine this secondary accumulator, and stash them into whatever data structure you assign to the terminal.

bdonlan
  • 224,562
  • 31
  • 268
  • 324
  • I am tracking things like parenthesis as well, and will track commments once I get white space sorted out. I also have a large amount of pre existing scripts, but "known results" is kinda hard to encode for them, given the nature of what these scripts do. So the easy thing to do is reverse the AST and compare to the large body of pre existing scripts. – David Seikel Jan 09 '12 at 11:12
  • I like your accumulator idea. I'll experiment with that and see how I go. – David Seikel Jan 09 '12 at 11:17
  • I'm new to this site, and apparently I don't have enough reputation to vote your answer up. – David Seikel Jan 09 '12 at 11:19
  • Got it to work, mostly. At least it's good enough that I'll move onto the next step, which is to try out lemon instead of btyacc. The same principle should apply, so I'll accept this answer. – David Seikel Jan 09 '12 at 17:57
  • When I say 'known results', I'm referring to handcrafted input files with corresponding ASTs – bdonlan Jan 09 '12 at 22:39
  • With so many existing scripts that I have access to, in a language that's full of all sorts of quirks, the best method of making sure I can accurately convert LSL to Lua is to first convert the LSL to LSL and run diff over the results. If they are identical after passing through my AST, then I know the AST is good enough to turn into Lua. It passed that test already using lemon, so now I'm on the next step, generating Lua code. – David Seikel Feb 03 '12 at 10:26