I'm working on a project to convert scripts written in Second Life's LSL language into Lua. I'm learning flex and btyacc for this project. Actually, it's a much larger project, this is just a part of it. http://www.sqlite.org/cgi/src As a first step, I want to ensure that the AST I produce is an accurate reflection of the input. So my idea is to generate a new file from that AST, then compare them. This means I need to include white space and comments in the AST, so that when I use it to gerenate the result file, it contains the exact same white space and comments.
I've been having problems with dealing with white space. Searching and experimenting for days and not getting anywhere. Every example I see just ignores the white space, not storing it for later use. I guess I'll have the exact same problem with comments, as they are basically just another form of space to be ignored.
I would have thought this was a standard thing to do, but I just can't find any examples. Anyone know of examples of similar things?
The source code is on github if someone wants to look over it and suggest an approach.
https://github.com/onefang/SledjHamr/blob/experimental/LuaSL/src LuaSL_yaccer.l
, LuaSL_yaccer.y
, LuaSL_LSL_tree.h
, and LuaSL_LSL_tree.c
The flex line that recognises spaces has it's action commented out. If I un-comment, I get a parse error.
SOLUTION
I made use of bdonlan's solution, but I transferred the project over to using lemon instead of btyacc in the middle of implementing it. This is what I did. The source code below is simplified.
With lemon you create a loop that calls the lexer, then you take the result of the lexer call and pass it to the parser. My loop allocates a new yylval structure each time, this structure contains a char pointer for the white space or comment. I use this yylval as my AST node structure as well, since it already contains most of what I need, and saves having to spend time reallocating memory or copying stuff around.
struct AST_node
{
struct AST_node *left;
struct AST_node *right;
char *white_space;
struct AST_token *token; /* common() stashes a pointer to a token structure here. */
int line, column;
union
{
/* The various types of symbols are stored here as per usual, including this one - */
int operationValue;
} value;
}
struct AST_node *lval = calloc(1, sizeof(struct AST_node);
while((yv = yylex(lval)) != 0)
{
Parse(pParser, yv, lval);
lval = calloc(1, sizeof(struct AST_node));
}
In the lexer I have a common() function that is called from each regular expressions action section. Among other things, if it's a comment or white space, I save the text to a static accumulator. If it's not a comment or white space, then I store the accumulator (if it exists) in the yylval structure, and clear the accumulator. This accumulator gathers together white space AND comments, so any given one in the yylval can contain both.
The lexer does not return a symbol if it's just white space / comments, thus it accumulates them until it gets around to emitting an actual symbol.
In lemon all terminals and non-terminals are the type used for yylval, so I can pass this to the C code in action sections. For example -
expr(A) ::= expr(B) LSL_ADD(D) expr(C). { A = addOperation(D, LSL_ADD, B, C); }
LSL_ADD is the symbol emitted from the lexer, and D is it's value, which is the yylval I created in the main loop to pass to the lexer. In this case, addOperation() adds the pointers to the left and right AST nodes (B and C) into the AST structure (D), and tucks the symbol in there to (so that I later know this particular operation is an addition).
struct AST_node *addOperation(struct AST_node *lval, int type, struct AST_node *left, struct AST_node *right)
{
lval->left = left;
lval->right = right;
lval->value.operationValue = type;
return lval;
}
Later, when I'm reconstructing the original source code from the AST, I just output the white space / comment before outputting the symbol in the same AST node.
Yes, I know there is some duplication in the above code, no need to stash the type in both the token member and the union. I'll clean that out of my code later. For now though, it servers to illustrate what's going on.