Is my idea to just concentrate on parsing function and ignoring everything else (i.e. dumping them back) inherently flawed?
It's not inherently flawed. In fact, it is a common approach [Note 1]. However, your solution requires more work.
First, you need to make your lexer more robust. It should correctly identify comments and string literals. Otherwise, you risk matching false positives: apparent tokens hidden in such literals. Ideally you would also identify regexes but that is a lot more complicated since it requires cooperation from the parser, and a simplistic parser such as the one which you propose does not have enough context to distinguish division operators from the start of a regular expression. [Note 2]
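To make the false-positive problem concrete, here is a minimal sketch of blanking out comments and string literals before looking for the keyword. The names `maskOpaque` and `countFunctionWord` are hypothetical, and the sketch assumes the input contains no regex literals and no template literals, which is exactly the hard case described above.

```javascript
// Patterns for the constructs a scanner should treat as opaque.
// Assumption: no regex literals and no template literals in the input.
const LINE_COMMENT = /\/\/[^\n]*/;
const BLOCK_COMMENT = /\/\*[\s\S]*?\*\//;
const DQ_STRING = /"(?:[^"\\\n]|\\[\s\S])*"/;
const SQ_STRING = /'(?:[^'\\\n]|\\[\s\S])*'/;

// One combined pattern; scanning left to right means a "//" inside a string
// is consumed as part of the string, and a quote inside a comment is
// consumed as part of the comment.
const OPAQUE = new RegExp(
  [LINE_COMMENT, BLOCK_COMMENT, DQ_STRING, SQ_STRING]
    .map((r) => r.source)
    .join("|"),
  "g"
);

// Replace every comment and string literal with spaces of the same length.
function maskOpaque(source) {
  return source.replace(OPAQUE, (m) => " ".repeat(m.length));
}

// Count occurrences of the characters "function", with or without masking.
function countFunctionWord(source, mask) {
  const text = mask ? maskOpaque(source) : source;
  return (text.match(/function/g) || []).length;
}
```

On an input such as `var s = "function"; // function here` followed by a real function declaration, the unmasked count is 3 and the masked count is 1. In a generated scanner, the same effect would come from dedicated comment and string rules rather than a masking pass.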
You also need to recognize identifiers; otherwise, an identifier which happened to contain the characters function (such as compare_function) would also be a false match.
The problem arises because any cannot contain a FUNCTION token. So if your scanner produces a stray FUNCTION token, the parse will fail.
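As a rough illustration of the identifier point, the sketch below matches whole identifier-shaped words first and only then decides whether a word is the keyword. The function name `classifyWords` and the ASCII-only identifier pattern are assumptions made for the example; real JavaScript identifiers allow more characters, and how your scanner generator expresses this depends on its own matching rules.

```javascript
// Identifier-shaped words (ASCII-only assumption).
const IDENT = /[A-Za-z_$][A-Za-z0-9_$]*/g;

// Only a word that is exactly "function" becomes a FUNCTION token;
// every other identifier is ordinary text.
function classifyWords(source) {
  const tokens = [];
  for (const m of source.matchAll(IDENT)) {
    const type = m[0] === "function" ? "FUNCTION" : "ANYTHING";
    tokens.push({ type, text: m[0] });
  }
  return tokens;
}

const words = classifyWords("compare_function(function_a, function () {})");
const keywordCount = words.filter((t) => t.type === "FUNCTION").length;
```

Here `compare_function` and `function_a` are consumed as single words, so only the standalone keyword yields a FUNCTION token.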
Also, remember that parentheses and braces are not ANYTHING tokens. Since a program will typically have many parentheses and braces which are not part of a function literal, you will need to add these to your any rules. Note that you don't want to add them as single tokens; rather, you need to add parenthesis-balanced sequences ('(' any ')', for example). Otherwise, you will have a shift-reduce conflict on the '}'. (function(){ var a = { };...: how does the parser know that the } does not close the function body?)
It will probably prove simpler to have two non-terminals, something like this [Note 3]:
any
    : /* empty */        { $$ = ""; }
    | any any_object     { $$ = $1 + $2; }
    ;
any_object
    : ANYTHING
    | fun
    | '(' any ')'        { $$ = $1 + $2 + $3; }
    | '{' any '}'        { $$ = $1 + $2 + $3; }
    ;
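To see how the two non-terminals interact, here is a hand-written recursive-descent sketch that mirrors the grammar above. It is not generated parser output. The token shape `{ type, text }`, the helper names, and the form of the fun rule (FUNCTION '(' any ')' '{' any '}', anonymous functions only) are assumptions for illustration. The `/*body*/` marker only shows where a transformation could insert text.

```javascript
// Tokens are assumed to look like { type, text }, where type is one of
// "ANYTHING", "FUNCTION", "(", ")", "{", "}".
function parse(tokens) {
  let pos = 0;
  const peek = () => (pos < tokens.length ? tokens[pos].type : "EOF");
  const expect = (type) => {
    if (peek() !== type) {
      throw new SyntaxError("expected " + type + ", got " + peek());
    }
    return tokens[pos++].text;
  };

  // any: /* empty */ | any any_object
  function any() {
    let out = "";
    while (peek() !== "EOF" && peek() !== ")" && peek() !== "}") {
      out += anyObject();
    }
    return out;
  }

  // any_object: ANYTHING | fun | '(' any ')' | '{' any '}'
  function anyObject() {
    switch (peek()) {
      case "ANYTHING": return expect("ANYTHING");
      case "FUNCTION": return fun();
      case "(": return expect("(") + any() + expect(")");
      case "{": return expect("{") + any() + expect("}");
      default: throw new SyntaxError("unexpected " + peek());
    }
  }

  // Assumed shape: fun: FUNCTION '(' any ')' '{' any '}'
  function fun() {
    const head = expect("FUNCTION") + expect("(") + any() + expect(")");
    return head + expect("{") + "/*body*/" + any() + expect("}");
  }

  const result = any();
  if (peek() !== "EOF") throw new SyntaxError("unbalanced " + peek());
  return result;
}
```

Because the inner `{ }` is consumed as a balanced any_object, the closing brace of the function body is never ambiguous: the brace that ends the body is the first one seen when no nested sequence is open.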
The other issue you have is that whitespace is skipped by your scanner, so your parser will never see it. That means it won't be present in the semantic values, so it will be stripped by your transformation. That will break any program which depends on automatic semicolon insertion, as well as certain other constructs (return 42;, for example; return42; is quite different.) You will probably want to recognize whitespace as a separate token, and add it both to your any rules (or the any_object rule above) and, as an optional element, to your fun rule between function and '(' and between ')' and '{'. (Since whitespace will be included in any, you must not add it beside an any non-terminal; that could cause a reduce-reduce conflict.)
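The effect of dropping whitespace can be checked with a small round-trip experiment. The sketch below uses a hypothetical `tokenize` helper and a deliberately crude token pattern (whitespace runs, identifier-like words, single other characters); it ignores comments, strings and regexes for brevity.

```javascript
// Every character is either whitespace, part of an identifier-like word,
// or a single "other" character, so the pieces always cover the whole input.
const PIECE = /\s+|[A-Za-z_$][A-Za-z0-9_$]*|[^\sA-Za-z_$]/g;

function tokenize(source, keepWhitespace) {
  const pieces = source.match(PIECE) || [];
  return keepWhitespace ? pieces : pieces.filter((p) => !/^\s+$/.test(p));
}

const src = "return 42;\nx = 1";
const kept = tokenize(src, true).join("");     // identical to src
const dropped = tokenize(src, false).join(""); // "return42;x=1"
```

With whitespace kept as tokens, concatenating the semantic values reproduces the input exactly. With whitespace skipped, `return 42;` collapses into `return42;` and the newline that separated two statements disappears.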
Speaking of automatic semicolon insertion, you would be well-advised not to rely on it in your transformed program. You should put an explicit semicolon after the inserted console.log(...) statement.
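A small experiment shows why the explicit semicolon matters. The helper names `logStatement` and `runs` and the message text are made up for the example. The hazard appears when the original function body begins with a parenthesis on a new line, because no semicolon is inserted automatically before a line that starts with `(`.

```javascript
// Build the statement to insert at the start of a function body.
function logStatement(name, terminate) {
  const call = "console.log(" + JSON.stringify("entering " + name) + ")";
  return terminate ? call + ";" : call;
}

// A body whose first statement starts with "(" on a new line.
const body = "\n(function () {})();";

const unsafe = logStatement("f", false) + body; // parsed as console.log(...)(function () {})()
const safe = logStatement("f", true) + body;

// Run the code as a function body and report whether it completed.
function runs(code) {
  try {
    new Function(code)();
    return true;
  } catch (e) {
    return false;
  }
}
```

Here `runs(safe)` completes normally, while `runs(unsafe)` fails with a TypeError, because the result of `console.log(...)` is called as if it were a function.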
Notes
1. As Ira Baxter points out in a comment, this approach is generally called an "island parser", from the idea that you are trying to find "islands" in an ocean of otherwise uninteresting text. A useful paper which I believe popularized this term is Leon Moonen's 2001 contribution to WCRE, "Generating robust parsers using island grammars". (Google will find you full-text PDFs.) Google will also find you other information about this paradigm, including Ira Baxter's own more pessimistic answer here on SO.
2. This is probably the most serious objection to the basic idea. If you don't want to address it, you'll need to place the following restrictions on regular expressions in the programs you want to transform:
- parentheses and braces must be balanced
- the regular expression cannot contain the string function.
The second restriction is relatively simple to satisfy, since you could replace function with the entirely equivalent [f]unction. The first one is more troublesome; you would need to replace /\(/ with something like /\x28/.
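The rewrites suggested in this note can be sanity-checked directly. In the sketch below, the `pairs` table and the `sameMatches` helper are hypothetical, and the `\x7b` row for a brace is an extrapolation of the `\x28` idea rather than something stated above.

```javascript
// Each pair is [original regex, rewrite whose source text hides the
// troublesome characters from a scanner that cannot recognize regex literals].
const pairs = [
  [/function/, /[f]unction/], // source no longer contains "function"
  [/\(/, /\x28/],             // source no longer contains "("
  [/\{/, /\x7b/],             // source no longer contains "{" (assumed analogue)
];

const samples = ["function", "a(b", "x{y", "plain", ""];

// True if both regexes accept and reject exactly the same sample inputs.
function sameMatches(a, b, inputs) {
  return inputs.every((s) => a.test(s) === b.test(s));
}

const allEquivalent = pairs.every(([a, b]) => sameMatches(a, b, samples));
```

Agreement on a handful of samples is only a spot check, not a proof of equivalence. It does confirm that the rewritten sources avoid the characters that would otherwise unbalance the any rules or produce a stray FUNCTION token.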
3. In your proposed grammar, there is an error because of a confusion about what any represents. The third production for any should not be a duplicate of the fun production; instead, it should allow fun to be added to an any sequence. (Perhaps you just left out any from that production. But even so, there is no need to repeat the fun production when you can just use the non-terminal.)