Writing Bison grammar to recognise JavaScript functions and ignore everything else

Question

I am trying to:

Read my JavaScript code through my grammar
Write a particular line inside the body of each function.

For example,

Input

function(){
    console.log('this is some function');
}
function somefunc (args){
    console.log('this is some other function');
    function(){
         console.log('function inside a function');
    }
}

Output

    function(){
        //auto generated debug log
        debug('some text');
        console.log('this is some function');
    }
    function somefunc (args){
       //auto generated debug log
       debug('some text');
       console.log('this is some other function');
       function(){
          //auto generated debug log
          debug('some text');
          console.log('function inside a function');
      }
   }

I am able to recognize the function, but I am not able to put everything in a single piece. I need some breadcrumbs to move in the correct direction of writing grammar to achieve the desired output.

Is my idea to just concentrate on parsing function and ignoring everything else (i.e., dumping them back) inherently flawed?

So far, I have done:

/* lexical grammar */
%lex
%%

\s+                   /* skip whitespace */
"("                   return '('
")"                   return ')'
<<EOF>>               return 'EOF'
"function"          return 'FUNCTION'
"{"              return '{'
"}"              return '}'
.              return 'ANYTHING'
/lex

/* operator associations and precedence */


%start expr

%% /* language grammar */

expr: any EOF
      {typeof console !== 'undefined' ? console.log($1) : print($1);
          return $1;}
;
fun: FUNCTION '(' any ')' '{' any '}'
     {$$=$1+$2+$3+$4+$5+'console.log(something)'+$6+$7}
;

any: ANYTHING
     {$$=$1;}
   | any ANYTHING
      {$$=$1+$2}
   |  FUNCTION '(' any ')' '{' any '}'
     {$$=$1+$2+$3+$4+$5+'console.log(something)'+$6+$7}
;

score 4 · Answer 1 · edited Jun 20 '20 at 09:12

Is my idea to just concentrate on parsing function and ignoring everything else (i.e. dumping them back) inherently flawed?

It's not inherently flawed. In fact, it is a common approach [Note 1]. However, your solution requires more work.

First, you need to make your lexer more robust. It should correctly identify comments and string literals. Otherwise, you risk matching false positives: apparent tokens hidden in such literals. Ideally you would also identify regexes but that is a lot more complicated since it requires cooperation from the parser, and a simplistic parser such as the one which you propose does not have enough context to distinguish division operators from the start of a regular expression. [Note 2]

You also need to recognize identifiers; otherwise, an identifier which happened to contain the characters function (such as compare_function) would also be a false match.

The problem arises because any cannot contain a FUNCTION token. So if your scanner produces a stray FUNCTION token, the parse will fail.

Also, remember that parentheses and braces are not ANYTHING tokens. Since a program will typically have many parentheses and braces which are not part of a function literal, you will need to add these to your any rules. Note that you don't want to add them as single tokens; rather, you need to add parenthesis-balanced sequences ('(' any ')', for example). Otherwise, you will have a shift-reduce conflict on the '}'. (function(){ var a = { };...: how does the parser know that the } does not close the function body?)

It will probably prove simpler to have two non-terminals, something like this [Note 3]:

any: /* empty */    { $$ = ""; }
   | any any_object { $$ = $1 + $2; }
   ;
any_object
   : ANYTHING
   | fun
   | '(' any ')'    { $$ = $1 + $2 + $3; }
   | '{' any '}'    { $$ = $1 + $2 + $3; }
   ;

The other issue you have is that whitespace is skipped by your scanner, so your parser will never see it. That means it won't be present in the semantic values so it will be stripped by your transformation. That will break any program which depends on automatic semicolon insertion, as well as certain other constructs (return 42;, for example; return42; is quite different.) You will probably want to recognize whitespace as a separate token, and add it both to your any rules (or the any_object rule above), as well as an optional element in your fun rule between function and ( and between ) and {. (Since whitespace will be included in any, you must not add it beside an any non-terminal; that could cause a reduce-reduce conflict.)

Speaking of automatic semicolon insertion, you would be well-advised not to rely on it in your transformed program. You should put an explicit semicolon after the inserted console.log(...) statement.

Notes

As Ira Baxter points out in a comment, this approach is generally called an "island parser", from the idea that you are trying to find "islands" in an ocean of otherwise uninteresting text. A useful paper which I believe popularized this term is Leon Moonen's 2001 contribution to WCRE, "Generating robust parsers using island grammars". (Google will find you full-text PDFs.) Google will also find you other information about this paradigm, including Ira Baxter's own more pessimistic answer here on SO
This is probably the most serious objection to the basic idea. If you don't want to address it, you'll need to place the following restrictions on regular expressions in the programs you want to transform:
- parentheses and braces must be balanced
- the regular expression cannot contain the string function.
The second restriction is relatively simple, since you could replace function with the entirely equivalent [f]unction. The first one is more troublesome; you would need to replace /(/ with something like /\x28/.
In your proposed grammar, there is an error because of a confusion about what any represents. The third production for any should not be a duplicate of the fun production; instead, it should allow fun to be added to an any sequence. (Perhaps you just left out any from that production. But even so, there is no need to repeat the fun production when you can just use the non-terminal.)

Love your answers. You might note the term for this is "island parser", where the "island" is the text you want to keep, and the "water" is stuff you want to ignore. — Ira Baxter, May 13 '16 at 17:31
@IraBaxter: Yes, that's worth mentioning for its googlability. I added it with a footnote. — rici, May 13 '16 at 17:42
thanks for such an elaborate response. I have been trying my whole weekend to fill the holes in my grammar (especially the space issue). It seems i should start afresh after revisiting the necessary concepts. — palash kulshreshtha, May 16 '16 at 05:18

Writing Bison grammar to recognise JavaScript functions and ignore everything else

1 Answers1

Notes