
I am finishing my ECMAScript 5.1/JavaScript grammar for JavaCC. I've done all the tokens and productions according to the specification.

Now I'm facing a big question which I don't know how to solve.

JavaScript has this nice feature of the automatic semicolon insertion:

What are the rules for JavaScript's automatic semicolon insertion (ASI)?

To quote the specifications, the rules are:

There are three basic rules of semicolon insertion:

  1. When, as the program is parsed from left to right, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if one or more of the following conditions is true:

    • The offending token is separated from the previous token by at least one LineTerminator.
    • The offending token is }.
  2. When, as the program is parsed from left to right, the end of the input stream of tokens is encountered and the parser is unable to parse the input token stream as a single complete ECMAScript Program, then a semicolon is automatically inserted at the end of the input stream.

  3. When, as the program is parsed from left to right, a token is encountered that is allowed by some production of the grammar, but the production is a restricted production and the token would be the first token for a terminal or nonterminal immediately following the annotation [no LineTerminator here] within the restricted production (and therefore such a token is called a restricted token), and the restricted token is separated from the previous token by at least one LineTerminator, then a semicolon is automatically inserted before the restricted token.

However, there is an additional overriding condition on the preceding rules: a semicolon is never inserted automatically if the semicolon would then be parsed as an empty statement or if that semicolon would become one of the two semicolons in the header of a for statement (see 12.6.3).

How could I implement this with JavaCC?

The closest thing to an answer I've found so far is this grammar from the Dojo toolkit, which has a JAVACODE part called insertSemiColon dedicated to the task. But I don't see this method being called anywhere (neither in the grammar nor in the rest of the jslinker code).

How could I approach this problem with JavaCC?

See also this question:

javascript grammar and automatic semocolon insertion

(No answer there.)

A question from the comments:

Is it correct to say that semicolons need only be inserted where semicolons are syntactically allowed?

I think it would be correct to say that semicolons need only be inserted where semicolons are syntactically required.

The relevant part here is §7.9:

7.9 Automatic Semicolon Insertion

Certain ECMAScript statements (empty statement, variable statement, expression statement, do-while statement, continue statement, break statement, return statement, and throw statement) must be terminated with semicolons. Such semicolons may always appear explicitly in the source text. For convenience, however, such semicolons may be omitted from the source text in certain situations. These situations are described by saying that semicolons are automatically inserted into the source code token stream in those situations.

Let's take the return statement for instance:

ReturnStatement :
    return ;
    return [no LineTerminator here] Expression ;

So (from my understanding) syntactically the semicolon is required, not just allowed (as in your question).

  • Is it correct to say that semicolons need only be inserted where semicolons are syntactically allowed? – Theodore Norvell Nov 22 '14 at 20:00
  • See also: http://stackoverflow.com/questions/15068782/automatic-semicolon-insertion-in-javascript-without-parsing – lexicore Nov 22 '14 at 21:49
  • [This](https://code.google.com/p/yaji-ecmascript-interpreter/source/browse/trunk/yaji-ecmascript-interpreter/src/FESI/Parser/EcmaScript.jjt) gives some clues. – lexicore Nov 22 '14 at 22:59

1 Answer

The three rules for semicolon insertion can be found in section 7.9.1 of the ECMAScript 5.1 standard.

I think rules 1 and 2 from the standard can be handled with semantic lookahead.

void PossiblyInsertedSemicolon() :
{}
{
    LOOKAHEAD( {semicolonNeedsInserting()} ) {}
|
    ";"
}

So when does a semicolon need inserting? When one of these is true:

  • The next token is not a semicolon and is on another line (getToken(1).kind != SEMICOLON && getToken(0).endLine < getToken(1).beginLine).
  • The next token is a right brace.
  • The next token is EOF.

So we need

// Java helper; declare it inside the parser class (between PARSER_BEGIN and PARSER_END).
boolean semicolonNeedsInserting() {
    return (getToken(1).kind != SEMICOLON && getToken(0).endLine < getToken(1).beginLine)
        || getToken(1).kind == RBRACE
        || getToken(1).kind == EOF;
}

That takes care of rules 1 and 2 of the standard.
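To see how the predicate behaves without generating a parser, here is a minimal stand-alone sketch in plain Java. The `Tok` class and the integer kind constants are hypothetical stand-ins for JavaCC's generated `Token` class and token constants; `prev` plays the role of `getToken(0)` and `next` the role of `getToken(1)`. It only models the decision, not the parsing.

```java
// Stand-alone model of semicolonNeedsInserting(); nothing here is generated by JavaCC.
class AsiPredicateDemo {
    // Hypothetical token kinds (JavaCC would generate these in the *Constants interface).
    static final int EOF = 0, SEMICOLON = 1, RBRACE = 2, IDENT = 3;

    // Hypothetical stand-in for JavaCC's Token: only the fields the predicate reads.
    static final class Tok {
        final int kind, beginLine, endLine;
        Tok(int kind, int beginLine, int endLine) {
            this.kind = kind;
            this.beginLine = beginLine;
            this.endLine = endLine;
        }
    }

    // Same condition as in the answer: prev ~ getToken(0), next ~ getToken(1).
    static boolean semicolonNeedsInserting(Tok prev, Tok next) {
        return (next.kind != SEMICOLON && prev.endLine < next.beginLine)
            || next.kind == RBRACE
            || next.kind == EOF;
    }

    public static void main(String[] args) {
        Tok a = new Tok(IDENT, 1, 1);
        // Next token on a later line: insert.
        System.out.println(semicolonNeedsInserting(a, new Tok(IDENT, 2, 2)));     // true
        // Next token on the same line: do not insert.
        System.out.println(semicolonNeedsInserting(a, new Tok(IDENT, 1, 1)));     // false
        // Next token is "}": insert.
        System.out.println(semicolonNeedsInserting(a, new Tok(RBRACE, 1, 1)));    // true
        // A real ";" on the next line: consume it instead of inserting.
        System.out.println(semicolonNeedsInserting(a, new Tok(SEMICOLON, 2, 2))); // false
        // End of input: insert.
        System.out.println(semicolonNeedsInserting(a, new Tok(EOF, 1, 1)));       // true
    }
}
```

Note that the predicate never looks at whether the next token is actually "offending"; it relies on PossiblyInsertedSemicolon() only being reached at points where a semicolon is required.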

For rule 3 (restricted productions), as mentioned in my answer to this question, you could do the following:

void returnStatement() :
{}
{
    "return"
    [   // Parse an expression unless either the next token is a ";", "}" or EOF, or the next token is on another line.
        LOOKAHEAD( {   getToken(1).kind != SEMICOLON
                    && getToken(1).kind != RBRACE
                    && getToken(1).kind != EOF
                    && getToken(0).endLine == getToken(1).beginLine} )
        Expression()
    ]
    PossiblyInsertedSemicolon() 
}
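The guard in that LOOKAHEAD can be checked in isolation with a small plain-Java sketch. As before, `Tok` and the kind constants are hypothetical stand-ins for the generated JavaCC types, and `expressionFollowsReturn` is a name made up for this illustration. `prev` models `getToken(0)` (the `return` keyword) and `next` models `getToken(1)`.

```java
// Stand-alone model of the guard used in returnStatement(); not generated by JavaCC.
class RestrictedReturnDemo {
    // Hypothetical token kinds.
    static final int EOF = 0, SEMICOLON = 1, RBRACE = 2, RETURN = 3, IDENT = 4;

    // Hypothetical stand-in for JavaCC's Token.
    static final class Tok {
        final int kind, beginLine, endLine;
        Tok(int kind, int beginLine, int endLine) {
            this.kind = kind;
            this.beginLine = beginLine;
            this.endLine = endLine;
        }
    }

    // True when an Expression should be parsed after "return".
    static boolean expressionFollowsReturn(Tok prev, Tok next) {
        return next.kind != SEMICOLON
            && next.kind != RBRACE
            && next.kind != EOF
            && prev.endLine == next.beginLine;
    }

    public static void main(String[] args) {
        Tok ret = new Tok(RETURN, 1, 1);
        // "return a" on one line: parse the expression.
        System.out.println(expressionFollowsReturn(ret, new Tok(IDENT, 1, 1)));  // true
        // "return" then a line break then "a": restricted token, no expression.
        System.out.println(expressionFollowsReturn(ret, new Tok(IDENT, 2, 2)));  // false
        // "return }": no expression; the semicolon is inserted before "}".
        System.out.println(expressionFollowsReturn(ret, new Tok(RBRACE, 1, 1))); // false
    }
}
```

In the second case the optional Expression() is skipped, and PossiblyInsertedSemicolon() then takes its insertion branch because the next token is on another line, which gives the usual behaviour of a bare `return` followed by a newline.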
  • You are most probably right. I see something similar in [this grammar](https://code.google.com/p/yaji-ecmascript-interpreter/source/browse/trunk/yaji-ecmascript-interpreter/src/FESI/Parser/EcmaScript.jjt). Only the check for EOL is implemented differently (via special token). I'll try it out and report later on. – lexicore Nov 24 '14 at 08:17
  • I think when you try it you will find that there are a large number of ambiguity warnings. This would be annoying. However it is better to have a grammar that is simple and works than one that is free of warnings but incorrect or complex. – Theodore Norvell Nov 24 '14 at 13:35
  • Could you please explain, what you meant by *when you try **it***? Did you mean your approach or the one from [this grammar](https://code.google.com/p/yaji-ecmascript-interpreter/source/browse/trunk/yaji-ecmascript-interpreter/src/FESI/Parser/EcmaScript.jjt). [My own grammar](https://github.com/highsource/javascript-codemodel/blob/master/parser/src/main/javacc/ecmascript-262.jj) is warnings-free at the moment (thanks to you, by the way), but I'm not done yet. Automatic semicolon insertion is one remaining issue and regular expressions is the other – lexicore Nov 24 '14 at 14:56
  • Your [FAQ](http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-moz.htm) is a great help. I'd never got so far without it. – lexicore Nov 24 '14 at 14:58
  • What I meant is that if you take a warning free grammar that doesn't do ASI and replace the various uses of ";" with `PossiblyInsertedSemicolon()` as defined above, that will create syntactic ambiguity. E.g. the string "a - b;" can now be parsed (according to the grammar) as either one or two expression statements. That ambiguity will mean warnings. You will then have to look at the warnings and decide whether or not they are safe to ignore and/or suppress. I'm fairly certain they will be safe to ignore; i.e. the default choices will be the right ones. – Theodore Norvell Nov 25 '14 at 18:42
  • @TheodoreNorvell, I'm very new to language design and was wondering if you could please tell me if I'm interpreting rule 2 correctly. My reading of it is that if the lookahead token is an illegal EOF, then rule 2 kicks in and inserts a semicolon (i.e. terminates the current statement). This happens if the last statement in the program is an unterminated non-compound statement - we need to insert a semicolon after such a statement in order to make it syntactically valid so that the whole program can then be parsed as a valid instance of Script or Module. Is this correct? – user51462 Sep 22 '21 at 23:55
  • @user51462 I'm no expert on EcmaScript. What you say sounds right to me, except, I can't find the term "unterminated" in the 2022 standard (https://tc39.es/ecma262/#sec-automatic-semicolon-insertion). What it says is that certain statement kinds must be terminated with a semicolon; see section 12.9. As I understand it, rules 1, 2 and 3 apply only to these statement kinds. So rule 2 would only kick in when the last non-EOF token of a "file" belongs to such a statement (or would if a semicolon were inserted). – Theodore Norvell Sep 28 '21 at 12:49
  • @user51462 But rule 2 says that we should only do the semicolon insertion if without the semicolon the whole "goal nonterminal" wouldn't succeed. ("Goal symbol" means Script or Module, like you said.) For example, if the input script is `{console.log(1)` then without the semicolon it's not a script, so rule 2 applies and we get `{console.log(1);` which is also not a script, so now there is a syntax error. – Theodore Norvell Sep 28 '21 at 13:09
  • @user51462 Now you said that you are new to "language design". My advice, if you are designing a language, is to avoid the need for explicit (or implicit) terminators or separators. See the Turing language as an example of how to do this. – Theodore Norvell Sep 28 '21 at 13:13