
I am continuing to work on my JavaCC grammar for ECMAScript 5.1. It is going quite well; I think I've covered most of the expressions now.

I now have two questions, both related to automatic semicolon insertion (§7.9.1). This is one of them.

The specification defines the following production:

PostfixExpression :
    LeftHandSideExpression
    LeftHandSideExpression [no LineTerminator here] ++
    LeftHandSideExpression [no LineTerminator here] --

How can I implement a reliable "no LineTerminator here" check?

For the record, my LINE_TERMINATOR currently looks something like this:

SPECIAL_TOKEN :
{
    <LINE_TERMINATOR: <LF> | <CR> | <LS> | <PS> >
|   < #LF: "\n" > /* Line Feed */
|   < #CR: "\r" > /* Carriage Return */
|   < #LS: "\u2028" > /* Line separator */
|   < #PS: "\u2029" > /* Paragraph separator */
}

I have read about lexical states, but I am not sure this is the right direction. I've checked a few other JavaScript grammars, but did not find any similar rules there. (I actually feel like a total cargo-culter when I try to borrow something from these grammars.)

I'd be grateful for a pointer, a hint or just a keyword for the right search direction.

lexicore

2 Answers


I think for the "restricted productions" you can do this:

void PostfixExpression() : 
{} {
     LeftHandSideExpression() 
     (
         LOOKAHEAD( "++", {getToken(0).beginLine == getToken(1).beginLine})
         "++"
     |
         LOOKAHEAD( "--", {getToken(0).beginLine == getToken(1).beginLine})
         "--"
     |
         {}
     )
}
Theodore Norvell

Update: As Gunther pointed out, my original solution was not correct because of this paragraph in §7.4 of the spec:

Comments behave like white space and are discarded except that, if a MultiLineComment contains a line terminator character, then the entire comment is considered to be a LineTerminator for purposes of parsing by the syntactic grammar.

I'm posting a correction but leaving my original solution at the end of this answer.

Corrected solution

The core idea, as proposed by Theodore Norvell, is to use semantic lookahead. However, I decided to implement a safer check:

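// Assumption: StringUtils is org.apache.commons.lang3.StringUtils.

/**
 * Walks the chain of special tokens attached to the given token and reports
 * whether any of them is a line terminator, or a multi-line comment that
 * contains a line terminator character (see §7.4).
 */
public static boolean precededByLineTerminator(Token token) {
    for (Token specialToken = token.specialToken; specialToken != null; specialToken = specialToken.specialToken) {
        if (specialToken.kind == EcmaScriptParserConstants.LINE_TERMINATOR) {
            return true;
        } else if (specialToken.kind == EcmaScriptParserConstants.MULTI_LINE_COMMENT) {
            final String image = specialToken.image;
            // LF, CR, LS and PS are the ECMAScript line terminator characters.
            if (StringUtils.containsAny(image, (char)0x000A, (char)0x000D, (char)0x2028,
                    (char)0x2029)) {
                return true;
            }
        }
    }
    return false;
}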

And the grammar is:

expression = LeftHandSideExpression()
(
    LOOKAHEAD ( <INCR>, { !TokenUtils.precededByLineTerminator(getToken(1))} )
    <INCR>
    {
        return expression.postIncr();
    }
|   LOOKAHEAD ( <DECR>, { !TokenUtils.precededByLineTerminator(getToken(1))} )
    <DECR>
    {
        return expression.postDecr();
    }
) ?
{
    return expression;
}

So ++ or -- is accepted here if and only if it is not preceded by a line terminator, including one inside a multi-line comment.
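
To try the check in isolation, here is a minimal, self-contained sketch. `DemoToken` is a hypothetical stand-in for the `Token` class that JavaCC generates (only `kind`, `image` and `specialToken` are modelled), the kind constants are made-up values, `WHITE_SPACE` is assumed to be a special token as well, and the `StringUtils.containsAny` call is replaced by a plain loop so the example has no dependencies.

```java
final class LineTerminatorCheckDemo {

    // Hypothetical kind constants; a real parser takes them from its generated constants interface.
    static final int LINE_TERMINATOR = 1;
    static final int MULTI_LINE_COMMENT = 2;
    static final int WHITE_SPACE = 3;
    static final int INCR = 4;

    /** Hypothetical stand-in for the JavaCC-generated Token class. */
    static final class DemoToken {
        final int kind;
        final String image;
        DemoToken specialToken; // special token immediately before this one, or null

        DemoToken(int kind, String image) {
            this.kind = kind;
            this.image = image;
        }
    }

    /** True if the text contains LF, CR, LS (U+2028) or PS (U+2029). */
    static boolean containsLineTerminator(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n' || c == '\r' || c == (char) 0x2028 || c == (char) 0x2029) {
                return true;
            }
        }
        return false;
    }

    /** Same walk over the special-token chain as in the answer above. */
    static boolean precededByLineTerminator(DemoToken token) {
        for (DemoToken s = token.specialToken; s != null; s = s.specialToken) {
            if (s.kind == LINE_TERMINATOR) {
                return true;
            }
            if (s.kind == MULTI_LINE_COMMENT && containsLineTerminator(s.image)) {
                return true;
            }
        }
        return false;
    }

    /** Links the special tokens (given in source order) in front of a regular token. */
    static DemoToken withSpecials(DemoToken token, DemoToken... specials) {
        DemoToken previous = null;
        for (DemoToken special : specials) {
            special.specialToken = previous;
            previous = special;
        }
        token.specialToken = previous;
        return token;
    }

    public static void main(String[] args) {
        // a /* one <LF> two */ ++   : the comment counts as a LineTerminator
        DemoToken afterMultiLineComment = withSpecials(new DemoToken(INCR, "++"),
                new DemoToken(WHITE_SPACE, " "),
                new DemoToken(MULTI_LINE_COMMENT, "/* one\n   two */"));
        // a /* one */ ++            : no line terminator anywhere
        DemoToken afterInlineComment = withSpecials(new DemoToken(INCR, "++"),
                new DemoToken(MULTI_LINE_COMMENT, "/* one */"));
        // a <LF> ++                 : plain line terminator
        DemoToken afterNewline = withSpecials(new DemoToken(INCR, "++"),
                new DemoToken(LINE_TERMINATOR, "\n"));

        System.out.println(precededByLineTerminator(afterMultiLineComment)); // true
        System.out.println(precededByLineTerminator(afterInlineComment));    // false
        System.out.println(precededByLineTerminator(afterNewline));          // true
    }
}
```

The first case is the one the original solution missed: there is no LINE_TERMINATOR token in the chain, only a comment whose image contains a line feed.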


Original solution

This is not how I finally solved it.

The core idea, as proposed by Theodore Norvell, is to use semantic lookahead. However, I decided to implement a safer check:

public static boolean precededBySpecialTokenOfKind(Token token, int kind) {
    for (Token specialToken = token.specialToken; specialToken != null; specialToken = specialToken.specialToken) {
        if (specialToken.kind == kind) {
            return true;
        }
    }
    return false;
}

And the grammar is:

expression = LeftHandSideExpression()
(
    LOOKAHEAD ( <INCR>, { !TokenUtils.precededBySpecialTokenOfKind(getToken(1), LINE_TERMINATOR)} )
    <INCR>
    {
        return expression.postIncr();
    }
|   LOOKAHEAD ( <DECR>, { !TokenUtils.precededBySpecialTokenOfKind(getToken(1), LINE_TERMINATOR)} )
    <DECR>
    {
        return expression.postDecr();
    }
) ?
{
    return expression;
}

So ++ or -- is accepted here if and only if it is not preceded by a line terminator.

lexicore
  • Does it also handle the case where the line terminator is buried in a MultiLineComment? ECMA-262 specifies that it is disallowed in that case as well. – Gunther Mar 18 '15 at 10:45
  • @Gunther Good question. No, it probably does not. I'll check. – lexicore Mar 18 '15 at 12:02
  • I guess that the original proposal would have covered that case, wouldn't it? A while ago I built something similar, when adding support for automatic semicolon insertion to [REx parser generator](http://bottlecaps.de/rex). You can find it by looking at method followsLineTerminator() in a parser generated from [EcmaScript.ebnf](http://bottlecaps.de/rex/EcmaScript.ebnf) with `-asi` turned on – Gunther Mar 18 '15 at 17:24
  • @Gunther In principle, yes. However, it assumes that the parser's understanding of a newline is the same as in the ES grammar. I'm not 100% sure this is the case, hence this other solution. – lexicore Mar 19 '15 at 19:07
  • @Gunther You are right, I indeed had a bug there due to that part of the spec I was missing. Corrected now, please see the update. Thank you for your insight! – lexicore Mar 19 '15 at 08:04
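
The concern raised in the comments about line numbers can be illustrated with a small sketch. It assumes, purely for illustration, a position tracker that advances its line counter only on `\n` and `\r` (with `\r\n` counted once); whether a given JavaCC character stream behaves this way should be checked against the generated code. The names `lineOf` and `hasEsLineTerminatorBetween` are hypothetical. Under that assumption, a `++` separated from its operand by U+2028 still appears to be on the same line, although ECMAScript treats U+2028 as a LineTerminator.

```java
final class LineCountDemo {

    /** 1-based line of the character at offset, counting only LF, CR and CR LF as breaks (assumed behaviour). */
    static int lineOf(String source, int offset) {
        int line = 1;
        for (int i = 0; i < offset; i++) {
            char c = source.charAt(i);
            if (c == '\n') {
                line++;
            } else if (c == '\r') {
                // A CR directly followed by LF is counted once, by the LF branch.
                if (i + 1 >= source.length() || source.charAt(i + 1) != '\n') {
                    line++;
                }
            }
        }
        return line;
    }

    /** True if an ECMAScript line terminator (LF, CR, LS, PS) occurs in source[from, to). */
    static boolean hasEsLineTerminatorBetween(String source, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = source.charAt(i);
            if (c == '\n' || c == '\r' || c == (char) 0x2028 || c == (char) 0x2029) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "a", then LINE SEPARATOR (U+2028), then "++"
        String source = "a" + (char) 0x2028 + "++";
        int operandOffset = 0;
        int operatorOffset = 2;

        boolean sameLine = lineOf(source, operandOffset) == lineOf(source, operatorOffset);
        boolean terminatorBetween = hasEsLineTerminatorBetween(source, operandOffset + 1, operatorOffset);

        System.out.println("same line by counter:      " + sameLine);
        System.out.println("ES line terminator between: " + terminatorBetween);
    }
}
```

If both values come out true for such input, a pure line-number comparison would accept `++` where the spec forbids it, which is the motivation for inspecting the special tokens directly.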