I am currently developing a JavaScript/ECMAScript 5.1 parser with JavaCC. RegularExpressionLiteral and Automatic Semicolon Insertion are two things which make me crazy in ECMAScript grammar. This question and an answers were invaluable for the regex question. In this answer I'd like to put my own findings together.
TL;DR In JavaCC, use lexical states and switch them from the parser.
Very important is what Thom Blake wrote:
The division operator must follow an expression, and a regular
expression literal can't follow an expression, so in all other cases
you can safely assume you're looking at a regular expression literal.
So you actually need to understand if it was an expression or not before. This is trivial in the parser but very hard in the lexer.
As Thom pointed out, in many (but, unfortunately, not all) cases you can understand if it was an expression by "looking" at the last token. You have to consider punctuators as well as keywords.
Let's start with keywords. The following keywords cannot precede a DivPunctuator
(for example, you cannot have case /5
), so if you see a /
after these, you have a RegularExpressionLiteral
:
case
delete
do
else
in
instanceof
new
return
throw
typeof
void
Next, punctuators. The following punctuators cannot precede a DivPunctuator
(ex. in { /a...
the symbol /
can never start a division):
{ ( [
. ; , < > <=
>= == != === !==
+ - * %
<< >> >>> & | ^
! ~ && || ? :
= += -= *= %= <<=
>>= >>>= &= |= ^=
/=
So if you have one of these and see /...
after this, then this can never be a DivPunctuator
and therefore must be a RegularExpressionLiteral
.
Next, if you have:
/
And /...
after that it also must be a RegularExpressionLiteral
. If there were no space between these slashes (i.e. // ...
), this must have handled as a SingleLineComment
("maximal munch").
Next, the following punctuator may only end an expression:
]
So the following /
must start a DivPunctuator
.
Now we have the following remaining cases which are, unfortunately, ambiguous:
}
)
++
--
For }
and )
you have to know if they end an expression or not, for ++
and --
- they end an PostfixExpression
or start an UnaryExpression
.
And I have come to the conclusion that it is very hard (if not impossible) to find out in the lexer. To give you a sense of that, a couple of examples.
In this example:
{}/a/g
/a/g
is a RegularExpressionLiteral
, but in this one:
+{}/a/g
/a/g
is a division.
In case of )
you can have a division:
('a')/a/g
as well as a RegularExpressionLiteral
:
if ('a')/a/g
So, unfortunately, it looks like you can't solve it with the lexer alone. Or you'll have to bring in so much grammar into the lexer so it's no lexer anymore.
This is a problem.
Now, a possible solution, which is, in my case JavaCC-based.
I am not sure if you have similar features in other parser generators, but JavaCC has a lexical states feature which can be used to switch between "we expect a DivPunctuator
" and "we expect a RegularExpressionLiteral
" states. For instance, in this grammar the NOREGEXP
state means "we don't expect a RegularExpressionLiteral
here".
This solves part of the problem, but not the ambiguous )
, }
, ++
and --
.
For this, you'll need to be able to switch lexical states from the parser. This is possible, see the following question in JavaCC FAQ:
Can the parser force a switch to a new lexical state?
Yes, but it is very easy to create bugs by doing so.
A lookahead parser may have already gone too far in the token stream (i.e. already read /
as a DIV
or vice versa).
Fortunately there seems to be a way to make switching lexical states a bit safer:
Is there a way to make SwitchTo safer?
The idea is to make a "backup" token stream and push tokens read during lookahead back again.
I think that this should work for }
, )
, ++
, --
as they are normally found in LOOKAHEAD(1) situations, but I am not 100% sure of that. In the worst case the lexer may have already tried to parse /
-starting token as a RegularExpressionLiteral
and failed as it was not terminated by another /
.
In any case, I see no better way of doing that. The next good thing would be probably to drop the case altogether (like JSLint
and many others did), document and just not parse these types of expressions. {}/a/g
does not make much sense anyway.