When parsing Javascript, what determines the meaning of a slash?

Question

Javascript has a tricky grammar to parse. Forward-slashes can mean a number of different things: division operator, regular expression literal, comment introducer, or line-comment introducer. The last two are easy to distinguish: if the slash is followed by a star, it starts a multiline comment. If the slash is followed by another slash, it is a line-comment.

But the rules for disambiguating division and regex literal are escaping me. I can't find it in the ECMAScript standard. There the lexical grammar is explicitly divided into two parts, InputElementDiv and InputElementRegExp, depending on what a slash will mean. But there's nothing explaining when to use which.

And of course the dreaded semicolon insertion rules complicate everything.

Does anyone have an example of clear code for lexing Javascript that has the answer?

It seems to me, from reading the spec, that the *parser* needs to know what sort of token to go fetch. That seems like a horrible grammar feature, but whatever. It seems awful clumsy, too, because while parsing an expression the grammar has to try one of those two, *and* the more "generic" request for another "ordinary" token. Ick. If I were faced with that I think I'd go back and fix the grammar :-) — Pointy, Apr 01 '11 at 23:05
@Pointy From my understanding, the parser tries both tokens and since there are no contexts where both are valid anyway, it uses the one that is valid in the given context. — Šime Vidas, Apr 01 '11 at 23:11
My understanding about javascript is that you can't write a lexer without also writing a parser, which is unlike many other languages. — MarkPflug, Apr 01 '11 at 23:14
Hmm. I just can't imagine having a lexer work that way, but I'm pretty simple-minded. In my (tiny) world, there's a one-way flow from the lexer to the parser. With this setup, the lexer really doesn't know what it's supposed to do. When one is valid, attempting the other will almost certainly produce an error (particularly since the regex grammar can send the lexer screaming through a lot of input text needlessly). — Pointy, Apr 01 '11 at 23:15
http://www.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions — MarkPflug, Apr 01 '11 at 23:22
None of the current answers address `await` before a slash, which can be [ambiguous](https://stackoverflow.com/questions/55934490/why-are-await-and-async-valid-variable-names/55934491#55934491) - if inside an `async` function, the `/` will be parsed as a regex, otherwise, the `await` will be parsed as a variable name, and hence the `/` will be parsed as division. — CertainPerformance, May 02 '19 at 09:08

Tamzin Blake · Answer 1 · 2014-11-24T19:24:48.907

It's actually fairly easy, but it requires making your lexer a little smarter than usual.

The division operator must follow an expression, and a regular expression literal can't follow an expression, so in all other cases you can safely assume you're looking at a regular expression literal.

You already have to identify Punctuators as multiple-character strings, if you're doing it right. So look at the previous token, and see if it's any of these:

. ( , { } [ ; < > <= >= == != === !== + - * % ++ --
<< >> >>> & | ^ ! ~ && || ? : = += -= *= %= <<= >>= >>>=
&= |= ^= / /=

For most of these, you now know you're in a context where you can find a regular expression literal. Now, in the case of ++ --, you'll need to do some extra work. If the ++ or -- is a pre-increment/decrement, then the / following it starts a regular expression literal; if it is a post-increment/decrement, then the / following it starts a DivPunctuator.

Fortunately, you can determine whether it is a "pre-" operator by checking its previous token. First, post-increment/decrement is a restricted production, so if ++ or -- is preceded by a linebreak, then you know it is "pre-". Otherwise, if the previous token is any of the things that can precede a regular expression literal (yay recursion!), then you know it is "pre-". In all other cases, it is "post-".

Of course, the ) punctuator doesn't always indicate the end of an expression - for example if (something) /regex/.exec(x). This is tricky because it does require some semantic understanding to disentangle.

Sadly, that's not quite all. There are some operators that are not Punctuators, and other notable keywords to boot. Regular expression literals can also follow these. They are:

new delete void typeof instanceof in do return case throw else

If the IdentifierName you just consumed is one of these, then you're looking at a regular expression literal; otherwise, it's a DivPunctuator.
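The rules so far can be sketched as a small predicate. This is only an illustration of the heuristic described above, not a complete lexer: the token shape (`value`, `newlineBefore`) and the function name are made up for the example, and treating the start of input as a regex context is an assumption.

```javascript
// Punctuators after which this answer expects a regular expression literal.
// "++" and "--" are left out here because they need the extra check below.
const REGEX_PUNCTUATORS = new Set([
  ".", "(", ",", "{", "}", "[", ";", "<", ">", "<=", ">=", "==", "!=",
  "===", "!==", "+", "-", "*", "%", "<<", ">>", ">>>", "&", "|", "^",
  "!", "~", "&&", "||", "?", ":", "=", "+=", "-=", "*=", "%=", "<<=",
  ">>=", ">>>=", "&=", "|=", "^=", "/", "/=",
]);

// Keywords after which this answer expects a regular expression literal.
const REGEX_KEYWORDS = new Set([
  "new", "delete", "void", "typeof", "instanceof", "in", "do", "return",
  "case", "throw", "else",
]);

// tokens: array of { value: string, newlineBefore: boolean } (hypothetical shape).
// Returns true when a "/" that follows tokens[index] should start a regex.
function regexMayFollow(tokens, index) {
  if (index < 0) return true; // assumption: nothing to divide at start of input
  const tok = tokens[index];
  if (tok.value === "++" || tok.value === "--") {
    // Restricted production: a line break before ++/-- rules out postfix.
    if (tok.newlineBefore) return true;
    // It is a prefix operator exactly when the token before it could
    // itself precede a regular expression literal.
    return regexMayFollow(tokens, index - 1);
  }
  return REGEX_PUNCTUATORS.has(tok.value) || REGEX_KEYWORDS.has(tok.value);
}

const tok = (value, newlineBefore = false) => ({ value, newlineBefore });
console.log(regexMayFollow([tok("a"), tok("=")], 1));  // true
console.log(regexMayFollow([tok("a"), tok("++")], 1)); // false
console.log(regexMayFollow([tok("="), tok("++")], 1)); // true
```

Note that `)` is deliberately absent from the sets, so this sketch answers "division" after every closing parenthesis, including the `if (something) /regex/.exec(x)` case mentioned above.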

The above is based on the ECMAScript 5.1 specification and does not include any browser-specific extensions to the language. If you need to support those, it should still give you straightforward guidelines for determining which sort of context you're in.

Of course, most of the above represent very silly cases for including a regular expression literal. For example, you can't actually pre-increment a regular expression, even though it is syntactically allowed. So most tools can get away with simplifying the regular expression context checking for real-world applications. JSLint's method of checking the preceding character for (,=:[!&|?{}; is probably sufficient. But if you take such a shortcut when developing what's supposed to be a tool for lexing JS, then you should make sure to note that.
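For comparison, the character-based shortcut might look like the sketch below. It is not JSLint's actual code; the function name is invented, and skipping whitespace and treating the start of input as a regex context are assumptions of this example.

```javascript
// The characters mentioned above as the shortcut's "regex may follow" set.
const REGEX_PRECEDERS = "(,=:[!&|?{};";

// Returns true when the "/" at source[slashIndex] would be treated as the
// start of a regular expression literal under the shortcut.
function slashLooksLikeRegex(source, slashIndex) {
  let i = slashIndex - 1;
  // Assumption: skip whitespace to find the preceding significant character.
  while (i >= 0 && /\s/.test(source[i])) i--;
  if (i < 0) return true; // assumption: start of input counts as regex context
  return REGEX_PRECEDERS.includes(source[i]);
}

console.log(slashLooksLikeRegex("x = /ab/", 4));     // true
console.log(slashLooksLikeRegex("a / b", 2));        // false
console.log(slashLooksLikeRegex("return /foo/", 7)); // false
```

The last call shows the cost of the shortcut: a regex after a keyword such as `return` is classified as division.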

This approach works for most realistic code, but will not lex this example correctly: `if (something) /regex/.exec(x);` — JacquesB, Nov 28 '12 at 13:59
@JacquesB `exec` has no side-effects. Is there a realistic example of having a regex start a statement? — John Dvorak, Apr 16 '14 at 08:02
Actually exec() does have side effects: It updates some properties on the RegExp constructor, eg. RegExp.$1 - It is probably not very common, but it is possible to write meaningful code which uses exec() like that. — JacquesB, Apr 16 '14 at 15:21
@JanDvorak Also the answer should work for all syntactically valid code regardless of realisticity. — Tamzin Blake, Apr 16 '14 at 18:36
`new delete void typeof instanceof in do return case throw` - shouldn't `else` be included as well? `if (true) {} else /regex/;` — lexicore, Nov 24 '14 at 07:30
@lexicore Yeah probably. At this point I'm fairly convinced this method can't be fixed though. — Tamzin Blake, Nov 24 '14 at 19:24
@ThomBlake It was still worth it. I'll add my answer (JavaCC-based) later on. You gave very good hints. — lexicore, Nov 24 '14 at 19:30
A couple of those are *realistic* keywords that may well precede a regular expression in *serious* code, which the JSHint method chokes on: `return /foo/.exec(...`, `typeof /foo/.exec(...`. It's a quick, neat method, but probably shouldn't be relied on. — CertainPerformance, May 02 '19 at 08:56
See the other answer - `}` *may be followed* by either a regular expression *or* division. It's not unambiguous, could you edit your answer? — CertainPerformance, May 02 '19 at 09:00
I believe `of` may also come before a regex: `for (const a of /foo/.exec('foo')) {` — CertainPerformance, May 03 '19 at 05:56

score 15 · Answer 2 · edited Oct 19 '17 at 13:42

I am currently developing a JavaScript/ECMAScript 5.1 parser with JavaCC. RegularExpressionLiteral and Automatic Semicolon Insertion are the two things that drive me crazy in the ECMAScript grammar. This question and its answers were invaluable for the regex part. In this answer I'd like to put my own findings together.

TL;DR In JavaCC, use lexical states and switch them from the parser.

Very important is what Thom Blake wrote:

The division operator must follow an expression, and a regular expression literal can't follow an expression, so in all other cases you can safely assume you're looking at a regular expression literal.

So you actually need to understand if it was an expression or not before. This is trivial in the parser but very hard in the lexer.

As Thom pointed out, in many (but, unfortunately, not all) cases you can understand if it was an expression by "looking" at the last token. You have to consider punctuators as well as keywords.

Let's start with keywords. The following keywords cannot precede a DivPunctuator (for example, you cannot have case /5), so if you see a / after these, you have a RegularExpressionLiteral:

case
delete
do
else
in
instanceof
new
return
throw
typeof
void

Next, punctuators. The following punctuators cannot precede a DivPunctuator (ex. in { /a... the symbol / can never start a division):

{       (       [   
.   ;   ,   <   >   <=
>=  ==  !=  === !== 
+   -   *   %       
<<  >>  >>> &   |   ^
!   ~   &&  ||  ?   :
=   +=  -=  *=  %=  <<=
>>= >>>=    &=  |=  ^=
    /=

So if you have one of these and see /... after this, then this can never be a DivPunctuator and therefore must be a RegularExpressionLiteral.

Next, if you have:

/

And /... after that, it must also be a RegularExpressionLiteral. If there were no space between these slashes (i.e. // ...), it would have been handled as a SingleLineComment ("maximal munch").

Next, the following punctuator may only end an expression:

]

So the following / must start a DivPunctuator.

Now we have the following remaining cases which are, unfortunately, ambiguous:

}
)
++
--

For } and ) you have to know whether they end an expression or not. For ++ and -- you have to know whether they end a PostfixExpression or start a UnaryExpression.
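The three groups above can be summarized as a three-way classification. This is a rough sketch with invented names; it assumes that `]` is the punctuator that can only end an expression, and that identifiers and literals also end an expression and are therefore followed by division.

```javascript
// Tokens after which a "/" must start a RegularExpressionLiteral.
const REGEX_AFTER = new Set([
  // keywords
  "case", "delete", "do", "else", "in", "instanceof", "new", "return",
  "throw", "typeof", "void",
  // punctuators
  "{", "(", "[", ".", ";", ",", "<", ">", "<=", ">=", "==", "!=", "===",
  "!==", "+", "-", "*", "%", "<<", ">>", ">>>", "&", "|", "^", "!", "~",
  "&&", "||", "?", ":", "=", "+=", "-=", "*=", "%=", "<<=", ">>=", ">>>=",
  "&=", "|=", "^=", "/=", "/",
]);

// Tokens for which the previous token alone cannot decide.
const AMBIGUOUS_AFTER = new Set(["}", ")", "++", "--"]);

// prevToken: text of the previous token, or null at the start of input.
// Returns "regex", "div", or "ambiguous".
function classifySlashAfter(prevToken) {
  if (prevToken === null) return "regex"; // assumption: start of input
  if (REGEX_AFTER.has(prevToken)) return "regex";
  if (AMBIGUOUS_AFTER.has(prevToken)) return "ambiguous";
  // "]" and, by assumption, identifiers and literals end an expression.
  return "div";
}

console.log(classifySlashAfter("return")); // regex
console.log(classifySlashAfter("]"));      // div
console.log(classifySlashAfter(")"));      // ambiguous
```

A lexer built on this would still need help from the parser whenever the result is "ambiguous".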

And I have come to the conclusion that it is very hard (if not impossible) to find out in the lexer. To give you a sense of that, a couple of examples.

In this example:

{}/a/g

/a/g is a RegularExpressionLiteral, but in this one:

+{}/a/g

/a/g is a division.

In case of ) you can have a division:

('a')/a/g

as well as a RegularExpressionLiteral:

if ('a')/a/g

So, unfortunately, it looks like you can't solve it with the lexer alone. Or you'll have to bring in so much grammar into the lexer so it's no lexer anymore.
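One way to check the four examples above is to hand them to a JavaScript engine and look at what comes back. The sketch below relies on `eval` returning the completion value of the last statement it ran; the values given to `a` and `g` are arbitrary and only exist so that the division cases do not throw a ReferenceError.

```javascript
// "{}" is parsed as an empty block, so "/a/g" is a regex statement.
const blockThenRegex = eval("{}/a/g");

// "+{}" forces an expression, so the slashes are divisions: NaN / 2 / 4.
const objectDivided = eval("var a = 2, g = 4; +{}/a/g");

// A parenthesized expression followed by divisions: 'a' / 2 / 4.
const parenDivided = eval("var a = 2, g = 4; ('a')/a/g");

// Here ")" closes the if condition, so "/a/g" is a regex statement.
const ifThenRegex = eval("if ('a')/a/g");

console.log(blockThenRegex instanceof RegExp); // true
console.log(Number.isNaN(objectDivided));      // true
console.log(Number.isNaN(parenDivided));       // true
console.log(ifThenRegex instanceof RegExp);    // true
```

The engine gets this right only because its parser knows whether it is at the start of a statement or in the middle of an expression, which is exactly the information a standalone lexer lacks.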

This is a problem.

Now, a possible solution, which is, in my case JavaCC-based.

I am not sure if you have similar features in other parser generators, but JavaCC has a lexical states feature which can be used to switch between "we expect a DivPunctuator" and "we expect a RegularExpressionLiteral" states. For instance, in this grammar the NOREGEXP state means "we don't expect a RegularExpressionLiteral here".

This solves part of the problem, but not the ambiguous ), }, ++ and --.

For this, you'll need to be able to switch lexical states from the parser. This is possible, see the following question in JavaCC FAQ:

Can the parser force a switch to a new lexical state?

Yes, but it is very easy to create bugs by doing so.

A lookahead parser may have already gone too far in the token stream (i.e. already read / as a DIV or vice versa).

Fortunately there seems to be a way to make switching lexical states a bit safer:

Is there a way to make SwitchTo safer?

The idea is to make a "backup" token stream and push tokens read during lookahead back again.

I think that this should work for }, ), ++, -- as they are normally found in LOOKAHEAD(1) situations, but I am not 100% sure of that. In the worst case the lexer may have already tried to parse a token starting with / as a RegularExpressionLiteral and failed because it was not terminated by another /.

In any case, I see no better way of doing that. The next best thing would probably be to drop these cases altogether (as JSLint and many others did), document the limitation, and simply not parse these types of expressions. {}/a/g does not make much sense anyway.
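Outside JavaCC, the same idea of letting the parser choose the lexical state can be sketched in plain JavaScript by passing a flag with every token request. This is a toy illustration rather than the JavaCC mechanism: the function names and token shapes are invented, comments are assumed to have been handled already, and only slashes, simple words, and single-character punctuators are recognized.

```javascript
// Scans a regular expression literal starting at src[start] === "/".
// Returns the index just past the flags, or -1 if it is unterminated.
function scanRegex(src, start) {
  let i = start + 1;
  let inClass = false; // inside [...] a "/" does not end the literal
  while (i < src.length) {
    const ch = src[i];
    if (ch === "\n") return -1;
    if (ch === "\\") {
      if (src[i + 1] === "\n") return -1; // a line break cannot be escaped
      i += 2;
      continue;
    }
    if (ch === "[") inClass = true;
    else if (ch === "]") inClass = false;
    else if (ch === "/" && !inClass) {
      i++;
      while (i < src.length && /[A-Za-z0-9_$]/.test(src[i])) i++; // flags
      return i;
    }
    i++;
  }
  return -1;
}

// Reads one token at pos. The caller (the parser) picks the lexical state:
// regexAllowed === true means "expect a RegularExpressionLiteral",
// false means "expect a DivPunctuator".
function readToken(src, pos, regexAllowed) {
  while (pos < src.length && /\s/.test(src[pos])) pos++;
  if (pos >= src.length) return { type: "eof", text: "", end: pos };
  if (src[pos] === "/") {
    if (regexAllowed) {
      const end = scanRegex(src, pos);
      if (end === -1) throw new SyntaxError("unterminated regular expression");
      return { type: "regex", text: src.slice(pos, end), end };
    }
    const end = src[pos + 1] === "=" ? pos + 2 : pos + 1;
    return { type: "div", text: src.slice(pos, end), end };
  }
  const word = /^[A-Za-z_$][A-Za-z0-9_$]*|^[0-9]+/.exec(src.slice(pos));
  if (word) return { type: "word", text: word[0], end: pos + word[0].length };
  return { type: "punct", text: src[pos], end: pos + 1 };
}

console.log(readToken("/a/g", 0, true).text);  // "/a/g"
console.log(readToken("/a/g", 0, false).text); // "/"
```

Because the flag is supplied per request, the lexer never has to guess. The price is the one described above: a parser that reads tokens ahead of time may already have asked with the wrong flag.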

This is an awesome answer. Regarding the last paragraph, the other option is to just lex and parse at the same time, which is the standard these days. — Tamzin Blake, Nov 25 '14 at 22:43
@ThomBlake Thank you. lex and parse at the same time - do you maybe have a hint for me, what could I use for Java? Right now I'm on JavaCC. I'm novice in the field so would be grateful for a pointer. Thank you. — lexicore, Nov 25 '14 at 22:47
I know roughly 0 about Java, and most parsers I've written I've done by hand. If it helps, Rhino is in Java and you could probably borrow some code. — Tamzin Blake, Nov 26 '14 at 18:24
It's not hard, `{} /a/g`. `/a/g` is a regexp just because it's on the statements context, where `{}` is a block statement. If the parser was assuming it was an expression context, then it'd be a division. Though I got some ideas from your answer, yea — , Oct 10 '17 at 10:26
For newer ECMAScript versions, you might want to look behind for `await`, `default` (as in `export default /a/g`), `extends`, `yield`, `...`, `??` and `?.` as well. — lydell, Jan 29 '20 at 06:41
Also, regular expressions are allowed in template interpolations. ``` `${/a/}` ``` — lydell, Mar 28 '20 at 21:25

score 5 · Answer 3 · edited Feb 07 '22 at 13:00

JSLint appears to expect a regular expression if the preceding token is one of

(,=:[!&|?{};

Rhino always returns a DIV (slash) token from the lexer.

answered Apr 04 '11 at 08:44 by Vinay Sajip · edited Feb 07 '22 at 13:00 by Gurwinder Singh

score 4 · Answer 4 · edited Dec 10 '13 at 18:02

You can only know how to interpret the / by also implementing a syntax parser. Whichever lex path arrives at a valid parse determines how to interpret the character. Apparently, this is something they had considered fixing, but didn't. More reading here: http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions

answered Apr 01 '11 at 23:24 by MarkPflug · edited Dec 10 '13 at 18:02 by Matt Bierner

There's a fairly straightforward rule in that page, using the previous token to determine the meaning of the slash. But it's a js 2.0 rule, so it doesn't apply to current code? — Ned Batchelder, Apr 02 '11 at 00:59

Jason S · Answer 5 · 2011-04-02T13:01:11.833

See section 7:

There are two goal symbols for the lexical grammar. The InputElementDiv symbol is used in those syntactic grammar contexts where a leading division (/) or division-assignment (/=) operator is permitted. The InputElementRegExp symbol is used in other syntactic grammar contexts.

NOTE There are no syntactic grammar contexts where both a leading division or division-assignment, and a leading RegularExpressionLiteral are permitted. This is not affected by semicolon insertion (see 7.9); in examples such as the following:
a = b 
/hi/g.exec(c).map(d); 
where the first non-whitespace, non-comment character after a LineTerminator is slash (/) and the syntactic context allows division or division-assignment, no semicolon is inserted at the LineTerminator. That is, the above example is interpreted in the same way as:
a = b / hi / g.exec(c).map(d); 
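The spec's example can be run directly to confirm that no semicolon is inserted. In this sketch the values of `b`, `hi`, `c`, and `d` are arbitrary, and `g` is a stand-in object (not a real regex) whose `exec(...).map(...)` chain returns a number so that the division is visible.

```javascript
const b = 12;
const hi = 3;
const c = "ignored";
const d = (x) => x;
// Stand-in: g.exec(c).map(d) evaluates to 2.
const g = { exec: () => ({ map: () => 2 }) };

let a;
a = b
/hi/g.exec(c).map(d);

// Parsed as a = b / hi / g.exec(c).map(d), i.e. 12 / 3 / 2.
console.log(a); // 2
```

If a semicolon had been inserted after `a = b`, `a` would be 12 and the second line would be a regex literal `/hi/g` with `exec` called on it.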

I agree, it's confusing and there should be one top-level grammar expression rather than two.

edit:

But there's nothing explaining when to use which.

Maybe the simple answer is staring us in the face: try one and then try the other. Since they are not both permitted, at most one will yield an error-free match.

From OP's question: *"But there's nothing explaining when to use which."* - I think this is the main issue of this question. Could you address this? — Šime Vidas, Apr 01 '11 at 22:52
Although your quote does state that there are no contexts where both are allowed... — Šime Vidas, Apr 01 '11 at 23:06
I read this part. It says there is no overlap, but it doesn't say when to choose one over the other. — Ned Batchelder, Apr 02 '11 at 00:55
