javascript parse/identify a regular expression from beginning to end

Question

I am writing a tiny JavaScript parser in JavaScript. I am at the tokenization stage.

I would like to know how to recognize when a regular expression begins and ends.

For comparison, if I had asked how to recognize where a string begins and ends, the answer would be:

A string that begins with a double quote " ends at the next double quote " that is not preceded by a backslash \.
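To make the string rule concrete, here is a minimal sketch of it. The function name `scanString` and its return convention (index just past the closing quote, or -1) are made up for illustration. It skips whatever character follows a backslash instead of looking backwards, which is one way to handle a string that ends in an escaped backslash such as `"a\\"`.

```javascript
// Hypothetical helper: scan a quoted string that starts at src[start].
// Returns the index just past the closing quote, or -1 if unterminated.
function scanString(src, start) {
  const quote = src[start];
  if (quote !== '"' && quote !== "'") return -1;
  let i = start + 1;
  while (i < src.length) {
    const ch = src[i];
    if (ch === "\\") {
      i += 2; // skip the backslash and the character it escapes
      continue;
    }
    if (ch === quote) return i + 1; // unescaped closing quote
    if (ch === "\n" || ch === "\r") return -1; // raw newline ends the line
    i += 1;
  }
  return -1; // ran off the end of the input
}
```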

Any help appreciated.

The `/` character must be quoted in a regular expression (`\/`) so it's like strings in that respect. — Pointy, Sep 18 '12 at 20:26
@Pointy, it's not that easy. The `/` could be either the start of a regex literal, or the division operator. — Bart Kiers, Sep 18 '12 at 20:28
possible duplicate of [Division/RegExp conflict while tokenizing Javascript](http://stackoverflow.com/questions/4726295/division-regexp-conflict-while-tokenizing-javascript) — Bart Kiers, Sep 18 '12 at 20:28
Also see this Q&A: http://stackoverflow.com/questions/5519596/when-parsing-javascript-what-determines-the-meaning-of-a-slash — Bart Kiers, Sep 18 '12 at 20:30
@BartKiers oh yes I know that - I almost added another comment. The point is that once you already know you're looking at a regex, you can find the end pretty easily. — Pointy, Sep 18 '12 at 20:31
@Pointy, sure, but the question is *"when a regular expression **begins** and ends"* :) — Bart Kiers, Sep 18 '12 at 20:32
@BartKiers ah you're right; I was focusing on the part about the closing quotes. Anyway I consider this a fascinating problem; the standard way that it's handled for JavaScript seems awful. Pascal has a minor lexer-level ambiguity too: arrays are declared with `1..10`, but that can be solved by introducing an "integer-dot-dot" token and (slightly) amending the grammar. Seems like that'd be somewhat harder for JavaScript. — Pointy, Sep 18 '12 at 20:35
@BartKiers (now that I think of it that's not really an ambiguity in Pascal; it's just a lexer look-ahead issue.) — Pointy, Sep 18 '12 at 20:37
@Pointy and @BartKiers, thanks a lot for your help. I did not realize this was such a tricky problem. — Zo72, Sep 18 '12 at 20:44

score 2 · Accepted Answer · answered Sep 18 '12 at 20:32


The ECMAScript language specification contains a full grammar for the language in Annex A, written in the specification's own BNF-like notation. It's too large to reproduce here in its entirety, but the production for regular expression literals is `RegularExpressionLiteral`, which is part of the lexical grammar.
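As a rough illustration of what that production implies for finding the end of a literal, here is a minimal sketch. It assumes the tokenizer has already decided that the `/` at `start` opens a regex rather than a division. The names `scanRegexLiteral` and `isLineTerminator` are hypothetical, and the flag check is an ASCII-only approximation of the identifier characters the grammar allows.

```javascript
// Line terminators may not appear anywhere inside a regex literal.
function isLineTerminator(ch) {
  return ch === "\n" || ch === "\r" || ch === "\u2028" || ch === "\u2029";
}

// Hypothetical helper: scan a regex literal known to start at src[start].
// Returns { body, flags, end } or null if no valid literal is found.
function scanRegexLiteral(src, start) {
  if (src[start] !== "/") return null;
  const first = src[start + 1];
  // "//" is a line comment and "/*" opens a block comment, not a regex.
  if (first === "/" || first === "*") return null;

  let inClass = false; // inside [...], an unescaped "/" does not terminate
  let i = start + 1;
  while (i < src.length) {
    const ch = src[i];
    if (isLineTerminator(ch)) return null;
    if (ch === "\\") {
      // A backslash escapes the next character, which must exist
      // and must not be a line terminator.
      if (i + 1 >= src.length || isLineTerminator(src[i + 1])) return null;
      i += 2;
      continue;
    }
    if (ch === "[") inClass = true;
    else if (ch === "]") inClass = false;
    else if (ch === "/" && !inClass) break; // closing slash
    i += 1;
  }
  if (i >= src.length) return null; // unterminated

  // Flags: approximated here as ASCII identifier characters.
  let j = i + 1;
  while (j < src.length && /[A-Za-z0-9_$]/.test(src[j])) j += 1;
  return { body: src.slice(start + 1, i), flags: src.slice(i + 1, j), end: j };
}
```

For example, `scanRegexLiteral("/ab+c/gi", 0)` yields the body `ab+c`, the flags `gi`, and an end index of 8. The sketch does not validate the pattern itself or the flag letters.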


Too hard for what? It's the official definition of the Javascript language, so it's what you'll need to work with to correctly parse Javascript. Anything else **will** fail or give incorrect results in some circumstances. – Sep 18 '12 at 23:23

score -2 · Answer 2 · answered Sep 18 '12 at 20:29


"In JavaScript source code, a regular expression is written in the form of /pattern/modifiers where "pattern" is the regular expression itself, and "modifiers" are a series of characters indicating various options. The "modifiers" part is optional." JavaScript RegExp Object


Pablo


That does not explain how to make a distinction between the division operator and the start of a regex literal. – Bart Kiers Sep 18 '12 at 20:31
I followed the example of how to determine the start and end of a string, and in the question he didn't mention he wanted to know the distinction between the division operator and the start of a regex literal. From his example, an object property could be misinterpreted as a string {"foo": "bar"}, where "foo" is not really a string. I don't think it's a good reason to downvote. – Pablo Sep 18 '12 at 20:44
@Pablo Martinez I did not downvote you. I would not vote yours as a right answer either – Zo72 Sep 18 '12 at 20:46
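The division-versus-regex distinction raised in the comments is often approximated by looking at the previous significant token. The sketch below shows that heuristic only; it is not the method the specification defines, which lets the parser's grammatical context choose between the `InputElementDiv` and `InputElementRegExp` goal symbols. The token shape `{ type, value }` and the name `slashStartsRegex` are assumptions made for illustration.

```javascript
// Keywords after which an expression (and so a regex literal) may begin.
const KEYWORDS_BEFORE_REGEX = new Set([
  "return", "typeof", "instanceof", "in", "new", "delete",
  "void", "throw", "case", "do", "else",
]);

// Hypothetical helper: given the previous significant token (or null at
// the start of input), guess whether a "/" begins a regex literal.
function slashStartsRegex(prev) {
  if (prev === null) return true; // nothing to divide yet
  switch (prev.type) {
    case "number":
    case "string":
    case "regex":
    case "identifier":
      return false; // a value just ended, so "/" is division
    case "keyword":
      // e.g. `return /re/` is a regex, but `this / 2` is division
      return KEYWORDS_BEFORE_REGEX.has(prev.value);
    case "punctuator":
      // After a closing bracket this guesses division, which is wrong
      // for cases like `if (x) /re/.test(s)` or a regex after a block.
      return !(prev.value === ")" || prev.value === "]" || prev.value === "}");
    default:
      return true;
  }
}
```

Besides the closing-bracket cases noted in the comments, this guess is also wrong after a postfix `++` or `--`, as in `a++ / 2`. Getting those right requires parser context, which is why the answers above point to the specification's grammar.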
