1

I am writing a tiny JavaScript parser in JavaScript, and I am at the tokenization stage.

I would like to know how to recognize when a regular expression begins and ends.

For example, if I had asked the same question about strings, the answer would be:

A string that begins with a double quote " ends at the next double quote ", unless that quote is preceded by a backslash \.
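The string rule can be sketched as a small scanner. This is only an illustration: `scanDoubleQuotedString` is a hypothetical helper name, not part of any library. It skips escape sequences in pairs rather than looking at the preceding character, because a quote preceded by an escaped backslash (`\\"`) does end the string.

```javascript
// Hypothetical helper: scan a double-quoted string literal starting at src[start].
// Returns the index just past the closing quote, or -1 if the literal is unterminated.
function scanDoubleQuotedString(src, start) {
  if (src[start] !== '"') return -1;
  let i = start + 1;
  while (i < src.length) {
    const ch = src[i];
    if (ch === "\\") {
      i += 2; // skip the backslash and the character it escapes
      continue;
    }
    if (ch === '"') return i + 1; // unescaped quote closes the string
    if (ch === "\n" || ch === "\r") return -1; // raw line breaks are not allowed
    i += 1;
  }
  return -1; // ran out of input before the closing quote
}
```

A tokenizer would call this when it sees `"` and take `src.slice(start, end)` as the token text.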

Any help is appreciated.

Zo72
  • The `/` character must be quoted in a regular expression (`\/`) so it's like strings in that respect. – Pointy Sep 18 '12 at 20:26
  • @Pointy, it's not that easy. The `/` could be either the start of a regex literal, or the division operator. – Bart Kiers Sep 18 '12 at 20:28
  • possible duplicate of [Division/RegExp conflict while tokenizing Javascript](http://stackoverflow.com/questions/4726295/division-regexp-conflict-while-tokenizing-javascript) – Bart Kiers Sep 18 '12 at 20:28
  • Also see this Q&A: http://stackoverflow.com/questions/5519596/when-parsing-javascript-what-determines-the-meaning-of-a-slash – Bart Kiers Sep 18 '12 at 20:30
  • @BartKiers oh yes I know that - I almost added another comment. The point is that once you already know you're looking at a regex, you can find the end pretty easily. – Pointy Sep 18 '12 at 20:31
  • @Pointy, sure, but the question is *"when a regular expression **begins** and ends"* :) – Bart Kiers Sep 18 '12 at 20:32
  • @BartKiers ah you're right; I was focusing on the part about the closing quotes. Anyway I consider this a fascinating problem; the standard way that it's handled for JavaScript seems awful. Pascal has a minor lexer-level ambiguity too: arrays are declared with `1..10`, but that can be solved by introducing an "integer-dot-dot" token and (slightly) amending the grammar. Seems like that'd be somewhat harder for JavaScript. – Pointy Sep 18 '12 at 20:35
  • @BartKiers (now that I think of it that's not really an ambiguity in Pascal; it's just a lexer look-ahead issue.) – Pointy Sep 18 '12 at 20:37
  • @Pointy and @BartKiers thanks a lot for your help. I did not realize this was such a tricky problem – Zo72 Sep 18 '12 at 20:44
  • Zo72, did you see the links I posted? – Bart Kiers Sep 18 '12 at 20:55

2 Answers

2

The ECMAScript language specification collects the full grammar for the language in Annex A, written in the specification's own BNF-style notation. It's too large to reproduce here in its entirety, but the production for regular expression literals, in the lexical grammar, is named `RegularExpressionLiteral`.
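As a rough illustration of what that production implies for finding the end of a literal, here is a minimal sketch. It assumes the tokenizer has already decided that the `/` at `start` begins a regex and not a division. `scanRegexLiteral` is a hypothetical name, and the flag check is simplified to ASCII identifier characters, which is narrower than the specification allows. Note that a `/` inside a character class such as `[/]` does not end the literal.

```javascript
// Hypothetical helper: scan a regex literal starting at src[start], assuming the
// caller already knows this "/" begins a regex (not a division operator).
// Returns the index just past the flags, or -1 if the literal is malformed.
function scanRegexLiteral(src, start) {
  const isLineTerminator = (c) =>
    c === "\n" || c === "\r" || c === "\u2028" || c === "\u2029";
  if (src[start] !== "/") return -1;
  let i = start + 1;
  if (src[i] === "*" || src[i] === "/") return -1; // "/*" and "//" begin comments
  let inClass = false; // true while inside [...]
  while (i < src.length) {
    const ch = src[i];
    if (isLineTerminator(ch)) return -1; // a regex literal cannot span lines
    if (ch === "\\") {
      const next = src[i + 1];
      if (next === undefined || isLineTerminator(next)) return -1;
      i += 2; // skip the backslash and the escaped character
      continue;
    }
    if (ch === "[") {
      inClass = true;
    } else if (ch === "]") {
      inClass = false;
    } else if (ch === "/" && !inClass) {
      i += 1;
      // Simplification (assumption): accept only ASCII identifier characters as flags.
      while (i < src.length && /[A-Za-z0-9_$]/.test(src[i])) i += 1;
      return i;
    }
    i += 1;
  }
  return -1; // no closing "/" found
}
```

This only answers where the literal ends; deciding whether a given `/` starts a regex at all is a separate problem that depends on the surrounding tokens.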

  • Too hard for what? It's the official definition of the Javascript language, so it's what you'll need to work with to correctly parse Javascript. Anything else **will** fail or give incorrect results in some circumstances. –  Sep 18 '12 at 23:23
-2

"In JavaScript source code, a regular expression is written in the form of /pattern/modifiers where "pattern" is the regular expression itself, and "modifiers" are a series of characters indicating various options. The "modifiers" part is optional." (Source: JavaScript RegExp Object)

Pablo
  • That does not explain how to make a distinction between the division operator and the start of a regex literal. – Bart Kiers Sep 18 '12 at 20:31
  • I followed the example of how to determine the start and end of a string, and in the question he didn't mention he wanted to know the distinction between the division operator and the start of a regex literal. From his example, an object property could be misinterpreted as a string {"foo": "bar"}, where "foo" is not really a string. I don't think it's a good reason to downvote. – Pablo Sep 18 '12 at 20:44
  • @Pablo Martinez I did not downvote you. I would not vote yours as a right answer either – Zo72 Sep 18 '12 at 20:46
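
On the distinction raised in the comments between the division operator and the start of a regex literal, one common tokenizer-level approach is to look at the previous significant token. The sketch below is a heuristic, not what the specification prescribes: the specification resolves the choice from the syntactic grammar context (its `InputElementDiv` and `InputElementRegExp` goal symbols). The token shape and the name `slashStartsRegex` are assumptions made for illustration.

```javascript
// Hypothetical token shape: { type, value }, where type is one of
// "number", "string", "identifier", "keyword", "punctuator", "regex".
// Returns true if a "/" following prevToken should be read as the start of a regex.
function slashStartsRegex(prevToken) {
  if (prevToken === null) return true; // start of input: no left operand to divide
  switch (prevToken.type) {
    case "number":
    case "string":
    case "identifier":
    case "regex":
      return false; // these end an expression, so "/" is division
    case "keyword":
      // Keywords that behave like values end an expression; others (return, typeof, ...) do not.
      return !["this", "super", "null", "true", "false"].includes(prevToken.value);
    case "punctuator":
      // A closing bracket usually ends an expression; any other punctuator expects an operand.
      return ![")", "]", "}"].includes(prevToken.value);
    default:
      return true;
  }
}
```

Known limits of this heuristic: a `)` that closes an `if (...)` or `while (...)` header and a `}` that closes a block can both be followed by a regex, and postfix `++` or `--` is followed by division. Handling those cases correctly needs parser context, which is why the linked questions are worth reading.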