I'm writing a simple JavaScript tokenizer that detects basic types: Word, Number, String, RegExp, Operator, Comment and Newline. Everything is going fine, but I can't work out how to tell whether the current character is a RegExp delimiter or a division operator. I'm not using regular expressions because they are too slow. Does anybody know a mechanism for detecting this? Thanks.
2 Answers
7
You can tell by what the preceding token is in the stream. Go through each token that your lexer emits and ask whether it can reasonably be followed by a division sign or a regexp; you'll find that the two resulting sets of tokens are disjoint. For example, `(`, `[`, `{`, `;`, and all of the binary operators can only be followed by a regexp. Likewise, `)`, `]`, `}`, identifiers, and string/number literals can only be followed by a division sign.
See Section 7 of the ECMAScript spec for more details.
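The previous-token rule above could be sketched roughly as follows. The token shape (`{ type, value }`), the function name `slashStartsRegExp`, and the choice to file punctuators under the `Operator` type are assumptions made for illustration. The keyword set is an addition not mentioned above: words such as `return` or `typeof` are lexed like identifiers but are followed by an expression, so a regexp can come after them. Treating `)`, `}`, `++` and `--` as always ending an expression is a heuristic, not a complete solution.

```javascript
// Hypothetical token shape: { type: "Word" | "Number" | "String" | "RegExp" | "Operator", value: string }

// Keywords that are followed by an expression, so a "/" after them starts a regexp.
const KEYWORDS_BEFORE_EXPRESSION = new Set([
  "return", "typeof", "instanceof", "in", "new", "delete",
  "void", "throw", "case", "do", "else",
]);

// Closing punctuators that usually end an expression.
const CLOSERS = new Set([")", "]", "}"]);

// prev is the last significant token (skipping whitespace, comments and
// newlines), or null at the start of the input.
function slashStartsRegExp(prev) {
  if (prev === null) return true;
  switch (prev.type) {
    case "Number":
    case "String":
    case "RegExp":
      return false; // a literal ends an expression, so "/" is division
    case "Word":
      return KEYWORDS_BEFORE_EXPRESSION.has(prev.value);
    case "Operator":
      // "++" and "--" are ambiguous; this sketch assumes the postfix form.
      if (prev.value === "++" || prev.value === "--") return false;
      return !CLOSERS.has(prev.value);
    default:
      return true;
  }
}

console.log(slashStartsRegExp({ type: "Operator", value: "(" })); // true
console.log(slashStartsRegExp({ type: "Word", value: "x" }));     // false
console.log(slashStartsRegExp({ type: "Word", value: "return" })); // true
```

Because the decision needs only the previous significant token, the tokenizer can keep a single `prev` variable and stay a one-pass scanner.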

pmdboi
- I wrote a tokenizer once, and these are the REs that detect a "regex trigger": `/[{(\[;,]/` `/\+\+|--|~|&&|\?|:|\|\||\\$|(<<|>>>?|==?|!=?|[-<>+*%&\|\^/])=?/` `/^(?=\s|\/| – user123444555621 Feb 13 '11 at 16:07
- Technically, there are a couple of ambiguities that are unavoidable at the lexical level. For example, `(a+b)/c` vs. `if (x) /foo/.exec('bar')` (close-paren can precede either). Also, `++ /foo/.abc` and `a++ / b` (plus-plus can precede either). Together with `--` these are the only ones I know of. – dgreensp Sep 04 '12 at 21:45
- @dgreensp Thanks, very useful observations! – Zo72 Sep 18 '12 at 20:31
- There's also a problem with `}`: `function f() {}`(newline)`/1/g` versus `var x = {}`(newline)`/1/g`, since the latter [doesn't](http://es5.github.com/#x7) enforce semicolon insertion. – user123444555621 Jan 29 '13 at 12:28
2
You have to check the context when you encounter a slash. If the slash comes after an expression, it must be division; otherwise it is the start of a regexp.
To recognize the context, you may have to write a syntax parser.
For example:
function f() {}
/1/g
// In this case the slash follows a function declaration, so it is the start of a regexp.
var a = {}
/1/g;
// In this case the slash follows an object literal (an expression), so it is division.
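Some context can be tracked without a full parser. As a rough sketch, the close-paren case (`(a+b)/c` versus `if (x) /foo/`) can be handled by remembering, for each `(`, whether it opened the head of an `if`, `while`, `for` or `with` statement. The class name `ParenTracker`, its methods, and the `{ type, value }` token shape are hypothetical. It does not resolve the `}` cases shown above, which need to know whether the braces form a block or an object literal, and it is fooled by a property that happens to be named `if`, as in `a.if(x) / 2`.

```javascript
// Keywords whose parenthesized head is followed by a statement, not by an operator.
const STATEMENT_HEAD_KEYWORDS = new Set(["if", "while", "for", "with"]);

// Hypothetical helper: records for every open "(" whether it began a statement head.
class ParenTracker {
  constructor() {
    this.stack = [];
    this.lastClosedWasStatementHead = false;
  }

  // Call once per significant token; prev is the token before it, or null.
  see(token, prev) {
    if (token.type !== "Operator") return;
    if (token.value === "(") {
      const isHead = prev !== null && prev.type === "Word" &&
        STATEMENT_HEAD_KEYWORDS.has(prev.value);
      this.stack.push(isHead);
    } else if (token.value === ")") {
      // An unmatched ")" is treated as an ordinary expression end.
      this.lastClosedWasStatementHead =
        this.stack.length > 0 ? this.stack.pop() : false;
    }
  }

  // Meaningful right after a ")" token: true if a following "/" starts a regexp.
  regExpAllowedAfterCloseParen() {
    return this.lastClosedWasStatementHead;
  }
}

// Feeds a token list through a fresh tracker and returns it.
function track(tokens) {
  const tracker = new ParenTracker();
  let prev = null;
  for (const token of tokens) {
    tracker.see(token, prev);
    prev = token;
  }
  return tracker;
}

const op = (value) => ({ type: "Operator", value });
const word = (value) => ({ type: "Word", value });

// if (x)   then "/" starts a regexp
console.log(track([word("if"), op("("), word("x"), op(")")])
  .regExpAllowedAfterCloseParen()); // true
// (a + b)  then "/" is division
console.log(track([op("("), word("a"), op("+"), word("b"), op(")")])
  .regExpAllowedAfterCloseParen()); // false
```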

define.cc