I'm parsing some JavaScript code and need to extract all the regular expression literals in it. The pattern /\/(.(?:[^\/])|\\)+\/[gmi]*/gi looks right, but in some cases it gives wrong results.

For example, for this code:

html = html.replace(/\</g, '&lt;').replace(/\>/g, '&gt;').replace(/\&/g, '&amp;');

match() gives two stupid results: /\</g, '&lt;' ).replace( / and /\&/g

I can't seem to make it work.

Toothbrush
  • I recommend using an actual JavaScript parser like http://esprima.org/. Then you can walk the AST and get all regular expression literals easily. – Felix Kling Mar 02 '14 at 19:12

1 Answer

3

You will not get away with this using a single regex. You've now stumbled upon one corner case that your regex doesn't handle properly, but there are many, many more. Your regex will break when something that looks like the start of a regex literal appears inside a multi- or single-line comment, or when a / occurs inside a string literal.

The only way to solve this reliably is to parse the JavaScript and inspect the token stream the parser (or lexer) produces.

To get started, see: JavaScript parser in JavaScript

user3371384 wrote:

I don't care about comments, because I remove them before getting regexp literals, same about strings.

Regardless, there are more corner cases:

var e = 8, f = 4, g = 2;
// ...
var x = e/f/g; // your regex will match `/f/g` as a regex literal
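
To make the counterexample concrete, here is a minimal sketch that runs a slash-to-slash pattern (the one suggested at the end of this answer) over that line. The variable names `naive`, `source` and `found` are only illustrative.

```javascript
// Slash-to-slash pattern from this thread; it has no notion of division.
const naive = /\/([^\/]|\\.)+\/[gmi]*/gi;

// Plain arithmetic: two divisions, no regex literal anywhere.
const source = 'var x = e/f/g;';

const found = source.match(naive);
console.log(found); // [ '/f/g' ]  (the division chain is reported as a regex literal)
```

The pattern cannot tell that `e` is an operand, so the slash after it is read as the start of a literal.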

user3371384 wrote:

In many code parsers the same algorithm is used: find a slash, then find the next slash (with no backslash before it); all chars in between are the regexp.

That may well be, but it is a very inaccurate algorithm (as the counterexample I gave above shows). There's also the division-assignment operator /=, which could trip up the regex.
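
One common way to make the slash-to-slash idea less inaccurate is to look at what precedes the slash: after an operand (identifier, number, `)` or `]`) a slash is division, otherwise it may open a regex literal. The sketch below is only a heuristic under stated assumptions, not a replacement for a real tokenizer: it assumes comments and string literals containing slashes have already been stripped, as the asker describes, and `findRegexLiterals` and `scanRegex` are hypothetical helper names. It still misses cases such as `return /x/`, where a keyword looks like an operand.

```javascript
// Returns the index just past the regex literal starting at `start`, or -1.
function scanRegex(code, start) {
  // `//` and `/*` would be comments, which are assumed to be stripped already.
  if (code[start + 1] === '/' || code[start + 1] === '*') return -1;
  let inClass = false; // inside [...], an unescaped / does not end the literal
  for (let j = start + 1; j < code.length && code[j] !== '\n'; j += 1) {
    const c = code[j];
    if (c === '\\') { j += 1; continue; } // skip the escaped character
    if (c === '[') inClass = true;
    else if (c === ']') inClass = false;
    else if (c === '/' && !inClass) {
      let end = j + 1;
      while (end < code.length && /[a-z]/i.test(code[end])) end += 1; // flags
      return end;
    }
  }
  return -1; // no closing slash on this line
}

function findRegexLiterals(code) {
  const found = [];
  let afterOperand = false; // true when the previous token could end an expression
  let i = 0;
  while (i < code.length) {
    const ch = code[i];
    if (/\s/.test(ch)) { i += 1; continue; }
    if (ch === '/' && !afterOperand) {
      const end = scanRegex(code, i);
      if (end !== -1) {
        found.push(code.slice(i, end));
        afterOperand = true; // a regex literal is itself an operand
        i = end;
        continue;
      }
    }
    // Identifier characters, digits, `)` and `]` can end an operand.
    afterOperand = /[\w$)\]]/.test(ch);
    i += 1;
  }
  return found;
}

console.log(findRegexLiterals('var x = e/f/g;'));        // []
console.log(findRegexLiterals('a /= 2;'));               // []
console.log(findRegexLiterals('ok = /[/]/.test(str);')); // [ '/[/]/' ]
```

The remaining gaps (keywords, `}`, template literals) are the reason a proper lexer is the more reliable route.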

Anyway, you seem to have made up your mind about using a regex for this...

You put the . in the wrong place: you only want to match any character after a backslash. Try this:

/\/([^\/]|\\.)+\/[gmi]*/gi
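
As a quick check, the sketch below applies that pattern to the line from the question. It also shows one case the pattern still gets wrong, an escaped slash inside the literal, together with an assumed refinement (`stricter`, not part of the original suggestion) that excludes the backslash from the character class so an escape is always consumed as a pair.

```javascript
// The line from the question, kept raw so the backslashes survive.
const source = String.raw`html = html.replace(/\</g, '&lt;').replace(/\>/g, '&gt;').replace(/\&/g, '&amp;');`;

// Pattern suggested above.
const suggested = /\/([^\/]|\\.)+\/[gmi]*/gi;
console.log(source.match(suggested)); // three matches: /\</g, /\>/g and /\&/g

// Assumed refinement: [^\/\\] can no longer swallow a lone backslash,
// so \/ is always matched by the \\. alternative.
const stricter = /\/(?:[^\/\\]|\\.)+\/[gmi]*/g;

const escaped = String.raw`x = /a\/b/g;`;
console.log(escaped.match(suggested)); // one match: /a\/  (cut off at the escaped slash)
console.log(escaped.match(stricter));  // one match: /a\/b/g
```

Neither pattern addresses the division problem described earlier; they only fix where the literal ends.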
Community
Bart Kiers
  • ... or wherever the regexp has nested parentheses representing a sub-regular expression. The real point is that the notation of regular expressions is not itself regular. Rather, it is context-free. OP is trying to recognize a context-free construct with regular expressions, which is always doomed to failure. Yes, he needs a real parser. – Ira Baxter Mar 02 '14 at 21:13
  • I don't care about comments, because I remove them before getting regexp literals, same about strings. The problem is getting regexp literal from code [where no comments and strings]. – user3371384 Mar 03 '14 at 04:22
  • In many code parsers the same algorithm is used: find a slash, then find the next slash (with no backslash before it); all chars in between are the regexp. This algorithm is good enough for my purposes, and `/\/(.(?:[^\/])|\\)+\/[gmi]*/gi` seems correct, but in fact it is not. What is wrong there? – user3371384 Mar 03 '14 at 04:28
  • Thank you Bart, I'm convinced you are absolutely right. This complicated issue needs a complicated solution. – user3371384 Mar 03 '14 at 17:22