I need to dependably remove all JavaScript comments with a single Regular Expression.

I have searched StackOverflow, and other sites, but none take into account alternating quotes, multi-line comments, comments within strings, regular expressions, etc.

Is there a regular expression that can remove the comments from this:

var test = [
    "// Code",
    '// Code',
    "'// Code",
    '"// Code',
    //" Comment",
    //' Comment',
    /* Comment */
    // Comment /* Comment
    /* Comment
     Comment // */ "Code",
    "Code",
    "/* Code */",
    "/* Code",
    "Code */",
    '/* Code */',
    '/* Code',
    'Code */',
    /* Comment
    "Comment",
    Comment */ "Code",
    /Code\/*/,
    "Code */"
]

Here's a jsbin or jsfiddle to test it.

wizulus

5 Answers

8

I like challenges :)

Here's my working solution:

/((["'])(?:\\[\s\S]|.)*?\2|\/(?![*\/])(?:\\.|\[(?:\\.|.)\]|.)*?\/)|\/\/.*?$|\/\*[\s\S]*?\*\//gm

Replace that with $1.
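A minimal sketch of how that replacement could be applied, assuming the source to clean is already available as a string. The names `COMMENT_OR_LITERAL`, `stripComments` and `sample` are made up for illustration and are not part of the answer:

```javascript
// The pattern from the answer, copied verbatim as a regex literal.
const COMMENT_OR_LITERAL = /((["'])(?:\\[\s\S]|.)*?\2|\/(?![*\/])(?:\\.|\[(?:\\.|.)\]|.)*?\/)|\/\/.*?$|\/\*[\s\S]*?\*\//gm;

// Hypothetical helper: strings and regex literals land in group 1 and are
// put back by '$1'; comments leave group 1 empty, so they are dropped.
function stripComments(source) {
  return source.replace(COMMENT_OR_LITERAL, '$1');
}

const sample = [
  'var a = "// not a comment"; // a comment',
  'var b = /* gone */ 1;',
].join('\n');

console.log(stripComments(sample));
```

Whitespace around a removed comment is left in place, so the output can contain trailing spaces or doubled spaces.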

Fiddle here: http://jsfiddle.net/LucasTrz/DtGq8/6/

Of course, as it has been pointed out countless times, a proper parser would probably be better, but still...

NB: I used a regex literal in the fiddle instead of a regex string; too much escaping can destroy your brain.


Breakdown

((["'])(?:\\[\s\S]|.)*?\2|\/(?![*\/])(?:\\.|\[(?:\\.|.)\]|.)*?\/) <-- the part to keep
|\/\/.*?$                                                         <-- line comments
|\/\*[\s\S]*?\*\/                                                 <-- inline comments

The part to keep

(["'])(?:\\[\s\S]|.)*?\2                   <-- strings
\/(?![*\/])(?:\\.|\[(?:\\.|.)\]|.)*?\/     <-- regex literals

Strings

    ["']              match a quote and capture it
    (?:\\[\s\S]|.)*?  match escaped characters or unescaped characters, don't capture
    \2                match the same type of quote as the one that opened the string

Regex literals

    \/                          match a forward slash
    (?![*\/])                   ... not followed by a * or / (that would start a comment)
    (?:\\.|\[(?:\\.|.)\]|.)*?   match any sequence of escaped/unescaped text, or a regex character class
    \/                          ... until the closing slash

The part to remove

|\/\/.*?$              <-- line comments
|\/\*[\s\S]*?\*\/      <-- inline comments

Line comments

    \/\/         match two forward slashes
    .*?$         then everything until the end of the line

Inline comments

    \/\*         match /*
    [\s\S]*?     then as few as possible of anything, see note below
    \*\/         match */

I had to use [\s\S] instead of . because, at the time of writing, JavaScript didn't support the regex s modifier (singleline, which allows . to match newlines as well). ES2018 has since added it as the dotAll flag.
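To keep the breakdown readable in code, the pattern could also be assembled from its named parts. This is only a sketch: the constant names are invented here, and it assumes an engine that accepts `String.raw` with regex escapes such as `\2` (ES2018 or later):

```javascript
// Each piece mirrors one line of the breakdown above.
// String.raw keeps the backslashes as written, so no double escaping is needed.
const STRING = String.raw`(["'])(?:\\[\s\S]|.)*?\2`;
const REGEX_LITERAL = String.raw`\/(?![*\/])(?:\\.|\[(?:\\.|.)\]|.)*?\/`;
const LINE_COMMENT = String.raw`\/\/.*?$`;
const BLOCK_COMMENT = String.raw`\/\*[\s\S]*?\*\/`;

// Group 1 wraps "the part to keep"; the quote capture inside STRING is
// therefore group 2, which is why the backreference is \2.
const COMMENT_PATTERN = new RegExp(
  `(${STRING}|${REGEX_LITERAL})|${LINE_COMMENT}|${BLOCK_COMMENT}`,
  'gm'
);

console.log('x = 1; // c'.replace(COMMENT_PATTERN, '$1'));
```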

This regex will work in the following corner cases:

  • Regex patterns containing / in character classes: /[/]/
  • Escaped newlines in string literals
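The two corner cases above can be checked with a short script. The inputs are invented for illustration, and `strip` is a hypothetical helper name:

```javascript
const PATTERN = /((["'])(?:\\[\s\S]|.)*?\2|\/(?![*\/])(?:\\.|\[(?:\\.|.)\]|.)*?\/)|\/\/.*?$|\/\*[\s\S]*?\*\//gm;
const strip = (source) => source.replace(PATTERN, '$1');

// A regex literal whose character class contains a slash.
const withClass = strip('var re = /[/]/; // trailing comment');

// A string continued over two lines with backslash + newline; the "//" on
// the second line is still inside the string and has to survive.
const withContinuation = strip('var s = "a \\\n// b"; // trailing comment');

console.log(JSON.stringify(withClass));
console.log(JSON.stringify(withContinuation));
```

In both cases only the trailing comment should disappear, while the literal itself is kept.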

Final boss fight

And just for the fun of it... here's the eye-bleeding hardcore version:

/((["'])(?:\\[\s\S]|.)*?\2|(?:[^\w\s]|^)\s*\/(?![*\/])(?:\\.|\[(?:\\.|.)\]|.)*?\/(?=[gmiy]{0,4}\s*(?![*\/])(?:\W|$)))|\/\/.*?$|\/\*[\s\S]*?\*\//gm

This adds support for the following twisted edge cases (fiddle, regex101):

Code = /* Comment */ /Code regex/g  ; // Comment
Code = Code / Code /* Comment */ /g  ; // Comment    
Code = /Code regex/g /* Comment */  ; // Comment

This is highly heuristic code; you probably shouldn't use it (even less so than the previous regex) and should just let that edge case blow.
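To see the difference between the two patterns, here is a small comparison on the second edge-case line (without its trailing spaces). The constant names `SIMPLE` and `HEURISTIC` are made up for this sketch:

```javascript
// First pattern from this answer.
const SIMPLE = /((["'])(?:\\[\s\S]|.)*?\2|\/(?![*\/])(?:\\.|\[(?:\\.|.)\]|.)*?\/)|\/\/.*?$|\/\*[\s\S]*?\*\//gm;
// "Final boss" pattern from this answer.
const HEURISTIC = /((["'])(?:\\[\s\S]|.)*?\2|(?:[^\w\s]|^)\s*\/(?![*\/])(?:\\.|\[(?:\\.|.)\]|.)*?\/(?=[gmiy]{0,4}\s*(?![*\/])(?:\W|$)))|\/\/.*?$|\/\*[\s\S]*?\*\//gm;

const line = 'Code = Code / Code /* Comment */ /g  ; // Comment';

// SIMPLE reads "/ Code /" as a regex literal, so the block comment survives.
const simpleResult = line.replace(SIMPLE, '$1');
// HEURISTIC does not treat the division as a regex, so both comments go.
const heuristicResult = line.replace(HEURISTIC, '$1');

console.log(JSON.stringify(simpleResult));
console.log(JSON.stringify(heuristicResult));
```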

Lucas Trzesniewski
  • I'll edit the answer to provide a detailed breakdown, but for now, it means: match an escaped character, or an unescaped one, without capturing it. – Lucas Trzesniewski Jul 01 '14 at 20:08
  • No. `\.` simply matches a dot. Also, the escaped character could be a newline, and the non-escaped character must neither be a backslash nor a delimiter. – Bergi Jul 01 '14 at 20:28
  • @Bergi Whoops, of course for the `\.`, thanks. But in JS `.` doesn't match a newline, and the `.` won't match a backslash because of the first alternative – Lucas Trzesniewski Jul 01 '14 at 20:30
  • Yes, it doesn't, but it *should*. And the lone `.` will only not match a backslash if there is no backtracking. Sorry for nitpicking :-) – Bergi Jul 01 '14 at 20:36
  • @Bergi No, `.` should **not** match a newline _unless_ the `s` modifier is set. This modifier doesn't exist in JS (see the note at the bottom of the answer). The lone `.` is not a problem here. – Lucas Trzesniewski Jul 01 '14 at 20:37
  • Also, added the case for the `/[/]/` regex pattern, thanks for pointing that out – Lucas Trzesniewski Jul 01 '14 at 20:38
  • I mean that string literals can contain newlines (if escaped), but your regex doesn't match them. – Bergi Jul 01 '14 at 20:39
  • Argh, sorry for the misunderstanding. I didn't know that. I'll update the regex. ;) – Lucas Trzesniewski Jul 01 '14 at 20:41
  • @Bergi I forgot to say: actually, thanks for your nitpicking :) – Lucas Trzesniewski Jul 01 '14 at 20:48
  • Wow, thanks @LucasTrzesniewski ! Also thanks for pointing out how mind numbing escaping can be :). Here's a more elegant fix for the readability: http://jsfiddle.net/alancnet/DtGq8/3/ – wizulus Jul 01 '14 at 21:00
  • @alancnet hehe, also, take a look at that last edit :P – Lucas Trzesniewski Jul 01 '14 at 22:07
  • Very nice! In fact, your latest expression also fixes a bug I encountered with the first.. Windows! The first didn't match \r\n, and didn't filter my file correctly. I fixed it with [\r|\n], but by the time I got back to tell you, you had this FINAL BOSS FIGHT! +1 for all the hard work. – wizulus Jul 01 '14 at 22:16
  • @alancnet Yeah there was a problem I noticed while I was doing that final regex. I've updated the first one though (using `$` and the `m` modifier), you should use that one instead, even if it's less *powerful*. – Lucas Trzesniewski Jul 01 '14 at 22:20
  • What is the reason not to use RegExp? Especially the boss fight expression? Is it because it's not human readable (except for super geniuses like yourself)? or is it that it's poor performance? – wizulus Jul 01 '14 at 22:28
  • @alancnet The reason I'm advising you not to use that final one is that because of the heuristics I've introduced in it, the regex may fail in subtle ways on syntax I did not anticipate. The first regex is simpler, therefore it's more likely to be correct, even if we know it blows in one very unlikely case. That should be acceptable. Performance is not a problem at all, this regex is more likely to be *faster* than a parser written in JS. Also, I'm not a genius lol, I've just written a lot of regexes in the past ;) – Lucas Trzesniewski Jul 01 '14 at 22:34
1

First off, I suggest doing this with a proper JavaScript parser instead. Check out this previous Q&A: JavaScript parser in JavaScript

For the input you've provided [1], here is a solution that might work:

Match the pattern:

/("(?:[^\r\n\\"]|\\.)*"|'(?:[^\r\n\\']|\\.)*'|\/[^*\/]([^\\\/]|\\.)*\/[gm]*)|\/\/[^\r\n]*|\/\*[\s\S]*?\*\//g

Here's a breakdown of the pattern:

/
  (                                     # start match group 1
      "(?:[^\r\n\\"]|\\.)*"             #   match a double quoted string
    | '(?:[^\r\n\\']|\\.)*'             #   match a single quoted string
    | \/[^*\/]([^\\\/]|\\.)*\/[gm]*     #   match a regex literal
  )                                     # end match group 1
  | \/\/[^\r\n]*                        # match a single-line comment
  | \/\*[\s\S]*?\*\/                    # match a multi-line comment
/g

and replace it with $1 (match group 1). The trick here is that anything besides a comment is matched in group 1, which gets replaced with itself, while comments get replaced with an empty string.
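The same idea can be spelled out with a replacer function instead of the `$1` string, which makes the "keep group 1, drop everything else" logic explicit. A rough sketch; `KEEP_OR_COMMENT` and `removeComments` are names invented here:

```javascript
// Pattern from this answer, unchanged.
const KEEP_OR_COMMENT = /("(?:[^\r\n\\"]|\\.)*"|'(?:[^\r\n\\']|\\.)*'|\/[^*\/]([^\\\/]|\\.)*\/[gm]*)|\/\/[^\r\n]*|\/\*[\s\S]*?\*\//g;

function removeComments(source) {
  // `kept` is group 1: defined for strings and regex literals,
  // undefined when the match was a comment.
  return source.replace(KEEP_OR_COMMENT, (match, kept) =>
    kept === undefined ? '' : kept
  );
}

console.log(removeComments('var s = "// keep"; // drop\nvar t = 1; /* drop */'));
```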

Here's a regexr demo that shows the following replacement:

  var test = [
      "// Code",
      '// Code',
      "'// Code",
      '"// Code',




       "Code",
      "Code",
      "/* Code */",
      "/* Code",
      "Code */",
      '/* Code */',
      '/* Code',
      'Code */',
       "Code",
      /Code\/*/,
      "Code */"
  ]

[1] Again, a parser is the way to go since regex literals might be confused with the division operator. If you have an assignment like var x = a / b / g; in your source, the solution above will break!
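One way the division confusion can surface, shown on a made-up input: a division followed by a line comment. The division slash and the first slash of `//` get read as a regex literal, so the comment is never recognised. The names `PATTERN_WITH_REGEX_LITERALS` and `clean` are hypothetical:

```javascript
// Pattern from this answer, unchanged.
const PATTERN_WITH_REGEX_LITERALS = /("(?:[^\r\n\\"]|\\.)*"|'(?:[^\r\n\\']|\\.)*'|\/[^*\/]([^\\\/]|\\.)*\/[gm]*)|\/\/[^\r\n]*|\/\*[\s\S]*?\*\//g;
const clean = (source) => source.replace(PATTERN_WITH_REGEX_LITERALS, '$1');

// Control: no division, the trailing comment is removed.
const control = clean('var y = a + b; // sum');

// Division: "/ b; /" is taken for a regex literal, the comment stays.
const confused = clean('var x = a / b; // half');

console.log(JSON.stringify(control));
console.log(JSON.stringify(confused));
```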

Bart Kiers
  • You seem to be well aware of the problems with this - why even suggest a regular expression? This fails for so many cases. – Benjamin Gruenbaum Jul 01 '14 at 20:00
  • @BenjaminGruenbaum, for the fun of it :) – Bart Kiers Jul 01 '14 at 20:01
  • Solid retort :) It does look fun, you might want to suggest esprima and escodegen for something saner you can use to remove comments with in a line. – Benjamin Gruenbaum Jul 01 '14 at 20:02
  • @BenjaminGruenbaum, I was reluctant to recommend specific JS parsers since I have no personal experience with any of them. But I can post a previous Q&A about JS parser. – Bart Kiers Jul 01 '14 at 20:03
  • Why are single-quoted strings allowed to contain linebreaks but double-quoted ones are not? (btw, the `.` does not match an escaped linebreak). Also, `/[/]/` is a valid regular expression. – Bergi Jul 01 '14 at 20:04
  • @Bergi, forgot to include `\r\n`. Luckily `/[/]/` wasn't in the OP's test set! All jokes aside, I think my example with the division operator should scare away any sane mind from going the regex way here (besides the eye-bleeding regex "solution"...). – Bart Kiers Jul 01 '14 at 20:10
0

I suggest parsing the JavaScript with a JavaScript parser that is itself written in JavaScript, and then using the parser's API to strip out what you don't want. I have not done this personally, but regular expressions should be limited to regular content, and I doubt JS qualifies.

Here are some good places to look.

JavaScript parser in JavaScript

gahooa
0

Is there a regular expression that can remove the comments

No. You cannot build a regex that will match a comment (so that you simply can replace the match with the empty string), because without lookbehind it is impossible to determine whether //" is a comment or the end of a string literal.

You could use a regex as a tokenizer (you "only" need to take care of string literals, regex literals, and the two types of comments), but I'd recommend using a full-blown JavaScript parser; they are freely available.
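A very rough sketch of the tokenizer idea, under loud assumptions: it only recognises quoted strings and the two comment types, and deliberately leaves out regex literals and template literals, so it is not a complete solution. All names are invented for illustration, and it needs named groups and `matchAll` (ES2020):

```javascript
// Assumption: regex literals and template literals are NOT handled here.
const TOKEN = /(?<string>"(?:\\[\s\S]|[^"\\\r\n])*"|'(?:\\[\s\S]|[^'\\\r\n])*')|(?<line>\/\/[^\r\n]*)|(?<block>\/\*[\s\S]*?\*\/)/g;

// Classify every match as a string or one of the comment kinds.
function tokenize(source) {
  const tokens = [];
  for (const m of source.matchAll(TOKEN)) {
    const kind = m.groups.string !== undefined ? 'string'
      : m.groups.line !== undefined ? 'lineComment'
      : 'blockComment';
    tokens.push({ kind, text: m[0], index: m.index });
  }
  return tokens;
}

// Rebuild the source, skipping only the comment tokens.
function stripWithTokens(source) {
  let out = '';
  let last = 0;
  for (const token of tokenize(source)) {
    if (token.kind !== 'string') {
      out += source.slice(last, token.index);
      last = token.index + token.text.length;
    }
  }
  return out + source.slice(last);
}

console.log(tokenize('a = "//x"; // y /* z */'));
console.log(JSON.stringify(stripWithTokens('a = "//x"; // y /* z */')));
```

Because strings are consumed as tokens first, comment markers inside them are never seen as comments, which is the point of tokenizing rather than matching comments directly.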

Bergi
  • I'm not convinced about the first "no", I'm pretty sure it's possible to determine whether `//"` is a comment or the end of a string literal depending on where in the regular expression you are. Moreover - intuitively since you only have a finite amount of stuff to save - this should be regular - I can certainly imagine a DFA for this. Your second paragraph is spot on though. – Benjamin Gruenbaum Jul 01 '14 at 20:06
  • I mean a regular expression that **only** matches the comment - that seems to be impossible. Sure, with capturing groups and an intelligent replacer function that sorts out comments but keeps literals (similar to what BartKiers has done), it could be done. – Bergi Jul 01 '14 at 20:10
-1

test.replace(/(\/\*([\s\S]*?)\*\/)|(\/\/(.*)$)/gm, '');