2

I am trying to write a regex (in JavaScript) that will match a multi line comment at the beginning of a JS file.

So far, I came up with this: /^(\/\*[^\*\/]*\*\/)/g

It works for a single line comment: https://regex101.com/r/ZS5PVI/1

But my problem is that it does not work for a multi line comment: https://regex101.com/r/ZS5PVI/2

Do you have any ideas how to solve it?

trincot
warpech

6 Answers

4

Like HTML, JavaScript cannot be parsed by regular expressions. Attempting to do so correctly is futile.

Instead, you must use a parser that correctly transforms JavaScript source code into an AST, which you can then inspect programmatically. Fortunately, there are libraries that do the parsing for you.

Here's a working example that outputs the AST of this code:

/* this is a
multi-line
comment */

var test = "this is a string, /* and this is not a comment! */";

// ..but this is

Which gets us:

[
  "toplevel",
  [
    [
      {
        "name": "var",
        "start": {
          "type": "keyword",
          "value": "var",
          "line": 5,
          "col": 4,
          "pos": 57,
          "endpos": 60,
          "nlb": true,
          "comments_before": [
            {
              "type": "comment2",
              "value": " this is a\n    multi-line\n    comment ",
              "line": 1,
              "col": 4,
              "pos": 5,
              "endpos": 47,
              "nlb": true
            }
          ]
        },
        "end": {
          "type": "punc",
          "value": ";",
          "line": 5,
          "col": 67,
          "pos": 120,
          "endpos": 121,
          "nlb": false,
          "comments_before": []
        }
      },
      [
        [
          "test",
          [
            {
              "name": "string",
              "start": {
                "type": "string",
                "value": "this is a string, /* and this is not a comment! */",
                "line": 5,
                "col": 15,
                "pos": 68,
                "endpos": 120,
                "nlb": false,
                "comments_before": []
              },
              "end": {
                "type": "string",
                "value": "this is a string, /* and this is not a comment! */",
                "line": 5,
                "col": 15,
                "pos": 68,
                "endpos": 120,
                "nlb": false,
                "comments_before": []
              }
            },
            "this is a string, /* and this is not a comment! */"
          ]
        ]
      ]
    ]
  ]
]

Now it's just a matter of looping over the AST and extracting what you need.
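
A minimal sketch of that loop, assuming the AST has the shape printed above (nested arrays whose token objects carry a `comments_before` array, with `comment2` marking the block comment as in that output). The helper name `collectComments` and the trimmed-down `sampleAst` are hypothetical and not part of any parser library:

```javascript
// Hypothetical helper: recursively collects every entry found in a
// `comments_before` array anywhere inside the AST.
function collectComments(node, found = []) {
  if (Array.isArray(node)) {
    for (const child of node) collectComments(child, found);
  } else if (node !== null && typeof node === "object") {
    for (const [key, value] of Object.entries(node)) {
      if (key === "comments_before" && Array.isArray(value)) {
        found.push(...value);
      } else {
        collectComments(value, found);
      }
    }
  }
  return found;
}

// Trimmed-down sample in the same shape as the output above (assumption).
const sampleAst = [
  "toplevel",
  [[
    {
      name: "var",
      start: {
        type: "keyword",
        value: "var",
        comments_before: [
          { type: "comment2", value: " this is a\nmulti-line\ncomment ", line: 1 }
        ]
      },
      end: { type: "punc", value: ";", comments_before: [] }
    },
    [["test", [{ name: "string" }, "this is a string, /* and this is not a comment! */"]]]
  ]]
];

const comments = collectComments(sampleAst);
const firstBlock = comments.find((c) => c.type === "comment2");
console.log(firstBlock ? firstBlock.value : "no block comment found");
```

Because the string literal is stored as a plain value and never appears in a `comments_before` array, the `/* ... */` inside it is not reported as a comment.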

Community
  • 1
  • 1
josh3736
2

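Your suggested regex doesn't work because there is a * in the comment: the character class [^\*\/] rejects any single * or / inside the comment body, not just the closing */ sequence. Additionally, because of the ^ anchor, it will only look for comments that are right at the beginning of the file.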

Try using this instead:

/\/\*[\s\S]*?\*\//
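
A small sketch comparing the two patterns on a made-up input (the `source` string is an assumption for illustration). Since the question only wants a comment at the very start of the file, this sketch puts the ^ anchor back in front of the lazy pattern; that is an adaptation, not part of the pattern above:

```javascript
// Hypothetical file contents: a multi-line header comment, then code.
const source = [
  "/**",
  " * License header",
  " */",
  "var x = 1; /* later comment */"
].join("\n");

// Pattern from the question: [^\*\/] rejects every single * or /.
const original = /^(\/\*[^\*\/]*\*\/)/;

// Lazy pattern, anchored. Without the m flag, ^ matches only at the
// start of the whole input, so only a leading comment can match.
const leading = /^\/\*[\s\S]*?\*\//;

const originalMatches = original.test(source);
const m = source.match(leading);

console.log(originalMatches);  // false
console.log(m ? m[0] : null);  // the three-line header comment
```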
Niet the Dark Absol
2

There's a pretty good discussion of this problem at this link. Does that help you?

His solution was:

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/

1

Try

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/

This page goes into some detail on how to find multi-line comments.
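
Note that the pattern above is written without regex delimiters, so its first and last / are the slashes of the comment markers themselves. A sketch of how it might look as a JavaScript regex literal, with those slashes escaped (the `text` sample is made up for illustration):

```javascript
// Same pattern as above, with the literal slashes escaped so they are not
// read as the delimiters of the JavaScript regex literal.
const blockComment = /\/\*([^*]|[\r\n]|(\*+([^*\/]|[\r\n])))*\*+\//;

// Hypothetical input with stars inside the comment and a second comment after it.
const text = "/* a\n * b **/ rest /* c */";

const found = text.match(blockComment);
console.log(found ? found[0] : null);  // "/* a\n * b **/"
```

No lazy quantifier is needed here: the comment body can only consume a run of stars when it is followed by something other than a slash, so the match stops at the first closing */.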

sachleen
0

Here's one that will match any multi-line or single-line comments:

/(\/\*.*?\*\/|\/\/[^\n]+)/

If you just want multi-line matches, ditch the second half:

/\/\*.*?\*\//

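For both of these, make sure you have the s (dotAll) flag set so that . matches newlines. In JavaScript this flag requires ES2018 or later; on older engines, use [\s\S] in place of the dot.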
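
A short sketch of the effect of the flag, assuming an ES2018-capable engine; the `code` sample is made up, and the g flag is added here only so that `match` returns every occurrence:

```javascript
// Hypothetical input: a two-line block comment, a line comment, a one-line block comment.
const code = "/* first\n   block */\nvar a = 1; // trailing note\n/* second */";

// With s, the dot also matches line terminators.
const withFlag = code.match(/\/\*.*?\*\//gs);

// Without s, the dot stops at the newline, so the two-line comment is missed.
const withoutFlag = code.match(/\/\*.*?\*\//g);

// Combined pattern: block comments or line comments.
const both = code.match(/(\/\*.*?\*\/|\/\/[^\n]+)/gs);

console.log(withFlag);     // [ '/* first\n   block */', '/* second */' ]
console.log(withoutFlag);  // [ '/* second */' ]
console.log(both.length);  // 3
```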

Cecchi
0

I'm no JavaScript expert, but it seems both C-style (/* */) and C++-style (//) comments have to be taken into account.
Doing this properly means quoted strings have to be accounted for in the process too (escapes and all that).

Below are two regex methods that work. Regex 1 finds the first C-style comment directly: its first successful match is the result. Regex 2 is the general case: run globally, it matches C-style comments, C++-style comments, or non-comment text, and lets you break out of the loop once you find what you want.

Tested here http://ideone.com/i1UWr

Code

var js = '\
// /* C++ comment  */      \\\n\
   /* C++ comment (cont) */  \n\
/* t "h /* is"               \n\
 is first C-style /*         \n\
//  comment */               \n\
and /*second C-style*/       \n\
then /*last C-style*/        \n\
';

var cmtrx1 = /^(?:\/\/(?:[^\\]|\\\n?)*?\n|(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[^\/"'\\]*))+(\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/)/;

var cmtrx2 = /(\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/)|(\/\/(?:[^\\]|\\\n?)*?)\n|(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^\/"'\\]*)/g;

//
print ('Script\n===========\n'+js+'\n===========\n\n');

var match;
//
print ("Using Regex 1\n---------------\n");
if ((match=cmtrx1.exec( js )) != null)
    print ("Found C style comment:\n'" + match[1] + "'\n\n");
//
print ("Using Regex 2\n---------------\n");
while ((match=cmtrx2.exec( js )) != null)
{
   if (match[1] != undefined)
   {
        print ("- C style :\n'" + match[1] + "'\n");
        // break;     // uncomment to stop after first c-style match
   }
   // comment this to not print it
   if (match[2] != undefined)
   {
        print ("- C++ style :\n'" + match[2] + "'\n");
   }
}

Output

Script
===========
// /* C++ comment  */      \
   /* C++ comment (cont) */  
/* t "h /* is"               
 is first C-style /*         
//  comment */               
and /*second C-style*/       
then /*last C-style*/        

===========


Using Regex 1
---------------

Found C style comment:
'/* t "h /* is"               
 is first C-style /*         
//  comment */'


Using Regex 2
---------------

- C++ style :
'// /* C++ comment  */      \
   /* C++ comment (cont) */  '

- C style :
'/* t "h /* is"               
 is first C-style /*         
//  comment */'

- C style :
'/*second C-style*/'

- C style :
'/*last C-style*/'

Expanded Regexes

Regex 1:

/^(?:\/\/(?:[^\\]|\\\n?)*?\n|(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[^\/"'\\]*))+(\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/)/

     /^
     (?:
          \/\/
          (?: [^\\] | \\\n? )*?
          \n
       |
          (?:
               "
               (?: \\[\S\s] | [^"\\] )*
               "
            |  '
               (?: \\[\S\s] | [^'\\] )*
               '
            |  [^\/"'\\]*
          )
     )+
1    (
          \/\* [^*]* \*+
          (?: [^\/*] [^*]* \*+ )*
          \/
1    )
     /


Regex 2:

/(\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/)|(\/\/(?:[^\\]|\\\n?)*?)\n|(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^\/"'\\]*)/g

     /
1    (
          \/\* [^*]* \*+
          (?: [^\/*] [^*]* \*+ )*
          \/
1    )
  |
2    (
          \/\/
          (?: [^\\] | \\\n? )*?
2    )
     \n
  |
     (?:
          "
          (?: \\[\S\s] | [^"\\] )*
          "
       |  '
          (?: \\[\S\s] | [^'\\] )*
          '
       |  [\S\s][^\/"'\\]*
     )
     /g