3

I'm trying to construct a regular expression to strip all comments from JavaScript code, both single-line (//...) and multiline (/*..*/). This is what I've come up with:

/\"[^\"]*\"|'[^']*'|(\/\/.*$|\/\*[^\*]*\*\/)/mg

Description: As you can see, it also searches for string literals. This is because string literals can contain content that would otherwise match the comment patterns (for example, location.href = "http://www.domain.com"; would match as a single-line comment). So I put the string literal patterns first among the alternatives. Following these are the two patterns intended to catch single-line comments and multiline comments, respectively. These are enclosed in the same capturing group, so that I can use string.replace(pattern, "") to remove the comments.
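
One caveat with the approach described above: `string.replace(pattern, "")` replaces every match, including the string literals matched by the first two alternatives, so those would be deleted as well. A minimal sketch of one way around that, assuming a replacer callback that only blanks a match when the comment group captured something (the helper name `stripComments` and the sample input are made up for illustration):

```javascript
// Pattern copied from the question.
const pattern = /\"[^\"]*\"|'[^']*'|(\/\/.*$|\/\*[^\*]*\*\/)/mg;

// Hypothetical helper: keep string-literal matches, drop comment matches.
function stripComments(source) {
  return source.replace(pattern, (match, comment) =>
    comment === undefined ? match : ""
  );
}

const sample =
  'location.href = "http://www.domain.com"; // go home\nvar a = 1; /* note */';

// Plain "" replacement: the string literal is removed along with the comments.
const naive = sample.replace(pattern, "");

// Callback replacement: the string literal survives, the comments do not.
const guarded = stripComments(sample);

console.log(JSON.stringify(naive));
console.log(JSON.stringify(guarded));
```

The callback receives the whole match first and the capture group second; the group is `undefined` whenever one of the string-literal alternatives matched.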

I've tested the expression with a couple of js-files and it seems to be working. My question is if there are other patterns that I should be looking for or if there are any other things to consider (for example if there is limited support for regular expressions or alternative implementation in some browsers that need to be considered).

instantMartin
  • *"I'm trying to construct a regular expression to strip all comments from javascript code."* You can't, it's not a problem regular expressions can solve on their own. You can get *close*, but there **will** be situations where it will go wrong, possibly in a destructive way (e.g., removing code). – T.J. Crowder Mar 10 '15 at 12:17
  • Have you got any examples of problem situations that could occur? And any suggestions on what to use in combination or instead to strip comments? – instantMartin Mar 10 '15 at 12:54
  • I guess T.J. means the issues that may be caused by `'`, `\'` (doesn't end a string), `\\` (`\\'` does end a string, `\\\'` doesn't), `'..."...'` (here `"` doesn't begin or end a string) and all combinations of the `'`, `"` and `\` symbols. So in fact, for each line one has to parse string literals first (or perhaps at the same time as the comments) and then remove only the comments that are not actually part of strings. – YakovL Mar 24 '16 at 15:59
  • Specifically regarding comments in html (which covers JS comments). Could be of help: https://stackoverflow.com/a/64617472/3799617 – justFatLard Oct 31 '20 at 07:02
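
To make the concerns in the comments above concrete, here is a small sketch that runs the pattern from the question against two made-up inputs: a string containing an escaped quote, and a block comment containing a `*`. The replacer callback is an assumption (it keeps string-literal matches and drops comment matches); the inputs are illustrative only.

```javascript
// Pattern copied from the question.
const pattern = /\"[^\"]*\"|'[^']*'|(\/\/.*$|\/\*[^\*]*\*\/)/mg;

// Assumed helper: drop a match only when the comment group captured it.
const strip = (source) =>
  source.replace(pattern, (match, comment) =>
    comment === undefined ? match : ""
  );

// Case 1: the escaped quote ends the "string" early as far as the pattern
// is concerned, so the rest of the literal is treated as a // comment.
const escaped = 'var s = "a \\" // b";';
const escapedResult = strip(escaped);

// Case 2: [^\*]* cannot cross the inner '*', so the comment is never matched.
const starred = "x = 1; /* a * b */ y = 2;";
const starredResult = strip(starred);

console.log(escaped, "=>", escapedResult);
console.log(starred, "=>", starredResult);
```

The first case is destructive (code is removed), while the second merely leaves a comment in place.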

5 Answers

2

Use a C/C++ style comment stripper.
The regex below does the following:

  • Strips both /**/ and // styles
  • Handles line continuation style
  • Preserves formatting

There are two forms of the regex to do format preservation:

  1. Horizontal whitespace \h and newline \n construct
  2. Space & tab [ \t] and \r?\n construct

The flags are multiline and global.
The replacement is capture group 2, $2 or \2.

Form 1 (uses \h, so it is not valid in JavaScript):

 raw:  ((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\\s]*)
 delimited:  /((?:(?:^\h*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:\h*\n(?=\h*(?:\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|\/\*|\/\/))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^\/"'\\\s]*)/mg     

Form 2:

 raw:   ((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
 delimited:  /((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/mg
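
As a rough usage sketch for JavaScript, assuming the delimited Form 2 pattern above is pasted in as a regex literal and `"$2"` is used as the replacement so that non-comment text is written back unchanged (the helper name `stripWithForm2` and the sample input are hypothetical):

```javascript
// Form 2 (delimited), copied from the answer.
const form2 = /((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/mg;

// Hypothetical helper: group 1 (comments) is dropped, group 2 is re-emitted.
function stripWithForm2(source) {
  return source.replace(form2, "$2");
}

// Made-up sample: a URL inside a string, a // comment and a /* */ comment.
const sample =
  'var url = "http://x.y"; // trailing\nvar n = 1; /* block */\n';

const stripped = stripWithForm2(sample);
console.log(stripped);
```

When a match comes from the comment group, group 2 did not participate, so `$2` expands to an empty string and the comment disappears.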

Expanded version of Form 2:

 (                                # (1 start), Comments 
        (?:
             (?: ^ [ \t]* )?                  # <- To preserve formatting
             (?:
                  /\*                              # Start /* .. */ comment
                  [^*]* \*+
                  (?: [^/*] [^*]* \*+ )*
                  /                                # End /* .. */ comment
                  (?:                              # <- To preserve formatting 
                       [ \t]* \r? \n                                      
                       (?=
                            [ \t]*                  
                            (?: \r? \n | /\* | // )
                       )
                  )?
               |  
                  //                               # Start // comment
                  (?:                              # Possible line-continuation
                       [^\\] 
                    |  \\ 
                       (?: \r? \n )?
                  )*?
                  (?:                              # End // comment
                       \r? \n                               
                       (?=                              # <- To preserve formatting
                            [ \t]*                          
                            (?: \r? \n | /\* | // )
                       )
                    |  (?= \r? \n )
                  )
             )
        )+                               # Grab multiple comment blocks if need be
   )                                # (1 end)

|                                 ## OR

   (                                # (2 start), Non - comments 
        "
        (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
        "
     |  '
        (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
        ' 
     |  (?: \r? \n | [\S\s] )            # Linebreak or Any other char
        [^/"'\\\s]*                      # Chars which don't start a comment, string, escape,
                                         # or line continuation (escape + newline)
   )                                # (2 end)
  • Perfect. Thanks a lot. This covers all the stuff I was concerned about, and some more in addition with the clauses to preserve formatting. I am a bit concerned about running time, so I might take out the format-preserving parts to speed it up (since preserving formatting is not a priority). I will also probably use a simpler expression (such as the one I originally posted or something even simpler) to check for the existence of comments before running this (so that files/sections with no comments in them can be skipped). You have also definitely inspired me to finally get a regexp editor :-) – instantMartin Mar 11 '15 at 07:23
  • @MartinÖstlund - I don't think the format-preserving constructs slow down performance at all, since they only act on comments. –  Mar 11 '15 at 14:50
  • You are perfectly right @sln about not slowing down execution. My mistake - I misread the regular expression. – instantMartin Mar 12 '15 at 06:59
  • I think I found one case (there are more, I'm almost sure), where this is broken: https://gist.github.com/davidhq/1ca7112f589fb6791a317cd40310103e ... can someone confirm? – davidhq Dec 26 '19 at 07:24
  • @davidhq - Re1 there is the Form #1 regex and is meant for engines that support horizontal whitespace and is therefore invalid in JS. So, why do you keep using it in JS code and asking people to test it? Re2 is the regex meant to use with JS, and is Form #2 `/((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/mg` and you can play around with it here https://regex101.com/r/Crs1bX/1. –  Dec 27 '19 at 10:25
  • @davidhq - Thinking a case exists where this is broken is different than showing proof that it is broken. It remains one of the most perfect regex I've ever seen. –  Dec 27 '19 at 10:28
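
For anyone wanting a concrete input to test against, one case that appears to trip up Form 2 is a regex literal containing a quote character, since the pattern has no notion of regex literals. This is a sketch with a made-up input, not a claim about the code in the linked gist:

```javascript
// Form 2 (delimited), copied from the answer.
const form2 = /((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/mg;

// Made-up input: the '"' inside the regex literal /"/ is taken as the start
// of a string, which then "ends" at the opening quote of the real string.
const source = 'var re = /"/; var s = "//x";\n';

const stripped = source.replace(form2, "$2");

console.log(JSON.stringify(source));
console.log(JSON.stringify(stripped));
```

In this input there is no comment at all, yet the contents of the second string literal are treated as a `//` comment and removed.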
2
import prettier from 'prettier';

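// Formats the code with Prettier while dropping every comment from the AST.
// Note: passing a function as `parser` relies on Prettier's custom parser API,
// which Prettier 1.x/2.x accept; newer major versions may not support it.
function decomment(jsCodeStr) {
  const options = { printWidth: 160, singleQuote: true, trailingComma: 'none' };

  // Parse with the built-in babel parser, then delete the comment list
  // so the printer has no comments left to emit.
  options.parser = (text, { babel }) => {
    const ast = babel(text);
    delete ast.comments;
    return ast;
  };

  return prettier.format(jsCodeStr, options);
}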

Get prettier from https://github.com/prettier/prettier

davidhq
  • I tried every solution with regexes to see if some work reasonably well... none did. This is the only accurate way to strip comments.... or some other approach using ASTs (abstract syntax trees). – davidhq Dec 02 '19 at 22:27
  • How many thousands of lines of code are in that module? `I tried every solution with regexes to see if some work reasonably well... none did` Probably not the accepted regex solution here, which works better than anything out there for deleting comments: it strips both /**/ and // styles, handles line continuation, preserves formatting, and works very well. –  Dec 07 '19 at 16:45
  • There is just enough code in that module... And since stripping comments 100% correctly with regexes without lexical parsing is *impossible* in JavaScript, there is a very well deserved place for this kind of approach... in my opinion this is the *only correct* approach in modern JavaScript... otherwise you sooner or later get tripped over by total catastrophe when trying to parse with regexes. Cheers :) – davidhq Dec 10 '19 at 12:34
  • `And since stripping comments 100% correctly with regexes without lexical parsing is impossible in JavaScript,` This is incorrect, as that regex works perfectly in all cases. It is ridiculous to use a language parser, since in C/C++, as in most other languages, the delimiters are quotes and comments. So comment parsing is secondary to quote parsing. If you can find a situation where this guy's regex doesn't work flawlessly, please let me know. –  Dec 10 '19 at 20:30
  • Can you try this code and tell me if this regex performed correctly for you? https://gist.github.com/davidhq/1ca7112f589fb6791a317cd40310103e – davidhq Dec 18 '19 at 21:04
  • I can't try the code; I'm not a member of GitHub. However, I can tell you the wrong regex was used on that sample there. That regex was _Form #1_, which uses the horizontal whitespace construct `\h`. The one for JavaScript is _Form #2_ `/((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/mg` and you can play around with it here https://regex101.com/r/YHTVee/1 –  Dec 19 '19 at 23:26
  • If preserving formatting isn't something you need, you can use the bare bones regex `/(\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n|$))|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^\/"'\\]*)/` If it's something in between, let me know and if it's something I can do in 5 minutes, I'll accommodate. If longer, I charge. –  Dec 19 '19 at 23:28
  • Being a _prettier_ fan, love this solution. Thanks! – bvj Mar 25 '20 at 02:52
0

Look at this code. Although this is for PHP, I think the pattern is right. You could adapt the pattern for JavaScript.

https://gist.github.com/timw4mail/1338288

El cero
  • The link leads to an HTML minifier - I'm looking to strip comments from JavaScript only. However, I thank you for the tip, since minifiers are probably a good place to start looking. After all, JavaScript minifiers do strip away comments. – instantMartin Mar 10 '15 at 14:24
0

It is possible to do this in pure JavaScript without regular expressions, but with some limitations. I implemented something on the fly for you (25 min). The method parses the source file line by line. The result is correct if your JS file is valid and you do not hit one of 3 exceptions.

Find the implementation here: http://jsfiddle.net/ch14em6w/

Here is the key part of the code:

//parse file input
function displayFileLineByLine(contents)
{
    var lines = contents.split('\n');
    var element = document.getElementById('file-content');
    var output = '';
    for(var line = 0; line < lines.length; line++){

        var normedline = stripOut(lines[line]);
        if (normedline.length > 0 )
        {
            output += normedline;
        }
    }
    element.innerHTML = output;  
}
// global-scope flag: true while a '/*' comment is still open
var GlobalComentOpen = false;

//recursive method that removes comments from a single line
function stripOut(stringline, step){
        //index of block comment start
        var igcS = stringline.indexOf('/*');
        //index of block comment end
        var igcE = stringline.indexOf('*/');
        //index of inline comment position
        var iicP = stringline.indexOf('//');
        var gorecursive = false;
        if (igcS != -1)
        {
            gorecursive = true;
            if (igcS < igcE) { 
                stringline = stringline.replace(stringline.slice(igcS, igcE +2), "");
            }
            else if (igcS > igcE && GlobalComentOpen) {
                stringline = stringline.replace(stringline.slice(0, igcE +2), "");
                igcS = stringline.indexOf('/*');
                stringline = stringline.replace(stringline.slice(igcS, stringline.length), "");
            }
            else if (igcE == -1){
                GlobalComentOpen = true;
                stringline = stringline.replace(stringline.slice(igcS, stringline.length), "");
            }
            else
            {
                console.log('incorrect format');
            }

        }
        if (!gorecursive && igcE != -1)
        {
            gorecursive = true;
            GlobalComentOpen = false;
            stringline = stringline.replace(stringline.slice(0, igcE +2), "");
        }
        if (!gorecursive && iicP != -1)
        {
            gorecursive = true;
            stringline = stringline.replace(stringline.slice(iicP, stringline.length), "");
        }
        if (!gorecursive && GlobalComentOpen && step == undefined)
        {
            return "";
        }
        if (gorecursive)
        {
            // step++ would assign the old value back, so add 1 explicitly
            step = step == undefined ? 0 : step + 1;
            return stripOut(stringline, step);
        }
        return stringline;
}
SilentTremor
  • Thanks for the extensive answer with a working solution. I often prefer a non-regexp solution since they are often more transparent. Regexp solutions are compact, but their functioning can be somewhat opaque and hard to predict. However, in your proposed solution I can't see that you address the problem of comment markers embedded in string literals, such as "http://www.domain.com", which I mention in the question (or am I missing something in my interpretation of the code?). – instantMartin Mar 11 '15 at 06:43
  • You caught me with your question; that is why I implemented this. You are right: the one remaining problem is comment markers that appear inside string literals. I know how to implement these exceptions, but from reading everything here I understand that's not the point. So your original question is legitimate, on top of what you said there. There are multiple cases where comments may appear, and your regexp/JavaScript implementation must prioritize the comment delimiters in order of appearance: priority 1: "/*", priority 2: "*/", priority 3: "//" – SilentTremor Mar 11 '15 at 07:12
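
The string-literal case discussed in the comments above can be handled without regular expressions by scanning character by character and copying quoted strings through untouched. This is a minimal sketch under stated assumptions, not the code from the fiddle: the function name `stripCommentsByScanning` is hypothetical, and regex literals and template literals are deliberately not handled.

```javascript
// Hypothetical helper: single pass over the source, tracking string literals.
function stripCommentsByScanning(source) {
  let out = "";
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    const next = source[i + 1];
    if (ch === '"' || ch === "'") {
      // Copy the whole string literal, skipping over backslash escapes.
      let j = i + 1;
      while (j < source.length && source[j] !== ch) {
        j += source[j] === "\\" ? 2 : 1;
      }
      out += source.slice(i, j + 1);
      i = j + 1;
    } else if (ch === "/" && next === "/") {
      // Line comment: skip up to (but not including) the newline.
      while (i < source.length && source[i] !== "\n") i++;
    } else if (ch === "/" && next === "*") {
      // Block comment: skip past the closing */, or to the end if unterminated.
      const end = source.indexOf("*/", i + 2);
      i = end === -1 ? source.length : end + 2;
    } else {
      out += ch;
      i++;
    }
  }
  return out;
}

// Made-up input mixing URLs, comment markers inside strings and real comments.
const input =
  'var u = "http://a.b"; // c\nvar s = \'x /* y */\'; /* z */ var t = "q \\" // r";';
const output = stripCommentsByScanning(input);
console.log(output);
```

Because the scanner has no global flag, a block comment spanning several lines is handled in the same pass as everything else.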
0

Update: This is C# code, and I think this is not the right place for it. Anyway, here it is.

I use the following class with good results.

Not tested with comments inside strings, e.g.

a = "hi /* comment */ there";
a = "hi there // ";

The class only treats // as a comment at the beginning of a line or after at least one whitespace character, so the following lines are left untouched.

a = "hi// there";
a = "hi//there";

Here is the code


    using System;
    using System.IO;
    using System.Text.RegularExpressions;

    static public class CommentRemover
    {
        static readonly RegexOptions ROptions = RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Multiline; 
 
        const string SSingleLineComments = @"\s//.*";       // comments with // in the beginning of a line or after a space
        const string SMultiLineComments = @"/\*[\s\S]*?\*/";
        const string SCommentPattern = SSingleLineComments + "|" + SMultiLineComments;  
        const string SEmptyLinePattern = @"^\s+$[\r\n]*";

        static Regex CommentRegex;
        static Regex EmptyLineRegex; 

        static public string RemoveEmptyLines(string Text)
        {
            if (EmptyLineRegex == null)
                EmptyLineRegex = new Regex(SEmptyLinePattern, ROptions);

            return EmptyLineRegex.Replace(Text, string.Empty); 
        }
        static public string RemoveComments(string Text)
        {
            if (CommentRegex == null)
                CommentRegex = new Regex(SCommentPattern, ROptions);
            return CommentRegex.Replace(Text, string.Empty);
        }
        static public string RemoveComments(string Text, string Pattern)
        {
            Regex R = new Regex(Pattern, ROptions);
            return R.Replace(Text, string.Empty);
        }
 
        static public string Execute(string Text)
        {
            Text = RemoveComments(Text);
            Text = RemoveEmptyLines(Text);
            return Text;
        }
        static public void ExecuteFile(string SourceFilePth, string DestFilePath)
        {
            string DestFolder = Path.GetDirectoryName(DestFilePath);
            Directory.CreateDirectory(DestFolder);

            string Text = File.ReadAllText(SourceFilePth);
            Text = Execute(Text);
            File.WriteAllText(DestFilePath, Text);
        }
        static public void ExecuteFolder(string FilePattern, string SourcePath, string DestPath, bool Recursive = true)
        {
            string[] FilePathList = Directory.GetFiles(SourcePath, FilePattern, Recursive? SearchOption.AllDirectories: SearchOption.TopDirectoryOnly);
            string FileName;
            string DestFilePath;
            foreach (string SourceFilePath in FilePathList)
            {
                FileName = Path.GetFileName(SourceFilePath);
                DestFilePath = Path.Combine(DestPath, FileName);
                ExecuteFile(SourceFilePath, DestFilePath);
            }
        }
        static public void ExecuteCommandLine(string[] Args)
        {

            void DisplayCommandLineHelp()
            {
                string Text = @"
-h, --help          Flag. Displays this message. E.g. -h
-s, --source        Source folder when the -p is present. Else source filename. E.g. -s C:\app\js or -s C:\app\js\main.js
-d, --dest          Dest folder when the -p is present. Else dest filename. E.g. -d C:\app\js\out or -d C:\app\js\out\main.js
-p, --pattern       The pattern to use when finding files. E.g. -p *.js
-r, --recursive     Flag. Search in sub-folders too. E.g. -r

EXAMPLE
    CommentStripper -s .\Source -d .\Dest -p *.js
";

                Console.WriteLine(Text.Trim());
            }

            string Pattern = null;
            
            string Source = null;
            string Dest = null;

            bool Recursive = false;
            bool Help = false;
 
            string Arg;
            if (Args.Length > 0)
            {
                try
                {
                    for (int i = 0; i < Args.Length; i++)
                    {
                        Arg = Args[i].ToLower();

                        switch (Arg)
                        {
                            case "-s":
                            case "--source":
                                Source = Args[i + 1].Trim();
                                break;
                            case "-d":
                            case "--dest":
                                Dest = Args[i + 1].Trim();
                                break;
                            case "-p":
                            case "--pattern":
                                Pattern = Args[i + 1].Trim();
                                break;
                            case "-r":
                            case "--recursive":
                                Recursive = true;
                                break;
                            case "-h":
                            case "--help":
                                Help = true;
                                break;
                        }

                    }


                    if (Help)
                    {
                        DisplayCommandLineHelp();                        
                    }
                    else
                    {
                        if (!string.IsNullOrWhiteSpace(Pattern))
                        {
                            ExecuteFolder(Pattern, Source, Dest, Recursive);
                        }
                        else
                        {
                            ExecuteFile(Source, Dest);
                        }
 
                    }

                    // Console.ReadLine();
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.Message);
                    Console.WriteLine();
                    DisplayCommandLineHelp();
                }
            }



        }
    }
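
Since the question is about JavaScript, the string cases mentioned at the top of this answer can be checked with a JavaScript port of the two patterns. The port is an assumption on my part (`.` differs slightly between the .NET and JavaScript engines, and the `IgnoreCase`/`Multiline` options do not affect these patterns), so treat it as a rough sketch:

```javascript
// Assumed JavaScript equivalent of SSingleLineComments | SMultiLineComments.
const commentPattern = /\s\/\/.*|\/\*[\s\S]*?\*\//g;

// The four example lines from the answer text.
const cases = [
  'a = "hi// there";',
  'a = "hi//there";',
  'a = "hi there // ";',
  'a = "hi /* comment */ there";',
];

// Apply the pattern to each line and keep the results for inspection.
const results = cases.map((line) => line.replace(commentPattern, ""));

cases.forEach((line, index) => {
  console.log(JSON.stringify(line), "->", JSON.stringify(results[index]));
});
```

With this port, the first two lines come back unchanged, while the last two lose part of their string literal, which matches the "not tested" warning above.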

Good luck.

Teo Bebekis