48

I'm working on a routine to strip block or line comments from some C# code. I have looked at the other examples on the site, but haven't found the exact answer that I'm looking for.

I can match block comments (/* comment */) in their entirety using this regular expression with RegexOptions.Singleline:

(/\*[\w\W]*\*/)

And I can match line comments (// comment) in their entirety using this regular expression with RegexOptions.Multiline:

(//((?!\*/).)*)(?!\*/)[^\r\n]

Note: I'm using [^\r\n] instead of $ because $ is including \r in the match, too.

However, this doesn't quite work the way I want it to.

Here is my test code that I'm matching against:

// remove whole line comments
bool broken = false; // remove partial line comments
if (broken == true)
{
    return "BROKEN";
}
/* remove block comments
else
{
    return "FIXED";
} // do not remove nested comments */ bool working = !broken;
return "NO COMMENT";

The block expression matches

/* remove block comments
else
{
    return "FIXED";
} // do not remove nested comments */

which is fine and good, but the line expression matches

// remove whole line comments
// remove partial line comments

and

// do not remove nested comments

Also, if I do not have the */ negative lookahead in the line expression twice, it matches

// do not remove nested comments *

which I really don't want.

What I want is an expression that will match characters, starting with //, to the end of line, but does not contain */ between the // and end of line.

Also, just to satisfy my curiosity, can anyone explain why I need the lookahead twice? (//((?!\*/).)*)[^\r\n] and (//(.)*)(?!\*/)[^\r\n] will both include the *, but (//((?!\*/).)*)(?!\*/)[^\r\n] and (//((?!\*/).)*(?!\*/))[^\r\n] won't.

Welton v3.62
  • 4
    Have you also considered the case where `string foo = "http://stackoverflow.com;"` – Anthony Pegram Aug 19 '10 at 17:15
  • 1
    Your `/* ... */` pattern overmatches due to greediness, e.g. consider `/* comment1 */ not-a-comment! /* comment2 */`. – polygenelubricants Aug 19 '10 at 17:20
  • You might consider using a parser for C# instead: http://stackoverflow.com/questions/81406/parser-for-c – TrueWill Aug 19 '10 at 17:22
  • LOL... for this problem, using a full-blown C# parser is absolute overkill. – Timwi Aug 19 '10 at 20:53
  • 1
    An absolutely INVALUABLE tool for designing, understanding and testing RegExs is expresso: http://www.ultrapico.com/Expresso.htm . – eidylon Aug 19 '10 at 21:12
  • I'm surprised someone hasn't been able to simply conjure up the regex used by Visual Studio itself, or Resharper, or any other number of power tools that have to parse out and identify comments in code? – Alain May 12 '14 at 18:49

6 Answers

101

Both of your regular expressions (for block and line comments) have bugs. If you want I can describe the bugs, but I felt it’s perhaps more productive if I write new ones, especially because I’m intending to write a single one that matches both.

The thing is, every time you have /* and // and literal strings “interfering” with each other, it is always the one that starts first that takes precedence. That’s very convenient because that’s exactly how regular expressions work: find the first match first.

So let’s define a regular expression that matches each of those four tokens:

var blockComments = @"/\*(.*?)\*/";
var lineComments = @"//(.*?)\r?\n";
var strings = @"""((\\[^\n]|[^""\n])*)""";
var verbatimStrings = @"@(""[^""]*"")+";

To answer the question in the title (strip comments), we need to:

  • Replace the block comments with nothing
  • Replace the line comments with a newline (because the regex eats the newline)
  • Keep the literal strings where they are.

Regex.Replace can do this easily using a MatchEvaluator function:

string noComments = Regex.Replace(input,
    blockComments + "|" + lineComments + "|" + strings + "|" + verbatimStrings,
    me => {
        if (me.Value.StartsWith("/*") || me.Value.StartsWith("//"))
            return me.Value.StartsWith("//") ? Environment.NewLine : "";
        // Keep the literal strings
        return me.Value;
    },
    RegexOptions.Singleline);

I ran this code on all the examples that Holystream provided and various other cases that I could think of, and it works like a charm. If you can provide an example where it fails, I am happy to adjust the code for you.
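
For readers who want something to paste and run, here is a minimal sketch that assembles the pieces above into one program. The names `CommentStripper` and `Strip` are made up for illustration. It also makes one deliberate change that is an assumption of this sketch, not part of the code above: the line-comment pattern stops before the line break instead of consuming it, so the replacement for a line comment is the empty string and a comment on the last line (with no trailing newline) is removed as well.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentStripper
{
    // Same token patterns as above, except LineComments (assumption of this
    // sketch): it stops before the line break rather than consuming it.
    const string BlockComments = @"/\*(.*?)\*/";
    const string LineComments = @"//[^\r\n]*";
    const string Strings = @"""((\\[^\n]|[^""\n])*)""";
    const string VerbatimStrings = @"@(""[^""]*"")+";

    public static string Strip(string input)
    {
        return Regex.Replace(input,
            BlockComments + "|" + LineComments + "|" + Strings + "|" + VerbatimStrings,
            m => (m.Value.StartsWith("/*", StringComparison.Ordinal) ||
                  m.Value.StartsWith("//", StringComparison.Ordinal))
                ? ""        // comment: drop it (the line break was never matched)
                : m.Value,  // string literal: keep it untouched
            RegexOptions.Singleline);
    }

    static void Main()
    {
        // The "//" inside the string literal survives; the trailing comment does not.
        Console.WriteLine(Strip("var url = \"http://stackoverflow.com\"; // site"));
        Console.WriteLine(Strip("a = 1; /* one */ b = 2; // two"));
    }
}
```

Because the line break is left in place, whitespace that preceded a comment stays on the line; trimming it is a separate formatting step.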

Timwi
  • I do not need to extract the comments, just strip them out of my source script. I tried your code, and it worked well. Ideally, I'd like to remove any line completely, if the line only contained comments. e.g. no blank lines left where a comment was. However, this is not a requirement, just a formatting preference. Thanks. – Welton v3.62 Aug 20 '10 at 16:57
  • 3
    @Welton: Well, you could just run `Regex.Replace(noComments, @"^(\s*\r?\n){2,}", Environment.NewLine, RegexOptions.Multiline)` on the result afterwards, but this will remove blank double-lines that *didn’t* have a comment in them too. – Timwi Aug 20 '10 at 17:14
  • I saw you tested this: http://csharp.pastebin.com/0aqBdFE5 but when you have something like this: string input = "1 + 2 //comments"; it fails it gives you as a result "1 + 2 \r\n" because of the Environment.Newline in the ternary operator – juFo Jan 14 '11 at 14:03
  • @juFo: When I tried your input, it failed differently: it actually leaves the comment in. (Which is expected, because the regex requires a newline after it.) I’ve fixed this: http://csharp.pastebin.com/CnH162Gc – Timwi Jan 14 '11 at 16:00
  • does not work when comments directly follow code: "MY_ENUM_CONSTANT=0//comment" – stackPusher Nov 24 '14 at 18:21
  • @stackPusher: The reason that doesn’t work for you is because you don’t have a newline after the comment. – Timwi Dec 07 '14 at 13:32
  • 1
    Very elegant solution. Based on your solution I made something similar for removing SQL comments here: http://stackoverflow.com/a/33947706/3606250 – drizin Nov 26 '15 at 22:22
  • Almost works for me. I just want to check the line comment so that I can allow `eval(req.responseText + '\r\n\r\n//@ sourceURL= ' + file);`. It should start with any whitespace any number of times followed by `//` but I can't get it to work. Any help? – Asken Mar 21 '16 at 16:33
  • "Keep the literal strings where they are." if this is the case, why find them in the first place at all? I don't understand this part. – Saeed Neamati Aug 27 '16 at 10:34
  • @SaeedNeamati: You have to find them in order to keep them wholesale, no matter what they contain. If you don’t do this, the regex will match occurrences of `//` and `/*` inside of string literals, which it shouldn’t. – Timwi Aug 29 '16 at 14:41
  • @Timwi -is http://pastebin.com/CnH162Gc a fixed version, that should be used instead of the snippet from the answer or it's for special case only? – Michael Freidgeim Dec 28 '16 at 07:44
  • 1
    As @Holystream described, this regex will remove `url`s. – Mazdak Shojaie Sep 01 '17 at 12:28
  • @mazdak Can you give an example? URLs aren't a syntax element in C#, so I don't know what you mean. The regex does correctly handle both comments and string literals, including those that contain URLs. – Timwi Sep 14 '17 at 16:33
  • nice solution, but I had to swap out the [http://] or [https://] protocol headers first, then run your code and swap them back in place. – Ji_in_coding Dec 29 '19 at 14:01
  • Is there a way to also remove region lines? Like when you start a region with "#region something" and end it with "#endregion". – Soulss Mar 05 '22 at 10:59
9

You could tokenize the code with an expression like:

@(?:"[^"]*")+|"(?:[^"\n\\]+|\\.)*"|'(?:[^'\n\\]+|\\.)*'|//.*|/\*(?s:.*?)\*/

It would also match some invalid escapes/structures (e.g. 'foo'), but will probably match all valid tokens of interest (unless I forgot something), thus working well for valid code.
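
To see what "tokenize" means here, the following sketch simply lists the tokens the expression finds in a small snippet. The class name `TokenDemo` and the sample input are made up for illustration; the pattern is the one above, written as a C# verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

static class TokenDemo
{
    // The tokenizing expression from above as a C# verbatim string.
    public const string Tokens =
        @"@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*'|//.*|/\*(?s:.*?)\*/";

    static void Main()
    {
        // Two lines of sample code: a char literal holding a quote, a line
        // comment containing quotes, a verbatim string and a block comment.
        var code = "char q = '\"'; // a \"quoted\" word\nvar s = @\"C:\\dir\"; /* done */";

        foreach (Match m in Regex.Matches(code, Tokens))
            Console.WriteLine("[" + m.Value + "]");
        // Prints:
        // ['"']
        // [// a "quoted" word]
        // [@"C:\dir"]
        // [/* done */]
    }
}
```

The char-literal alternative matters in this input: without it, the quote inside `'"'` would be taken as the start of a string literal and the rest of the line would be tokenized incorrectly.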

Using it in a replace and capturing the parts you want to keep will give you the desired result. For example:

static string StripComments(string code)
{
    var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
    return Regex.Replace(code, re, "$1");
}

Example app:

using System;
using System.Text.RegularExpressions;

namespace Regex01
{
    class Program
    {
        static string StripComments(string code)
        {
            var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
            return Regex.Replace(code, re, "$1");
        }

        static void Main(string[] args)
        {
            var input = "hello /* world */ oh \" '\\\" // ha/*i*/\" and // bai";
            Console.WriteLine(input);

            var noComments = StripComments(input);
            Console.WriteLine(noComments);
        }
    }
}

Output:

hello /* world */ oh " '\" // ha/*i*/" and // bai
hello  oh " '\" // ha/*i*/" and
Qtax
  • 33,241
  • 9
  • 83
  • 121
  • 1
    Wait, why did I answer this 2 years after it has been asked, answered and accepted? Giving practically the same answer? How did it even show up on my list? There must have been some bug or something, I don't do such things. (lol) – Qtax Mar 09 '12 at 14:46
  • I found this is the perfect answer for me(C#), however the regex doesn't work on javascript. – Gongdo Gong Jul 13 '15 at 08:28
8

Before you implement this, you will need to create test cases for it:

  1. Simple comments /* */, //, ///
  2. Multi line comments /* This\nis\na\ntest*/
  3. Comments after line of code var a = "apple"; // test or /* test */
  4. Comments within comments /* This // is a test */, or // This /* is a test */
  5. Simple non-comments that look like comments and appear in quotes: var comment = "/* This is a test*/", or var url = "http://stackoverflow.com";
  6. Complex non-comments that look like comments: var abc = @" this /* \n is a comment in quote\n*/", with or without spaces between " and /* or */ and "

There are probably more cases out there.

Once you have all of them, then you can create a parsing rule for each of them, or group some of them.

Solving this with regular expressions alone will probably be very hard and error-prone, hard to test, and hard for you and other programmers to maintain.
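
As a rough sketch of how such test cases could be written down, here is a tiny table-driven harness. Everything in it is hypothetical: `NaiveStrip` is a deliberately simplistic stripper that ignores string literals, used only so the harness has something to run against, and the expected values are my reading of the cases in the list above.

```csharp
using System;
using System.Text.RegularExpressions;

static class StripCommentsTests
{
    // Hypothetical function under test. It knows nothing about string
    // literals, so it is expected to fail the "looks like a comment" case.
    public static string NaiveStrip(string code) =>
        Regex.Replace(code, @"/\*.*?\*/|//[^\r\n]*", "", RegexOptions.Singleline);

    static void Main()
    {
        var cases = new (string Input, string Expected)[]
        {
            ("/* a */x", "x"),                                      // 1. simple comment
            ("/* This\nis\na\ntest*/x", "x"),                       // 2. multi-line comment
            ("var a = \"apple\"; // test", "var a = \"apple\"; "),  // 3. comment after code
            ("/* This // is a test */x", "x"),                      // 4. comment within comment
            ("var url = \"http://stackoverflow.com\";",
             "var url = \"http://stackoverflow.com\";"),            // 5. not a comment (in quotes)
        };

        foreach (var (input, expected) in cases)
        {
            var actual = NaiveStrip(input);
            Console.WriteLine((actual == expected ? "PASS" : "FAIL") + ": "
                + input.Replace("\n", "\\n"));
        }
    }
}
```

With the naive stripper, the first four cases pass and the URL case fails, which is the kind of gap the list is meant to expose. Swapping in a different implementation only requires changing the call inside the loop.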

Holystream
  • Holystream, I do have some of the test cases you mentioned, but not all. My sample above covers 1 (partially), 2, 3, and 4. 5 and 6 are good points which I had not considered. – Welton v3.62 Aug 19 '10 at 17:50
  • Holystream, I believe you are making it out to be harder than it is. Matching the two comment styles is really easy with regular expressions — in fact, the C# (and C++) lexer probably does that. This is in contrast to something like HTML, which is hard to match with regexes because HTML tags can nest and because they come in too many different varieties. – Timwi Aug 19 '10 at 17:58
  • @Timwi: Actually, .NET uses a lexical analyzer. The comment symbols are just tokens. http://en.wikipedia.org/wiki/Lexical_analysis – chilltemp Aug 19 '10 at 18:03
  • @Timwi: Can you please give me an example that works with the cases above? I am very interested to know a regular expression that pass those test cases. /\*(.*?)\*/|//.*?\r?\n failed a lot of those test cases. – Holystream Aug 19 '10 at 18:17
  • @Holystream: Have you tried the regex in my answer? You seem to have removed two backslashes from it. If my regex fails, please provide a specific example in which it fails, and comment on my answer instead of this one. Thanks! – Timwi Aug 19 '10 at 20:36
  • @chilltemp: That is what I said. “lexer” is short for “lexical analyzer”. – Timwi Aug 19 '10 at 20:38
  • @Timwi: Thanks for the edited example. I would comment your post, but I don't have enough reputation points yet :) It seems to be working better, though it still failed on multiple line comments such as /* Line 1 * Line 2 * Line 3*/ or var url = "http://stackoverflow.com"; // Stackoverflow website. – Holystream Aug 19 '10 at 21:14
  • @Holystream: I tried both examples and they work fine for me. [Here is the full code I’ve used for you to play with.](http://csharp.pastebin.com/0aqBdFE5) – Timwi Aug 19 '10 at 21:37
  • @Timwi: +1. Thanks, very educational. – Holystream Aug 19 '10 at 22:52
2

I found this one at http://gskinner.com/RegExr/ (named ".Net Comments aspx")

(//[\t|\s|\w|\d|\.]*[\r\n|\n])|([\s|\t]*/\*[\t|\s|\w|\W|\d|\.|\r|\n]*\*/)|(\<[!%][ \r\n\t]*(--([^\-]|[\r\n]|-[^\-])*--[ \r\n\t%]*)\>)

When I test it, it seems to remove all // comments and /* comments */ as it should, leaving those inside quotes behind.

I haven't tested it a lot, but it seems to work pretty well (even though it's a horrific, monstrous line of regex).

einord
  • Ok.. after some testing I noticed that it has problems with comments containing a minus sign (-) and with multiple multi-line comments (/* comment */ not comment /* comment again*/). But if anyone cares to fix this, I think it's a pretty good solution. – einord May 15 '13 at 13:55
1

For block comments (/* ... */) you can use this expression:

/\*([^\*/])*\*/

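It also works with multi-line comments, because the negated character class matches newlines. Note, however, that it will not match a comment whose body contains a * or a /.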
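
The sketch below compares that expression with an alternative that tolerates * and / inside the comment body. The alternative is not from this answer; it is a commonly used "unrolled" block-comment pattern, and the class name `BlockCommentDemo` is made up for illustration. Neither pattern knows about string literals.

```csharp
using System;
using System.Text.RegularExpressions;

static class BlockCommentDemo
{
    // The expression from this answer.
    public static readonly Regex Original = new Regex(@"/\*([^\*/])*\*/");

    // Alternative (an assumption of this sketch, not from the answer):
    // runs of non-'*' characters, then stars, repeated until a star is
    // followed directly by '/'.
    public static readonly Regex Alternative =
        new Regex(@"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/");

    static void Main()
    {
        // A typical multi-line comment with a leading '*' on the second line.
        var input = "/* line 1\n * line 2 */ code();";

        Console.WriteLine(Original.IsMatch(input));   // False
        Console.WriteLine(
            Alternative.Match(input).Value == "/* line 1\n * line 2 */");  // True
    }
}
```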

Guy P
0

Also see my project for C# code minification: CSharp-Minifier

Aside from removing comments, spaces, and line breaks from code, it can currently also compress local variable names and perform other minifications.

Ivan Kochurkin