16

How can I remove all comments and blank lines from a C# source file? Keep in mind that there could be nested comments. Some examples:

string text = @"//not a comment"; // a comment

/* multiline
comment */ string newText = "/*not a comment*/"; // a comment

/* multiline // not a comment 
/* comment */ string anotherText = "/* not a comment */ // some text here\"// not a comment"; // a comment

The source can be much more complex than the three examples above. Can someone suggest a regex pattern or another way to solve this? I've already browsed a lot of material on the internet and couldn't find anything that works.

johannes
nenito

7 Answers

6

To remove the comments, see this answer. After that, removing empty lines is trivial.

sga101
  • @nenito, I guess I posted my answer a bit late, but it could be of interest anyway. – Qtax Feb 02 '12 at 20:29
  • We are still curious why you would want to remove the comments! (or at least I am) – comecme Feb 02 '12 at 20:39
  • 1
    @comecme: first - I'm sorry for the late answer. comments are slowing down the readability of the code, so this could be useful when you deploy your code to have some kind of filter against comments, but you could store your code with all comments on some repository(SVN, Perforce, ..) – nenito Apr 15 '13 at 12:07
6

You could use the function in this answer:

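using System.Text.RegularExpressions;

// Group 1 captures verbatim strings, regular strings, and char literals so they
// are kept via "$1"; comments match outside the group and are replaced by nothing.
static string StripComments(string code)
{
    var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
    return Regex.Replace(code, re, "$1");
}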

And then remove empty lines.
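
For the blank-line step, here is a minimal sketch. The class and method names (`BlankLines`, `RemoveBlankLines`) are made up for illustration and are not part of the linked answer. It assumes that whitespace-only lines count as blank and that normalizing line endings to `Environment.NewLine` is acceptable. It does not know about verbatim strings, so a blank line inside an `@"..."` literal is dropped as well.

```csharp
using System;
using System.Linq;

static class BlankLines
{
    // Hypothetical helper: drops lines that are empty or whitespace-only
    // and keeps the remaining lines in their original order.
    public static string RemoveBlankLines(string code)
    {
        var lines = code
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
            .Where(line => !string.IsNullOrWhiteSpace(line));

        // Line endings are normalized to the platform default here.
        return string.Join(Environment.NewLine, lines);
    }
}
```

Called as `BlankLines.RemoveBlankLines(StripComments(source))`, this runs after the comments are gone, so lines that held only a comment are removed too.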

Qtax
1

Also see my project for C# code minification: CSharp-Minifier

Besides removing comments, spaces, and line breaks from code, it can currently compress local variable names and perform other minifications.

Ivan Kochurkin
  • It is really cool stuff;) But the GUI is not convenient for usage (seems, like it was written just for author's purposes), but it is not hard to create a wrapper: – maxkoryukov Aug 16 '16 at 04:42
  • I've tested the wrapper on a small real application (two projects, about 40-50 files), and code compiles without modifications – maxkoryukov Aug 16 '16 at 04:42
  • @maxkoryukov yes, the GUI was developed for private use :) If you want, you can create a pull request with your modifications. Moreover, some issues can be resolved with the Roslyn code analyzer. – Ivan Kochurkin Aug 16 '16 at 07:49
  • I've used your lib in a rush, and currently I have just a well-working gist. Next time when I will use this app - I will send you a PR with a tiny console app, which utilizes your library;) – maxkoryukov Aug 16 '16 at 20:17
1

Unfortunately, this is really difficult to do reliably with regex without hitting edge cases. I haven't investigated very far, but you might be able to use the Visual Studio Language Services to parse comments.

Sam Greenhalgh
1

If you want to identify comments with regexes, you really need to use the regex as a tokenizer. I.e., it identifies and extracts the first thing in the string, whether that thing be a string literal, a comment, or a block of stuff that is neither string literal nor comment. Then you grab the remainder of the string and pull the next token off the beginning.

This gets you around the problems with context. If you're just trying to look for things in the middle of the string, there's no good way to identify whether a particular "comment" is inside a string literal or not -- in fact, it's hard to identify where the string literals are in the first place, because of things like \". But if you always take the first thing in the string, it's easy to say "oh, the string starts with ", so everything up to the next unescaped " is more string." Context takes care of itself.

So you would want three regexes:

  • One that identifies a comment starting at the beginning of the string (either a // or a /* comment).
  • One that identifies a string literal starting at the beginning of the string. Remember to check for both " and @" strings; each has its own edge cases.
  • One that identifies something that is neither of the above, and matches up until the first thing that could be a comment or a string literal.

Writing the actual regex patterns is left as an exercise for the reader, since it would take hours to write and test it all and I'm not willing to do that for free. (grin) But it's certainly doable, if you have a good understanding of regexes (or have a place like StackOverflow to ask specific questions when you get stuck) and are willing to write a bunch of automated tests for your code. Watch out on that last ("anything else") case, though -- you want to stop just before an @ if it's followed by a ", but not if it's an @ to escape a keyword to use as an identifier.
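
A rough sketch of the tokenizer approach described above, offered as a starting point and not as a tested implementation. The class name `CommentStripper` and the three patterns are assumptions of this sketch. It relies on .NET's `\G` anchor together with `Regex.Match(input, startat)` so that each pattern can only match at the current position. It does not handle interpolated strings with nested quotes, raw string literals (`"""`), or preprocessor directives, and it deletes comments outright, so `a/**/b` becomes `ab`.

```csharp
using System.Text;
using System.Text.RegularExpressions;

static class CommentStripper
{
    // \G pins each match to the current position, so every pattern only
    // ever looks at the first thing in the remaining input.
    static readonly Regex Comment = new Regex(@"\G(?://[^\r\n]*|/\*[\s\S]*?\*/)");

    // Verbatim strings (@"..." with "" as an escaped quote), regular strings,
    // and char literals (with backslash escapes).
    static readonly Regex Literal = new Regex(
        @"\G(?:@""(?:[^""]|"""")*""|""(?:[^""\\\r\n]|\\.)*""|'(?:[^'\\\r\n]|\\.)*')");

    // Anything else: runs of ordinary characters, an @ that does not start a
    // verbatim string, or a / that does not start a comment.
    static readonly Regex Other = new Regex(@"\G(?:[^/""'@]+|@(?!"")|/(?![/*]))");

    public static string Strip(string code)
    {
        var result = new StringBuilder();
        int pos = 0;
        while (pos < code.Length)
        {
            Match m = Comment.Match(code, pos);
            if (m.Success)
            {
                pos += m.Length;            // comment token: skip it
                continue;
            }

            m = Literal.Match(code, pos);
            if (!m.Success) m = Other.Match(code, pos);

            if (m.Success)
            {
                result.Append(m.Value);     // literal or plain code: keep it
                pos += m.Length;
            }
            else
            {
                // Unterminated literal or comment: copy one character and go on.
                result.Append(code[pos]);
                pos++;
            }
        }
        return result.ToString();
    }
}
```

Because the literal patterns are tried at every token boundary, a `//` or `/*` inside a string is consumed as part of the string and never reaches the comment pattern, which is the "context takes care of itself" point made above.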

Joe White
0

Use my project to remove most comments. https://github.com/SynAppsDevelopment/CommentRemover

It removes all full-line, ending-line, and XML Doc code comments with some limitations for complex comments explained in the readme and source. This is a C# solution with a WinForms interface.

Jowe
  • Please don't just post some tool or library as an answer. At least demonstrate [how it solves the problem](http://meta.stackoverflow.com/a/251605) in the answer itself. – Papershine Feb 20 '18 at 04:49
  • Sorry, didn't know all the guidelines. Did my edit help? – Jowe Feb 20 '18 at 05:47
0

First, you'll definitely want to use RegexOptions.Singleline when constructing your Regex instance. Right now, you are processing single lines of code.

To complement the RegexOptions.Singleline option, make sure you use the start and end string anchors (^ and $ respectively), because for the specific cases you have, you want the regular expression to apply to the entire string.
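
A small sketch of what the option changes. Note that the enum member is spelled `RegexOptions.Singleline` in .NET. The class name, method name, and pattern below are illustrative assumptions, not taken from the question. Without the option, `.` stops at a newline, so an anchored pattern cannot span a multi-line block comment.

```csharp
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    // Hypothetical helper: true when the whole input is a single /* ... */ span.
    // ^ and $ anchor the pattern to the start and end of the string.
    public static bool IsWholeBlockComment(string code, RegexOptions options)
    {
        return Regex.IsMatch(code, @"^/\*.*\*/$", options);
    }
}
```

With the input `"/* a\nb */"`, the call returns false under `RegexOptions.None` and true under `RegexOptions.Singleline`, because only in the second case can `.*` cross the line break.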

I'd also recommend breaking up the conditions and using alternation to handle smaller cases, constructing a larger regular expression from the smaller, easier-to-manage expressions.

Finally, I know this is homework, but parsing a programming language with regular expressions is an exercise in futility (it's not a practical application). Regular expressions are better suited to more highly structured data. If you find in the future that you want to do things like this, use a parser built for the language (in this case, I'd highly recommend Roslyn).

casperOne
  • Lost me with the last paragraph... I have not had trouble with implementing my C# lexer using regex, other than in stripping comments. I do feel that comments are a unique part of the process, since they do not contribute to the tokens that must be passed to the syntaxer. http://en.wikipedia.org/wiki/Regular_language – Joshua W May 18 '14 at 04:43