If you want to identify comments with regexes, you really need to use the regex as a tokenizer. I.e., it identifies and extracts the first thing in the string, whether that thing be a string literal, a comment, or a block of stuff that is neither string literal nor comment. Then you grab the remainder of the string and pull the next token off the beginning.
This gets you around the problems with context. If you're just trying to look for things in the middle of the string, there's no good way to identify whether a particular "comment" is inside a string literal or not -- in fact, it's hard to identify where the string literals are in the first place, because of things like `\"`. But if you always take the first thing in the string, it's easy to say "oh, the string starts with `"`, so everything up to the next unescaped `"` is more string." Context takes care of itself.
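To make the context problem concrete, here is a tiny illustration. It's written in Python purely for convenience (my choice, not something the question requires), and the input line is made up. Searching the middle of the string for something comment-shaped happily starts "the comment" inside the URL:

```python
import re

# A made-up line of C# source: one string literal, then one real comment.
line = 'var url = "http://example.com"; // the real comment'

# Naive approach: look anywhere in the line for something shaped like a comment.
naive = re.search(r'//.*', line).group()

# The match begins inside the string literal, not at the real comment.
print(naive)  # //example.com"; // the real comment
```

The regex has no way of knowing that the first `//` sits inside a string literal, which is exactly the missing context.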
So you would want three regexes:
- One that identifies a comment starting at the beginning of the string (either a `//` or a `/*` comment).
- One that identifies a string literal starting at the beginning of the string. Remember to check for both `"` and `@"` strings; each has its own edge cases.
- One that identifies something that is neither of the above, and matches up until the first thing that could be a comment or a string literal.
Writing the actual regex patterns is left as an exercise for the reader, since it would take hours to write and test it all and I'm not willing to do that for free. (grin) But it's certainly doable, if you have a good understanding of regexes (or have a place like StackOverflow to ask specific questions when you get stuck) and are willing to write a bunch of automated tests for your code. Watch out on that last ("anything else") case, though -- you want to stop just before an `@` if it's followed by a `"`, but not if it's an `@` to escape a keyword to use as an identifier.
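For orientation only, here is a minimal sketch of the shape such a tokenizer could take. It is not a set of fully tested patterns. It's in Python because that's quick to try out; the names `COMMENT`, `STRING`, `OTHER` and `tokenize` are made up for this example, and the patterns are simplified assumptions: char literals such as `'"'` and interpolated strings (`$"..."`) are not handled. Porting to another regex engine should mostly be a matter of syntax.

```python
import re

# 1. A comment at the current position: // to end of line, or /* ... */
#    (an unterminated /* comment swallows the rest of the input).
COMMENT = re.compile(r'//[^\n]*|/\*.*?(?:\*/|\Z)', re.DOTALL)

# 2. A string literal at the current position.
#    Verbatim: @"..." where "" is an escaped quote and newlines are allowed.
#    Regular:  "..." where backslash escapes the next character.
#    The closing quote is optional so unterminated strings still get consumed.
STRING = re.compile(r'@"(?:""|[^"])*"?|"(?:\\.|[^"\\\n])*"?', re.DOTALL)

# 3. Anything else, up to the first thing that could start a comment or string.
#    A lone / (division) and an @ not followed by " (e.g. @class) are fine.
OTHER = re.compile(r'(?:[^"/@]|/(?![/*])|@(?!"))+')

PATTERNS = [("comment", COMMENT), ("string", STRING), ("other", OTHER)]

def tokenize(source):
    """Yield (kind, text) pairs, always matching at the current position."""
    pos = 0
    while pos < len(source):
        for kind, pattern in PATTERNS:
            # match(source, pos) only succeeds if the token starts exactly at pos,
            # which is equivalent to pulling it off the front of the remainder.
            m = pattern.match(source, pos)
            if m:
                yield kind, m.group()
                pos = m.end()
                break
        else:
            raise ValueError(f"no pattern matched at offset {pos}")

source = 'var s = "http://x"; // note\nvar @class = @"a ""b"" // c"; /* end */'
tokens = list(tokenize(source))
comments = [text for kind, text in tokens if kind == "comment"]
print(comments)  # ['// note', '/* end */']
```

Neither the `//` inside the regular string nor the one inside the verbatim string is reported, and `@class` stays in the "other" token instead of being mistaken for the start of a verbatim string. Treat this as a starting point for your own automated tests, not as something verified against real-world code.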