I am trying to count the characters inside comments in C source code using Python and regular expressions, but without success. I could erase string literals first, so that comment markers inside strings are ignored, but that would also erase strings that appear inside comments, and the count would then be wrong. Is there a way to make a regex skip strings inside comments, or vice versa?
This is one of those things that regex isn't supposed to be used for, I believe – Carson Myers Apr 04 '10 at 20:47
3 Answers
No, not really.
Regex is not the correct tool to parse nested structures like you describe; instead you will need to parse the C syntax (or the "dumb subset" of it you're interested in, anyway), and you might find regex helpful in that. A relatively simple state machine with three states (CODE, STRING, COMMENT) would do it.
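The state machine described above could be sketched roughly as follows. This is a minimal illustration rather than a full C lexer: the function name `count_comment_chars` is hypothetical, the COMMENT state is split into line and block variants, and a CHAR state is added because a character constant such as `'/*'` does not start a comment. It assumes `\n` line endings, ignores trigraphs and backslash-newline continuations, and counts only the characters between the comment delimiters.

```python
def count_comment_chars(source: str) -> int:
    """Count characters inside C comments, excluding the delimiters."""
    CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT = range(5)
    state = CODE
    count = 0
    i = 0
    while i < len(source):
        ch = source[i]
        pair = source[i:i + 2]
        if state == CODE:
            if pair == "/*":
                state = BLOCK_COMMENT
                i += 2
                continue
            if pair == "//":
                state = LINE_COMMENT
                i += 2
                continue
            if ch == '"':
                state = STRING
            elif ch == "'":
                state = CHAR
        elif state in (STRING, CHAR):
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if (state == STRING and ch == '"') or (state == CHAR and ch == "'"):
                state = CODE
        elif state == LINE_COMMENT:
            if ch == "\n":
                state = CODE
            else:
                count += 1
        elif state == BLOCK_COMMENT:
            if pair == "*/":
                state = CODE
                i += 2
                continue
            count += 1
        i += 1
    return count


if __name__ == "__main__":
    sample = 'char *s = "/* not a comment */"; /* real */ // tail\n'
    print(count_comment_chars(sample))
```

Because each state only looks at the current character (or pair of characters), a comment marker inside a string and a quote inside a comment are both handled without any special cases.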
You might need a fourth state for character constants. In C, the text `'/*'` is a multi-character character constant; it has undefined or implementation-defined behaviour, but does not start a comment. – Jonathan Leffler Apr 04 '10 at 20:58
Regular expressions are not always a replacement for a real parser.

You can strip out all strings that aren't in comments by searching for the regular expression:
'[^'\r\n]+'|(//.*|/\*(?s:.*?)\*/)
and replacing with:
$1
Essentially, this searches for a regex of the form string|(comment), which matches either a string or a comment but captures only the comment. The replacement is therefore nothing if a string was matched, or the comment itself if a comment was matched.
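Since the question is about Python, here is a sketch of that search-and-replace with the `re` module, using the pattern exactly as given. Two details differ from the notation above: Python writes the backreference as `\1` rather than `$1`, and the scoped flag `(?s:...)` needs Python 3.6 or later. The function name `strip_strings` is hypothetical. Note that the pattern treats single-quoted text as a string; for C's double-quoted literals the quote characters would need to be changed.

```python
import re

# The pattern from the answer, unchanged: a single-quoted string, or a
# captured comment (either // to end of line, or /* ... */).
PATTERN = re.compile(r"'[^'\r\n]+'|(//.*|/\*(?s:.*?)\*/)")


def strip_strings(source: str) -> str:
    """Remove strings outside comments; comments are put back unchanged."""
    # When a string matched, group 1 did not participate, and Python 3.5+
    # substitutes an empty string for it.
    return PATTERN.sub(r"\1", source)


if __name__ == "__main__":
    sample = "x = 'abc'; // it's 'fine'\n/* 'kept' */ y = 'z';"
    print(strip_strings(sample))
```

Quoted text inside a comment survives because the comment alternative starts matching at `//` or `/*`, before the regex engine ever reaches the quote.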
Though regular expressions are not a replacement for a real parser, you can quickly build a rudimentary one by creating a giant regex that alternates all of the tokens you're interested in (comments and strings in this case). If your code needs to handle comments while ignoring comment markers that appear inside strings, iterate over all the matches of the above regex, and count the characters in the first capturing group whenever it participated in the match.
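The iterate-and-count approach might look like the sketch below. It is an adaptation, not the pattern given above: as an assumption, the string alternative is rewritten for double-quoted C literals with backslash escapes, and a character-constant alternative is added so that `'"'` does not open a string. The names `TOKEN` and `count_comment_chars` are hypothetical, and the count excludes the `//`, `/*` and `*/` delimiters.

```python
import re

TOKEN = re.compile(
    r"""
      "(?:\\.|[^"\\\r\n])*"      # string literal with backslash escapes
    | '(?:\\.|[^'\\\r\n])*'      # character constant
    | (//[^\r\n]*|/\*.*?\*/)     # group 1: line comment or block comment
    """,
    re.VERBOSE | re.DOTALL,
)


def count_comment_chars(source: str) -> int:
    """Sum the lengths of all comments, excluding their delimiters."""
    total = 0
    for match in TOKEN.finditer(source):
        comment = match.group(1)
        if comment is None:
            # A string or character constant matched; nothing to count.
            continue
        delimiters = 2 if comment.startswith("//") else 4
        total += len(comment) - delimiters
    return total


if __name__ == "__main__":
    sample = 'char *s = "/* not a comment */"; /* real */ // tail\n'
    print(count_comment_chars(sample))
```

An unterminated block comment never matches the third alternative, so this sketch silently skips it; a stricter tool would report it as an error.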
