Using regex to remove comments from source files

Question

I'm writing a program to automate the generation of some C code (it parses strings into enumerations with the same name). C's handling of strings is not that great, so some people have been nagging me to try Python.

I made a function that is supposed to remove C-style /* COMMENT */ and // COMMENT comments from a string. Here is the code:

def removeComments(string):
    re.sub(re.compile("/\*.*?\*/",re.DOTALL ) ,"" ,string) # remove all occurance streamed comments (/*COMMENT */) from string
    re.sub(re.compile("//.*?\n" ) ,"" ,string) # remove all occurance singleline comments (//COMMENT\n ) from string

So I tried this code out.

str="/* spam * spam */ eggs"
removeComments(str)
print str

And it apparently did nothing.

Any suggestions as to what I've done wrong?

There's a saying I've heard a couple of times:

If you have a problem and you try to solve it with Regex you end up with two problems.

EDIT: Looking back at this years later (after a fair bit more parsing experience):

I think regex may have been the right solution, and the simple regex used here was "good enough". I may not have emphasized this enough in the question: this was for a single specific file that had no tricky situations. I think it would be a lot less maintenance to keep the file being parsed simple enough for the regex (e.g. require that the file only use // single-line comments) than to complicate the regex into an unreadable symbol soup.

There's really only one reasonable reply: http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html. He was talking about a different language, but his conclusion remains valid. — Jerry Coffin, Feb 23 '10 at 14:59
@Jerry - Strictly speaking, you can often guess a reasonable nesting limit, and define a regular approximation of the language. Many compilers have a comment-nesting limit anyway. But - what limit is safe? Also, I don't want to debug the regex. Good link either way. — , Feb 23 '10 at 17:23
@Steve314: you can guess at a reasonable nesting limit (e.g. in C, comments simply don't nest at all), but that does little good. Just for an obvious example, a comment delimiter in a string literal doesn't count, but a comment delimiter broken across lines (with a back-slash between the characters) *does* count. Taking either into account correctly in an RE is non-trivial at best. — Jerry Coffin, Feb 23 '10 at 18:07
@JerryCoffin Actually, the reasonable reply would be http://stackoverflow.com/a/1732454/321973 — Tobias Kienzler, Aug 22 '13 at 13:37
On a different note, couldn't you just use a C++ preprocessor? — Tobias Kienzler, Aug 22 '13 at 13:44
I see, but that's probably not intended to be used in your case. Anyway, I see [Onur YILDIRIM's answer](http://stackoverflow.com/a/18381470/321973) even manages quotes and comments interleaving — Tobias Kienzler, Aug 23 '13 at 07:03

Onur Yıldırım · Answer 1 · 2020-01-18T02:02:30.987

54

What about "//comment-like strings inside quotes"?

The OP is asking how to do it using regular expressions, so:

import re

def remove_comments(string):
    pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
    # first group captures quoted strings (double or single)
    # second group captures comments (//single-line or /* multi-line */)
    regex = re.compile(pattern, re.MULTILINE|re.DOTALL)
    def _replacer(match):
        # if the 2nd group (capturing comments) is not None,
        # it means we have captured a non-quoted (real) comment string.
        if match.group(2) is not None:
            return "" # so we will return empty to remove the comment
        else: # otherwise, we will return the 1st group
            return match.group(1) # captured quoted-string
    return regex.sub(_replacer, string)

This WILL remove:

/* multi-line comments */
// single-line comments

Will NOT remove:

String var1 = "this is /* not a comment. */";
char *var2 = "this is // not a comment, either.";
url = 'http://not.comment.com';

Note: This will also work for Javascript source.

edited Jan 18 '20 at 02:02

answered Aug 22 '13 at 13:12

Onur Yıldırım


1

What about `/* Maybe we // "shouldn't /* do this */// but let's do it */ //" anyway` (insert line breaks at will) – Tobias Kienzler Aug 22 '13 at 13:41
Neat, it even manages interleaving à la `'a = /* a "-nested string */ "comments can end with */" // comment2'`! I see grouping makes regexes much more powerful _and_ understandable. I wonder whether one could still construct valid C-code that makes your regex fail ;) But that would probably involve some very nasty precompiler stuff... – Tobias Kienzler Aug 23 '13 at 06:59
This method fails if you have escaped quotes, e.g. `"some 2\" string" /* remove this */ "another string"` – ishmael Mar 10 '14 at 18:54
6

Here's an improved regex that does a negative lookbehind to avoid escaped quotes. `r"(\".*?(?<!\\)\"|\'.*?(?<!\\)\')|(/\*.*?\*/|//[^\r\n]*$)"` – ishmael Mar 10 '14 at 19:09
2

@ishmael What about `"string with backslash \\" /* remove this */ "..."`? – Mariano Oct 05 '15 at 09:04
@Mariano Replace the look-behind with [this](https://stackoverflow.com/a/24209736/5223757) look-behind. – wizzwizz4 Apr 01 '18 at 15:31
@Mariano Why would it be better to unroll the loop? Regular expressions are powerful enough to do the job, and they fulfill the question requirements. – wizzwizz4 Apr 01 '18 at 17:38
@wizzwizz4 unrolling the loop is a technique that also applies to regex. Check [some of the answers](https://stackoverflow.com/search?tab=votes&q=%5bregex%5d%20unrolling%20the%20loop) – Mariano Apr 01 '18 at 17:41
@Mariano So it should really be replaced by [this](https://stackoverflow.com/a/5696141/5223757) one? – wizzwizz4 Apr 01 '18 at 17:50
I wanted to match SQL comments, I modified it a little and line returns were not included for me, I used this: `(\".*?(?<!\\)\"|\'.*?(?<!\\)\')|(\/\*[\S\n\t\v ]*?\*\/|--[^\r\n]*$)` – Dubrzr Oct 21 '21 at 08:22
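
The escaped-quote cases raised in the comments above can probably be handled without look-behinds, by letting each string alternative consume backslash escapes as pairs. The following is only a sketch under that assumption: the name remove_comments_escaped is made up for illustration, and it still ignores line splices, trigraphs and C++ raw string literals.

```python
import re

# Hypothetical variant of the answer's pattern; not the answer's own code.
_PATTERN = re.compile(
    r"""
    ( "(?:\\.|[^"\\])*"     # group 1: double-quoted literal, escapes consumed as pairs
    | '(?:\\.|[^'\\])*'     #          or single-quoted literal
    )
    |
    ( /\*.*?\*/             # group 2: block comment
    | //[^\r\n]*            #          or line comment, up to the end of the line
    )
    """,
    re.DOTALL | re.VERBOSE,
)

def remove_comments_escaped(text):
    # Keep quoted literals (group 1) untouched, drop comments (group 2).
    return _PATTERN.sub(
        lambda m: "" if m.group(2) is not None else m.group(1), text
    )

print(remove_comments_escaped(r'"some 2\" string" /* remove this */ "another string"'))
```

Because a backslash always consumes the character after it, both `\"` and a literal ending in `\\"` are handled by the same rule.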

score 48 · Accepted Answer · edited Apr 07 '19 at 15:51

48

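re.sub returns a new string rather than modifying its argument (Python strings are immutable), so you have to assign its result and return it. The caller must then use the returned value, e.g. str = removeComments(str). Changing your code to the following will give results: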

import re

def removeComments(string):
    string = re.sub(re.compile(r"/\*.*?\*/", re.DOTALL), "", string)  # remove all block comments (/* COMMENT */) from string
    string = re.sub(re.compile(r"//.*?\n"), "", string)  # remove all single-line comments (// COMMENT\n) from string
    return string

edited Apr 07 '19 at 15:51

kalehmann


answered Feb 23 '10 at 15:04

msanders


5

`char *note = "You can make a one-line comment with //";` Oops. – Mike Graham Feb 23 '10 at 15:47
Indeed. This only answers why the OP's function returned nothing. – msanders Feb 23 '10 at 16:15
This is the technically correct answer to the question. Maybe using a parser is a better way to solve my problem, but this made my code work. – Frames Catherine White Feb 24 '10 at 07:41
string = re.sub(re.compile("^//.*?$", re.MULTILINE ) ,"" ,string) – Nicholas Franceschina Feb 09 '11 at 23:05
The code above doesn't work for me for single-line comments. removeComments("// single-line comments") returns '// single-line comments' – Bill Mar 15 '15 at 19:34
Is this because the text `"// single-line comments"` has no newline as the last symbol? – Gombat Oct 21 '15 at 12:32
@MikeGraham it's so simple though that if your use case can reasonably have the constraint "no comments inside of strings", then it's an excellent solution. I can think of exactly zero times in my career where I've had to both strip comments from a file *and* the file contained strings containing comments. My use case for this is stripping comments from jsonc files before parsing, and this solution works great and doesn't require me to pull in a 3rd party package. – Gillespie Aug 01 '22 at 18:24
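
The trailing-newline problem reported in the comments comes from the \n in the single-line pattern: a // comment on the last line of the input, with no newline after it, never matches. One possible variant, sketched under the same "no comment markers inside strings" constraint, matches up to the end of the line and leaves the newline in place. The name remove_comments_simple is hypothetical.

```python
import re

def remove_comments_simple(text):
    # Block comments may span lines, so "." must also match newlines here.
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    # Line comments: stop before the newline, so the last line needs no "\n".
    text = re.sub(r"//[^\n]*", "", text)
    return text

print(repr(remove_comments_simple("// single-line comments")))
```

Unlike the pattern in the answer, this keeps the newline that follows a // comment, so the line count of the file does not change.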

jathanism · Answer 3 · 2010-02-23T20:03:51.623

24

I would suggest using a REAL parser like SimpleParse or PyParsing. SimpleParse requires that you actually know EBNF, but is very fast. PyParsing has its own EBNF-like syntax but that is adapted for Python and makes it a breeze to build powerfully accurate parsers.

Edit:

Here is an example of how easy it is to use PyParsing in this context:

>>> test = '/* spam * spam */ eggs'
>>> import pyparsing
>>> comment = pyparsing.nestedExpr("/*", "*/").suppress()
>>> comment.transformString(test)
' eggs'

Here is a more complex example using single and multi-line comments.

Before:

/*
 * multiline comments
 * abc 2323jklj
 * this is the worst C code ever!!
*/
void
do_stuff ( int shoe, short foot ) {
    /* this is a comment
     * multiline again! 
     */
    exciting_function(whee);
} /* extraneous comment */

After:

>>> print(comment.transformString(code))

void
do_stuff ( int shoe, short foot ) {

     exciting_function(whee);
}

It leaves an extra newline wherever it stripped comments, but that could be addressed.

edited Feb 23 '10 at 20:03

answered Feb 23 '10 at 15:00

jathanism


Regex is bad, but parsing is overkill? I am confused; what else is there? – jathanism Feb 23 '10 at 15:10
I was looking at the problem wrong - searching based on a simple alternation regex is much easier than writing a parser. That said, it doesn't address confusion caused by things in strings. A parser (or Lexer) as Mike commented may be exactly the right tool for the job. – Feb 23 '10 at 17:17
Yeah, Regex is "easy" if your input is easy such as things with consistent format like IP addresses or phone numbers. For all other things: lexer. – jathanism Feb 23 '10 at 19:38
I don't think it leaves an extra newline - the newline just isn't part of the comment so it's not stripped, and it's not necessarily safe to strip it, as a newline *can* be significant whitespace in C – John La Rooy Feb 24 '10 at 01:35
Ah, that is a good observation and now that I look at it again I agree. :) – jathanism Feb 24 '10 at 15:18

score 8 · Answer 4 · answered Mar 24 '21 at 16:42

Found another solution with pyparsing, following jathanism's answer.

import pyparsing

test = """
/* Code my code
xx to remove comments in C++
or C or python */

include <iostream> // Some comment

int main (){
    cout << "hello world" << std::endl; // comment
}
"""
commentFilter = pyparsing.cppStyleComment.suppress()
# To filter python style comment, use
# commentFilter = pyparsing.pythonStyleComment.suppress()
# To filter C style comment, use
# commentFilter = pyparsing.cStyleComment.suppress()

newtest = commentFilter.transformString(test)
print(newtest)

Produces the following output:

include <iostream> 

int main (){
    cout << "hello world" << std::endl; 
}

You can also use pythonStyleComment, javaStyleComment, or cppStyleComment. I found it pretty useful.

score 4 · Answer 5 · answered Feb 23 '10 at 15:07

4

I would recommend you read this page, which has a quite detailed analysis of the problem and gives a good understanding of why your approach doesn't work: http://ostermiller.org/findcomment.html

Short version: The regex you are looking for is this:

(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)

This should match both types of comment. If you have trouble following it, read the page I linked.
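
As a rough illustration, the expression can be dropped into re.sub like this. The function name strip_comments_ostermiller is made up for this sketch, and like the other plain regexes in this thread it knows nothing about string literals.

```python
import re

# The expression quoted in this answer, compiled as a raw string.
COMMENT_RE = re.compile(r"(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)")

def strip_comments_ostermiller(text):
    # Replace every match (block or line comment) with nothing.
    return COMMENT_RE.sub("", text)

print(strip_comments_ostermiller("int a; /* c1 ** c2 */ int b; // tail\nint c;"))
```

Without re.DOTALL the final `//.*` stops at the end of the line, so the newline after a line comment is preserved.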

answered Feb 23 '10 at 15:07

MatsT


Unless I've missed something, this would miss a comment delimiter spliced across lines (slash, backslash, new-line, asterisk or asterisk, backslash, newline, slash). Worse, that backslash can be generated as the trigraph sequence `??/` (though I'll admit trigraphs are pretty rare). – Jerry Coffin Feb 23 '10 at 18:12

Otto Allmendinger · Answer 6 · 2010-02-23T16:06:32.127

1

You are doing it wrong.

Regex is for Regular Languages, which C isn't.

edited Feb 23 '10 at 16:06

answered Feb 23 '10 at 14:58

Otto Allmendinger


Of course one of the common expected differences between a lexer and a parser is that a lexer only supports a regular language. Not always true of course (e.g. see Ragel) as with regex. A good lexer can do the job, but as with using a parser, it seems like massive overkill just for comment stripping. – Feb 23 '10 at 15:10
@Steve314, If by "overkill" you mean *totally the right tool for the job*, then yeah. All of the regexps posted here are extremely buggy and will not do the right thing when faced with valid, realistic C(++) code. – Mike Graham Feb 23 '10 at 15:45
Read up about the lexer, removed my recommendation of a lexer – Otto Allmendinger Feb 23 '10 at 16:50
@Mike - On second thoughts, I agree - but the specific reason hasn't been mentioned (though it's a special case of your "valid, realistic" point). I just thought about things that look like comment markers, but are actually just characters in string literals. Avoiding those without the right tools would be a nasty job. Grab an existing C lexer (as long as it preserves the whitespace) - not so bad. – Feb 23 '10 at 17:13
@Mike - my own answer deleted as a result - considered harmful. – Feb 23 '10 at 17:14
@Steve314, That is the obvious valid, realistic code. (Like the example I posted in reply to msanders earlier.) – Mike Graham Feb 23 '10 at 17:52

score 1 · Answer 7 · answered Feb 23 '10 at 15:08

I see several things you might want to revise.

First, Python passes references to objects (not copies), but some object types are immutable. Strings and integers are among these immutable types. So if you pass a string to a function, nothing you do within the function can change the string you passed in; operations on it produce new strings. You should return the new string instead. Furthermore, within the removeComments() function, you need to assign the value returned by re.sub() to a variable -- like any function that takes a string as an argument, re.sub() cannot modify the string.

Second, I would echo what others have said about parsing C code. Regular expressions are not the best way to go here.
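
The first point can be seen in a few lines. This is a minimal sketch; strip_block_comments is a hypothetical helper, not code from the question.

```python
import re

def strip_block_comments(text):
    # re.sub builds and returns a new string; the caller's string is untouched.
    return re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)

source = "/* spam * spam */ eggs"
strip_block_comments(source)           # result discarded: nothing changes
unchanged = source                     # still the original text
source = strip_block_comments(source)  # rebinding the name keeps the result
print(repr(unchanged), repr(source))
```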

score 1 · Answer 8 · answered Feb 24 '10 at 00:31

1

mystring="""
blah1 /* comments with
multiline */

blah2
blah3
// double slashes comments
blah4 // some junk comments

"""
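for s in mystring.split("*/"):
    # keep the text before the block comment that ends this chunk
    # (find() returns -1 when there is no "/*", so the slice then drops only the last character)
    s=s[:s.find("/*")]
    # print only the text before the first "//" in the chunk (later lines such as blah4 are dropped)
    print(s[:s.find("//")])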

output

$ ./python.py

blah1


blah2
blah3

answered Feb 24 '10 at 00:31

ghostdog74


score 0 · Answer 9 · edited May 23 '17 at 12:32

As noted in one of my other comments, comment nesting isn't really the problem (in C, comments don't nest, though a few compilers do support nested comments anyway). The problem is with things like string literals, which can contain the exact same character sequence as a comment delimiter without actually being one.

As Mike Graham said, the right tool for the job is a lexer. A parser is unnecessary and would be overkill, but a lexer is exactly the right thing. As it happens, I posted a (partial) lexer for C (and C++) earlier this morning. It doesn't attempt to correctly identify all lexical elements (i.e. all keywords and operators) but it's entirely sufficient for stripping comments. It won't do any good on the "using Python" front though, as it's written entirely in C (it predates my using C++ for much more than experimental code).
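
For readers who want to stay in Python, the same idea can be sketched as a small hand-written scanner. This is not the C lexer mentioned above, only an illustration with a made-up name (strip_c_comments); it assumes no line splices, trigraphs or raw string literals in the input.

```python
def strip_c_comments(source):
    """Remove // and /* */ comments, leaving string and char literals intact."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch in "\"'":
            # Literal: copy verbatim up to the closing quote,
            # stepping over backslash escapes two characters at a time.
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == "\\" else 1
            j = min(j + 1, n)
            out.append(source[i:j])
            i = j
        elif source.startswith("//", i):
            # Line comment: skip up to, but not including, the newline.
            j = source.find("\n", i)
            i = n if j == -1 else j
        elif source.startswith("/*", i):
            # Block comment: replace it with one space, as a C compiler would.
            j = source.find("*/", i + 2)
            out.append(" ")
            i = n if j == -1 else j + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(strip_c_comments('char *s = "a // b"; /* x */ int y; // z\n'))
```

Tracking "inside a literal" versus "inside a comment" as explicit states is what keeps comment markers in strings from being misread.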

Saurabh Kukreti · Answer 10 · 2018-10-21T14:09:26.467

0

Just want to add another regex, for the case where we have to remove anything between * and ; in Python:

data = re.sub(re.compile(r"\*.*?;", re.DOTALL), ' ', data)

There is a backslash before * to escape the metacharacter.

edited Oct 21 '18 at 14:09

answered Oct 21 '18 at 14:01

Saurabh Kukreti


Why is this related to the original question? – E.Coms Oct 21 '18 at 14:39

harishli2020 · Answer 11 · 2013-07-21T20:45:27.890

-2

This program removes comments with // and /* */ from the given file:

#! /usr/bin/python3
import sys
import re
if len(sys.argv) != 2:
    sys.exit("Syntax: python3 exe18.py inputfile.cc")
else:
    print('The following files are given by you:', sys.argv[0], sys.argv[1])
with open(sys.argv[1], 'r') as ifile:
    newstring = re.sub(r'/\*.*?\*/', ' ', ifile.read(), flags=re.S)
with open(sys.argv[1], 'w') as ifile:
    ifile.write(newstring)
print('/* */ have been removed from the inputfile')
with open(sys.argv[1], 'r') as ifile:
    newstring1 = re.sub(r'//.*', ' ', ifile.read())
with open(sys.argv[1], 'w') as ifile:
    ifile.write(newstring1)
print('// have been removed from the inputfile')

edited Jul 21 '13 at 20:45

answered Jul 21 '13 at 20:01

harishli2020


Sorry, didn't understand what is PO? – harishli2020 Jul 21 '13 at 20:35
Sorry I meant OP (http://meta.stackexchange.com/questions/79804/whats-stackexchange-ese-for-op) – hivert Jul 21 '13 at 20:36
People probably downvoted you because you read from and write to the file multiple times. You should rather read it into memory once, perform your operations, and then write to the file once when done. It's much faster to work from memory. – run_the_race Nov 12 '21 at 17:31
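
The read-once, write-once approach suggested in the last comment might look like the sketch below. It reuses the two patterns from the answer; the function names strip_comments and strip_comments_in_file are hypothetical.

```python
import re

def strip_comments(text):
    # Same two substitutions as the answer, applied in memory.
    text = re.sub(r'/\*.*?\*/', ' ', text, flags=re.S)
    return re.sub(r'//.*', ' ', text)

def strip_comments_in_file(path):
    # Read the file once, transform the text, write it back once.
    with open(path, 'r') as ifile:
        text = ifile.read()
    with open(path, 'w') as ofile:
        ofile.write(strip_comments(text))

print(repr(strip_comments("int a; /* x */\nint b; // y\n")))
```

A script would call strip_comments_in_file(sys.argv[1]) after checking its arguments, as the answer does.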
