50

I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)

I realize that I could .match() substrings with a Regex, but that doesn't solve nesting /*, or having a // inside a /* */.

Ideally, I would prefer a non-naive implementation that properly handles awkward cases.

jww
TomZ
  • @QuantumPete, to improve readability and comprehensibility. The quickest approach is to use a colorizing editor and set comment color equal to background color. – Thomas L Holaday May 23 '09 at 00:12
  • @QuantumPete Or because we're trying to preprocess source code for a subsequent processor that doesn't take sane comments – Damian Yerrick Feb 06 '17 at 03:27
  • I would suggest [this](https://stackoverflow.com/a/53551634/3625404). (I wrote it.) – qeatzy Nov 30 '18 at 05:46

13 Answers

104

This handles C++-style comments, C-style comments, strings and simple nesting thereof.

import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " " # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

Strings need to be included, because comment markers inside them do not start a comment.

Edit: re.sub didn't take any flags, so had to compile the pattern first.

Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.

Edit3: Fixed the case where a legal expression int/**/x=5; would become intx=5; which would not compile, by replacing the comment with a space rather than an empty string.

Scis
Markus Jarderot
  • This doesn't handle escaped " chars in strings. eg: char *some_punctuation_chars=".\"/*"; /* comment */ – Brian Oct 29 '08 at 12:45
  • Yes it does. `\\.` will match any escaped char, including `\"`. – Markus Jarderot Oct 29 '08 at 19:37
  • Also you can preserve line numbering relative to the input file by changing the first return to `return "" + "\n" * s.count('\n')`. I needed to do this in my situation. – atikat Feb 03 '10 at 06:27
  • So I think it would fail on various RegExp strings (e.g. `/\//` or `/\/*/` or `/'/; //blah`) and multiline strings (http://davidwalsh.name/multiline-javascript-strings). i.e. usable for simple code, but probably not for larger production codebases. If I had to use Python I would look for solutions using pynoceros or pynarcissus. If you can use node.js then UglifyJS2 is a good base for munging JavaScript code. – robocat Apr 26 '13 at 06:00
  • @robocat True. But Regex literals are not part of the C language. If you wish to parse code with Regex literals, you could add this at the end of the Regex: `|/(?:\\.|[^\\/])+/`. The condition in the `replacer()` function would also have to be tweaked. – Markus Jarderot Apr 26 '13 at 06:12
  • @markus-jarderot - Good point! I forgot it was C because I was looking for an ECMAScript solution! With C the regex can also fail on preprocessor statements (removing lines beginning with # is probably an easy fix for that issue though) so as it stands it doesn't solve "properly handles awkward cases". Also doesn't C have multiline strings using \ and does this handle those? – robocat May 08 '13 at 04:42
  • @robocat It does handle escapes, but not pre-processor statements. Neither does it handle [Digraphs and trigraphs](http://en.wikipedia.org/wiki/Digraphs_and_trigraphs), but that is normally not a problem. For pre-processor statements, you could add `|#[^\r\n]*(?:\\\r?\n[^\r\n]*)*` at the end of the regex. – Markus Jarderot May 08 '13 at 05:39
  • This fails for me (python2 and python3) on the simple string `blah "blah"` with error `TypeError: sequence item 1: expected string, module found`. – Mark Smith Mar 08 '18 at 14:20
  • It leaves a newline after removing a multi line comment. any fix for this? – Aman Deep May 12 '20 at 14:24
  • @AmanDeep You could add `[^\S\r\n]*\r?\n?` after `\*/` to include whitespace up until and including the following newline, if any. – Markus Jarderot May 13 '20 at 10:38
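The preprocessor suggestion from these comments can be sketched as follows. This is an illustration, not the answer's original code: the name `comment_remover_keep_directives` is made up here, and the directive alternative is a variation on the one quoted above, written so that a backslash-newline is tried before ordinary characters. Directives are matched as a unit and returned unchanged, so a `//` inside an `#include` path survives. A side effect of this choice is that comments written on a directive line are not stripped.

```python
import re

# The answer's pattern plus one extra alternative for preprocessor lines.
PATTERN = re.compile(
    r'//.*?$|/\*.*?\*/'              # C++-style and C-style comments
    r'|\'(?:\\.|[^\\\'])*\''         # character literals
    r'|"(?:\\.|[^\\"])*"'            # string literals
    r'|#(?:\\\r?\n|[^\r\n])*',       # directive, following backslash-newline continuations
    re.DOTALL | re.MULTILINE,
)

def comment_remover_keep_directives(text):
    def replacer(match):
        s = match.group(0)
        # Only comments start with '/'; literals and directives pass through.
        return " " if s.startswith('/') else s
    return PATTERN.sub(replacer, text)

source = '#include <a//b.h>\nchar *s = "/* kept */"; // dropped\n'
print(comment_remover_keep_directives(source))
```

Because the directive alternative sits last in the alternation, a `#` that appears inside a comment or a string literal is still consumed by the earlier alternatives.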
30

C (and C++) comments cannot be nested. Regular expressions work well:

//.*?\n|/\*.*?\*/

This requires the re.S (re.DOTALL) flag, which lets . match newlines, because a C comment can span multiple lines.

import re

def stripcomments(text):
    return re.sub(r'//.*?\n|/\*.*?\*/', '', text, flags=re.S)

This code should work.

/EDIT: Notice that my above code actually makes an assumption about line endings! This code won't work on a Mac text file. However, this can be amended relatively easily:

//.*?(\r\n?|\n)|/\*.*?\*/

This regular expression should work on all text files, regardless of their line endings (covers Windows, Unix and Mac line endings).

/EDIT: MizardX and Brian (in the comments) made a valid remark about the handling of strings. I completely forgot about that because the above regex is plucked from a parsing module that has additional handling for strings. MizardX's solution should work very well but it only handles double-quoted strings.
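To make the string problem concrete, here is a small sketch that runs the string-unaware pattern on the two examples Brian gives in the comments. The function name `stripcomments_naive` is invented for this illustration; the pattern is the line-ending-tolerant one shown above.

```python
import re

def stripcomments_naive(text):
    # String-unaware pattern: comment markers inside literals are not protected.
    return re.sub(r'//.*?(\r\n?|\n)|/\*.*?\*/', '', text, flags=re.S)

# The "//" inside the string is taken for a comment start.
print(repr(stripcomments_naive('string uncPath = "//some_path";\n')))

# The "/*" inside the string pairs up with the "*/" of the real comment.
print(repr(stripcomments_naive('char operators[]="/*+-"; /* c */\n')))
```

In both cases the output is cut off in the middle of the string literal, which is why the string and character-literal alternatives are needed.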

Doron Yaacoby
Konrad Rudolph
  • 1. use `$` and `re.MULTILINE` instead of `'\n'`, `'\r\n'`, etc – jfs Oct 27 '08 at 21:46
  • This doesn't handle the case of a line ending in a backslash, which indicates a continued line, but that case is extremely rare – Adam Rosenfield Oct 27 '08 at 22:00
  • You've missed the replacement blank string in the re.sub. Also, this won't work for strings. Eg. consider 'string uncPath = "//some_path";' or 'char operators[]="/*+-";' For language parsing, I think you're best off using a real parser. – Brian Oct 27 '08 at 22:01
  • Your code doesn't handle abuse of comments, such as a backslash-newline in between the two start-of-comment symbols, or between the star-slash that ends a classic C-style comment. There's a strong sense in which it "doesn't matter; no-one in their right mind writes comments like that". YMMV. – Jonathan Leffler Oct 28 '08 at 17:55
  • @Jonathan: Wow, I didn't think this would compile. Redefines the meaning of “lexeme”. By the way, are there syntax highlighters (IDEs, code editors) that support this? Neither VIM nor Visual Studio do. – Konrad Rudolph Oct 28 '08 at 20:13
  • "C (and C++) comments cannot be nested." Some compilers (well, at least Borland's (free) version 5.5.1) allow nested C-style comments via a command line switch. – PTBNL Aug 18 '09 at 14:47
7

Don't forget that in C, backslash-newline is eliminated before comments are processed, and trigraphs are processed before that (because ??/ is the trigraph for backslash). I have a C program called SCC (strip C/C++ comments), and here is part of the test code...

" */ /* SCC has been trained to know about strings /* */ */"!
"\"Double quotes embedded in strings, \\\" too\'!"
"And \
newlines in them"

"And escaped double quotes at the end of a string\""

aa '\\
n' OK
aa "\""
aa "\
\n"

This is followed by C++/C99 comment number 1.
// C++/C99 comment with \
continuation character \
on three source lines (this should not be seen with the -C flag)
The C++/C99 comment number 1 has finished.

This is followed by C++/C99 comment number 2.
/\
/\
C++/C99 comment (this should not be seen with the -C flag)
The C++/C99 comment number 2 has finished.

This is followed by regular C comment number 1.
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.

/\
\/ This is not a C++/C99 comment!

This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.

/\
\* This is not a C or C++  comment!

This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.

This is followed by regular C comment number 3.
/\
\
\
\
* C comment */

This does not illustrate trigraphs. Note that you can have multiple backslashes at the end of a line; line splicing doesn't care how many there are, but the subsequent processing might. Etc. Writing a single regex to handle all these cases will be non-trivial (but that is different from impossible).
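One way to approximate this ordering in Python is to splice lines before any comment stripping is attempted. The sketch below is an assumption-laden illustration, not part of SCC: `splice_lines` is a name invented here, it only translates the `??/` trigraph (no other trigraphs), and because it deletes newlines it changes the line numbering of the result.

```python
def splice_lines(text):
    # Partial trigraph handling: ??/ stands for a backslash.
    text = text.replace('??/', '\\')
    # Line splicing: drop every backslash that is immediately followed by a newline.
    return text.replace('\\\r\n', '').replace('\\\n', '')

# "C++/C99 comment number 2" style: the two slashes are split across lines.
print(repr(splice_lines('/\\\n/ comment\nint x;\n')))

# "Regular C comment number 1" style: both delimiters are split across lines.
print(repr(splice_lines('/\\\n*\\\nRegular\ncomment\n*\\\n/\n')))
```

After splicing, the comment delimiters are contiguous again, so a string-aware regex or scanner can be applied to the result.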

Jonathan Leffler
  • I would also add that if anyone wrote a comment with the comment start or end symbols split over lines, I'd persuade them of the error of their ways. And extending a single-line comment with a trailing backslash is also similarly evil. So, the problems here are more imaginary than real - unless you're a C compiler writer. – Jonathan Leffler Jul 05 '10 at 17:13
6

This posting provides a coded-out version of the improvement to Markus Jarderot's code that was described by atikat, in a comment to Markus Jarderot's posting. (Thanks to both for providing the original code, which saved me a lot of work.)

To describe the improvement somewhat more fully: The improvement keeps the line numbering intact. (This is done by keeping the newline characters intact in the strings by which the C/C++ comments are replaced.)

This version of the C/C++ comment removal function is suitable when you want to generate error messages to your users (e.g. parsing errors) that contain line numbers (i.e. line numbers valid for the original text).

import re

def removeCCppComment( text ) :

    def blotOutNonNewlines( strIn ) :  # Return a string containing only the newline chars contained in strIn
        return "" + ("\n" * strIn.count('\n'))

    def replacer( match ) :
        s = match.group(0)
        if s.startswith('/'):  # Matched string is //...EOL or /*...*/  ==> Blot out all non-newline chars
            return blotOutNonNewlines(s)
        else:                  # Matched string is '...' or "..."  ==> Keep unchanged
            return s

    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )

    return re.sub(pattern, replacer, text)
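One thing to watch for: `blotOutNonNewlines` returns an empty string for a comment that contains no newline, so `int/**/x=5;` becomes `intx=5;` again, which is the case Edit3 of Markus Jarderot's answer addressed. A possible variant, sketched here under a made-up name (`remove_comments_keep_lines`), emits one space followed by the comment's newlines. The line count is preserved either way.

```python
import re

_PATTERN = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def remove_comments_keep_lines(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            # One space keeps neighbouring tokens apart; the newlines keep line numbers.
            return " " + "\n" * s.count("\n")
        return s  # string or character literal: keep unchanged
    return _PATTERN.sub(replacer, text)

src = "int/**/x=5; /* spans\ntwo lines */\nint y; // tail\n"
out = remove_comments_keep_lines(src)
print(out.count("\n") == src.count("\n"))
```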
Menno Rubingh
4

The regular expression cases will fall down in some situations, like where a string literal contains a subsequence which matches the comment syntax. You really need a parse tree to deal with this.

Alex Coventry
4

I don't know if you're familiar with sed, the UNIX-based (but Windows-available) text parsing program, but I've found a sed script here which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if found in a string declaration, etc. From within Python, it can be used using the following code:

import subprocess

# source_code is a string with the source code.
process = subprocess.run(
    ['sed', '-f', '/path/to/remccoms3.sed'],
    input=source_code, capture_output=True, text=True)
return_code = process.returncode

stripped_code = process.stdout

In this program, source_code is the variable holding the C/C++ source code, and stripped_code will hold the C/C++ code with the comments removed. Of course, if you have the file on disk, you could instead pass file handles pointing to those files as the stdin and stdout arguments (input in read-mode, output in write-mode). remccoms3.sed is the file from the above link, and it should be saved in a readable location on disk. sed is also available on Windows, and comes installed by default on most GNU/Linux distros and Mac OS X.

This will probably be better than a pure Python solution; no need to reinvent the wheel.

zvoase
3

You may be able to leverage py++ to parse the C++ source with GCC.

Py++ does not reinvent the wheel. It uses GCC C++ compiler to parse C++ source files. To be more precise, the tool chain looks like this:

source code is passed to GCC-XML; GCC-XML passes it to GCC C++ compiler; GCC-XML generates an XML description of a C++ program from GCC's internal representation; Py++ uses pygccxml package to read GCC-XML generated file. The bottom line - you can be sure, that all your declarations are read correctly.

or, maybe not. regardless, this is not a trivial parse.

@ RE based solutions - you are unlikely to find a RE that handles all possible 'awkward' cases correctly, unless you constrain input (e.g. no macros). For a bulletproof solution, you really have no choice but to leverage the real grammar.

Dustin Getz
  • Also, as Alex Coventry mentions, simple regexes will hose string literals that happen to contain comment markers (which is perfectly legal). – nobody Oct 28 '08 at 03:42
2

I have been using pygments to parse the string and then ignore all tokens that are comments. Works like a charm with any lexer on the pygments list, including JavaScript, SQL, and C-like languages.

from pygments import lex
from pygments.token import Token as ParseToken

def strip_comments(replace_query, lexer):
    generator = lex(replace_query, lexer)
    line = []
    lines = []
    for token in generator:
        token_type = token[0]
        token_text = token[1]
        if token_type in ParseToken.Comment:
            continue
        line.append(token_text)
        if token_text == '\n':
            lines.append(''.join(line))
            line = []
    if line:
        line.append('\n')
        lines.append(''.join(line))
    strip_query = "".join(lines)  # each collected line already ends with its own newline
    return strip_query

Working with C-like languages:

from pygments.lexers.c_like import CLexer

strip_comments("class Bla /*; complicated // stuff */ example; // out",CLexer())
# 'class Bla  example; \n'

Working with SQL languages:

from pygments.lexers.sql import SqlLexer

strip_comments("select * /* this is cool */ from table -- more comments",SqlLexer())
# 'select *  from table \n'

Working with JavaScript-like languages:

from pygments.lexers.javascript import JavascriptLexer
strip_comments("function cool /* not cool*/(x){ return x++ } /** something **/ // end",JavascriptLexer())
# 'function cool (x){ return x++ }  \n'

Since this code only removes the comments, any strange value will remain. So, this is a very robust solution that is able to deal even with invalid inputs.

Thiago Mata
  • It's been some time since this answer was posted, but I just wanted to say that I found it extremely useful. I've been experimenting with Thiago's solution above, but wanted to note that if you're parsing C code you may want to use the following import instead of the one leveraging pygments.lexers.c_like: `from pygments.lexers.c_cpp import CLexer`. I'm still experimenting with this, but using the former discarded pre-processor definitions for me. – Michael Donahue Jul 09 '21 at 16:51
  • [Here's a link to the lexers available](https://pygments.org/docs/lexers/#:~:text=in%20version%201.5.-,Lexers%20for%20C/C%2B%2B%20languages,platform%20SDK%20headers%20(e.g.%20clockid_t%20on%20Linux).%20(default%3A%20True).,-class%20pygments.lexers) – Michael Donahue Jul 09 '21 at 17:20
1

I'm sorry this is not a Python solution, but you could also use a tool that understands how to remove comments, like your C/C++ preprocessor. Here's how GNU CPP does it.

cpp -fpreprocessed foo.c
sigjuice
1

There is also a non-python answer: use the program stripcmt:

StripCmt is a simple utility written in C to remove comments from C, C++, and Java source files. In the grand tradition of Unix text processing programs, it can function either as a FIFO (First In - First Out) filter or accept arguments on the commandline.

hlovdal
1

The following worked for me:

from subprocess import check_output

class Util:
  def strip_comments(self, source_path):
    # source_path is the path of the file to strip, not the source text itself.
    return check_output(['cpp', '-fpreprocessed', source_path],
                        shell=False, universal_newlines=True)

if __name__ == "__main__":
  util = Util()
  print(util.strip_comments("somefile.ext"))

This is a combination of the subprocess module and the cpp preprocessor. For my project I have a utility class called "Util" where I keep various tools I use/need.

0

You don't really need a parse tree to do this perfectly, but you do in effect need the token stream equivalent to what is produced by the compiler's front end. Such a token stream must necessarily take care of all the weirdness such as line-continued comment start, comment start in string, trigraph normalization, etc. If you have the token stream, deleting the comments is easy. (I have a tool that produces exactly such token streams, as, guess what, the front end of a real parser that produces a real parse tree :).

The fact that the tokens are individually recognized by regular expressions suggests that you can, in principle, write a regular expression that will pick out the comment lexemes. The real complexity of the set of regular expressions for the tokenizer (at least the one we wrote) suggests you can't do this in practice; writing them individually was hard enough. If you don't want to do it perfectly, well, then, most of the RE solutions above are just fine.

Now, why you would want to strip comments is beyond me, unless you are building a code obfuscator. In this case, you have to have it perfectly right.
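A full compiler-grade token stream is more than most scripts need, but the same idea can be approximated with a small hand-written scanner that walks the text once and distinguishes code, literals and comments. The sketch below is only an illustration of that approach (the name `strip_comments_scanner` is invented here). It assumes trigraphs and backslash-newline splices have already been dealt with, and it knows nothing about C++ raw string literals.

```python
def strip_comments_scanner(text):
    """Single pass over the text: copy code and literals, replace comments with a space."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        two = text[i:i + 2]
        if two == '//':
            # Line comment: skip up to, but not past, the end of the line.
            while i < n and text[i] != '\n':
                i += 1
            out.append(' ')
        elif two == '/*':
            # Block comment: skip past the closing */, or to end of input if unterminated.
            end = text.find('*/', i + 2)
            i = n if end == -1 else end + 2
            out.append(' ')
        elif ch in '"\'':
            # String or character literal: copy verbatim, honouring backslash escapes.
            start = i
            i += 1
            while i < n and text[i] != ch:
                i += 2 if text[i] == '\\' else 1
            i = min(i + 1, n)  # include the closing quote if there is one
            out.append(text[start:i])
        else:
            out.append(ch)
            i += 1
    return ''.join(out)

print(strip_comments_scanner('char *s = "/* kept */"; /* gone */ int/**/x; // tail\n'))
```

Searching for the closing `*/` from two characters past the opening `/*` means that `/*/` is not mistaken for a complete comment.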

Ira Baxter
-2

I ran across this problem recently when I took a class where the professor required us to strip javadoc from our source code before submitting it to him for a code review. We had to do this several times, but we couldn't just remove the javadoc permanently because we were required to generate javadoc html files as well. Here is a little python script I made to do the trick. Since javadoc starts with /** and ends with */, the script looks for these tokens, but the script can be modified to suit your needs. It also handles single line block comments and cases where a block comment ends but there is still non-commented code on the same line as the block comment ending. I hope this helps!

WARNING: This script modifies the contents of the files passed in and saves them over the original files. It would be wise to have a backup somewhere else.

#!/usr/bin/python
"""
 A simple script to remove block comments of the form /** */ from files
 Use example: ./strip_comments.py *.java
 Author: holdtotherod
 Created: 3/6/11
"""
import sys
import fileinput

for file in sys.argv[1:]:
    inBlockComment = False
    for line in fileinput.input(file, inplace = 1):
        if "/**" in line:
            inBlockComment = True
        if inBlockComment and "*/" in line:
            inBlockComment = False
            # If the */ isn't last, remove through the */
            if line.find("*/") != len(line) - 3:
                line = line[line.find("*/")+2:]
            else:
                continue
        if inBlockComment:
            continue
        sys.stdout.write(line)
slottermoser
  • That surely fails if there is a `//` or `/*` within a string, or within a `/` delimited regular expression. – robocat Apr 26 '13 at 02:35
  • No it doesn't. It's looking for `/** */` style java block comments, as stated in the description. It doesn't handle `//` or `/*` or even `/`... it isn't perfect, but it doesn't "fail", just ignores the cases you stated. It was just a reference for anyone looking for something similar. – slottermoser Jun 01 '13 at 23:39