87

Is there an easy way to remove comments from a C/C++ source file without doing any preprocessing? (I.e., I think you can use gcc -E, but this will expand macros.) I just want the source code with comments stripped; nothing else should be changed.

EDIT:

Preference towards an existing tool. I don't want to have to write this myself with regexes, I foresee too many surprises in the code.

Mike
  • 5
    This is actually a good exercise for using a simple lexer and parser! – Greg Hewgill Mar 06 '10 at 20:32
  • Do we have to expect any tricks like `/*` inside comments or strings? – Pascal Cuoq Mar 06 '10 at 20:32
  • 63
    This is actually a good exercise for using a very complicated lexer and parser. –  Mar 06 '10 at 20:33
  • @pascal yes, I'm expecting lots of tricks. I don't want to have to make any assumptions – Mike Mar 06 '10 at 20:34
  • @Neil: that's what makes it a good exercise. :) – Greg Hewgill Mar 06 '10 at 20:37
  • If http://www.drdobbs.com/cpp/184401344 is to be believed, you can't remove comments (expecting tricks) without expanding macros. – Pascal Cuoq Mar 06 '10 at 20:40
  • @Mike are you seriously downvoting people who try to help you? – stacker Mar 06 '10 at 20:46
  • @stacker no, I only downvoted one comment so far and it linked to terrible code – Mike Mar 06 '10 at 20:47
  • 4
    @Pascal: I don't believe Dr. Dobbs, and gcc agrees: `error: pasting "/" and "/" does not give a valid preprocessing token` -which is expected, as comment removal happens before preprocessing – Christoph Mar 06 '10 at 20:52
  • 1
    @Neil: actually, it only requires a lexer, not a parser at all. And while a C++ lexer is more complex than most, it's still not terribly difficult. Don't get me wrong: I'm not particularly recommending it over using an existing tool -- but if parsing was required, it would be *drastically* more difficult than it really is. – Jerry Coffin Mar 06 '10 at 22:37
  • @Jerry The preprocessor has semantics, particularly regarding comments. Thus pre-processing requires a parser. This is why all modern C and C++ compilers build the preprocessor into the compiler. –  Mar 06 '10 at 22:49
  • 2
    @Neil:sorry, but no. A parser deals with the structure of statements. From the viewpoint of the language, a comment is a single token that does not participate in any larger structure. It's no different from a space character (in fact, in phase three of translation, each comment is to be replaced by a single space character). As for building the preprocessor into the compiler, the explanation is much simpler: the preprocessor often produces very *large* output, so communicating it to the compiler efficiently improves compilation speed a lot. – Jerry Coffin Mar 06 '10 at 23:23
  • @Jerry I refute you thus - /* ... */ comments cannot be dealt with via a simple lexer. You seem to be conflating the language with the implementation. –  Mar 06 '10 at 23:40
  • 1
    @Neil:Sorry, but no. Yes, C-style comments can be handled by a lexer. I'm not conflating the language with the implementation: I'm simply telling you what I know from experience -- I've *written* a lexer for C and C++ that deals with both styles of comments perfectly well. While that was non-trivial by most standards, compared to a C++ parser, "trivial" is *exactly* what it is. – Jerry Coffin Mar 06 '10 at 23:56
  • @Jerry Like you, I have written a lexer (and a compiler) for C. And like you I know what a lexer does - it produces a stream of tokens (or lexemes, if we want to be pedantic). What a lexer does not do is perform semantic analysis, which is what is required to handle block comments. This is my final post on this subject. –  Mar 07 '10 at 00:16
  • 7
    @Neil: Perhaps that's best -- you seem to be just repeating the same assertion, with no supporting evidence. You haven't even once pointed to what semantic analysis you think is needed to parse comments correctly, just repeated that it is (which the standard not only doesn't require, but doesn't really even allow). You substitute trigraphs, splice lines, then break the source into tokens and sequences of white space (including comments). If you try to take more semantics into account than that, you're doing it wrong... – Jerry Coffin Mar 07 '10 at 00:28
  • Isn't our goal usually to get more comments in code? – brian beuning Jul 16 '14 at 02:59
  • [my answer](https://stackoverflow.com/a/53551634/3625404) handles all practical cases. It works perfectly, as long as `/*`,`//`,`*/` don't split in two lines. Which is essentially a state machine with states: 1 part of string literal, 2 part of C style comment, 3 part of C++ style comment, 4 other. Handling line-continuation too. – qeatzy Nov 30 '18 at 10:32
  • See https://stackoverflow.com/a/13062682/1745001 for how to really do this robustly (and simply). – Ed Morton Jun 30 '19 at 17:45

12 Answers

124

Run the following command on your source file:

gcc -fpreprocessed -dD -E test.c

Thanks to KennyTM for finding the right flags. Here’s the result for completeness:

test.c:

#define foo bar
foo foo foo
#ifdef foo
#undef foo
#define foo baz
#endif
foo foo
/* comments? comments. */
// c++ style comments

gcc -fpreprocessed -dD -E test.c:

#define foo bar
foo foo foo
#ifdef foo
#undef foo
#define foo baz
#endif
foo foo
evandrix
Josh Lee
  • 4
    I think the result Mike expects is `#define foo bar\nfoo foo foo` – Pascal Cuoq Mar 06 '10 at 20:45
  • 4
    @Pascal: Run `gcc -fpreprocessed -dM -E test.c` to get the `#define`-s as well, but they're not in the original locations. – kennytm Mar 06 '10 at 20:49
  • 17
    I added -P to the gcc options to suppress the weird line markers that sometimes show up when our start of function comments are removed. – Dana Robinson Oct 26 '12 at 17:36
  • 2
    I also needed to add -P to get usable output. – James Johnston Dec 09 '15 at 20:34
  • Since `-fpreprocessed` suppresses line splicing, this method fails if a comment line is concatenated with the following line with a trailing `\ `. – jxh Jan 18 '16 at 22:56
  • 1
    I just tried it and it inlined the `#include`d files and replaced the commented lines with blank lines rather than deleting the comments. FWIW a combination of sed and gcc have always worked perfectly for me, see http://stackoverflow.com/a/13062682/1745001. – Ed Morton Feb 29 '16 at 19:46
  • 1
    `-fpreprocessed` is not available on clang – noɥʇʎԀʎzɐɹƆ Jan 10 '17 at 16:09
  • for doing it in place: `gcc -fpreprocessed -dD -E -P -o test.c.tmp test.c && mv test.c.tmp test.c`. I use `mv` because `gcc` refuse to override input. – idanp May 12 '18 at 07:20
  • A handy shortcut if you need to compare different revisions of a codebase: `find . -type f -name "*.c" -or -name "*.h" -exec gcc -fpreprocessed -dD -E -P {} \; > all_code.out` or just calc hash: `find . -type f -name "*.c" -or -name "*.h" -exec gcc -fpreprocessed -dD -E -P {} \; 2>/dev/null | sha512sum` – Sparkler Dec 04 '20 at 03:08
18

It depends on how perverse your comments are. I have a program scc to strip C and C++ comments. I also have a test file for it, and I tried GCC (4.2.1 on MacOS X) with the options in the currently selected answer - and GCC doesn't seem to do a perfect job on some of the horribly butchered comments in the test case.

NB: This isn't a real-life problem - people don't write such ghastly code.

Consider this subset (36 of 135 lines total) of the test case:

/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.

/\
\/ This is not a C++/C99 comment!

This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.

/\
\* This is not a C or C++ comment!

This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.

This is followed by regular C comment number 3.
/\
\
\
\
* C comment */

On my Mac, the output from GCC (gcc -fpreprocessed -dD -E subset.c) is:

/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.

/\
\/ This is not a C++/C99 comment!

This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.

/\
\* This is not a C or C++ comment!

This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.

This is followed by regular C comment number 3.
/\
\
\
\
* C comment */

The output from 'scc' is:

The regular C comment number 1 has finished.

/\
\/ This is not a C++/C99 comment!

This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.

/\
\* This is not a C or C++ comment!

This is followed by regular C comment number 2.

The regular C comment number 2 has finished.

This is followed by regular C comment number 3.

The output from 'scc -C' (which recognizes double-slash comments) is:

The regular C comment number 1 has finished.

/\
\/ This is not a C++/C99 comment!

This is followed by C++/C99 comment number 3.

The C++/C99 comment number 3 has finished.

/\
\* This is not a C or C++ comment!

This is followed by regular C comment number 2.

The regular C comment number 2 has finished.

This is followed by regular C comment number 3.
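
The butchered comments in the test case depend on backslash-newline splicing, which the C standard performs (translation phase 2) before comments are recognized (phase 3). The following is only a minimal sketch of that splicing step, not code taken from scc; the name `splice_lines` is made up for illustration, and trigraphs (phase 1) are ignored.

```c
#include <stddef.h>

/* Copy src to dst, deleting every backslash that is immediately followed
 * by a newline (translation phase 2). dst must have room for
 * strlen(src) + 1 bytes. Returns the number of characters written,
 * excluding the terminating NUL.
 * Hypothetical helper for illustration; trigraphs are not handled. */
size_t splice_lines(const char *src, char *dst)
{
    size_t n = 0;
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;               /* drop the backslash-newline pair */
        } else {
            dst[n++] = *src++;      /* everything else is copied as-is */
        }
    }
    dst[n] = '\0';
    return n;
}
```

After splicing, the first test comment reads as an ordinary `/*Regular ... */`, while `/\` followed by `\/` becomes `/\/`, which is not a comment opener. A real stripper cannot simply emit the spliced text, since that would also change the code outside comments; it has to look through the splices while copying the original characters.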

Source for SCC now available on GitHub

The current version of SCC is 6.60 (dated 2016-06-12), though the Git versions were created on 2017-01-18 (in the US/Pacific time zone). The code is available from GitHub at https://github.com/jleffler/scc-snapshots. You can also find snapshots of the previous releases (4.03, 4.04, 5.05) and two pre-releases (6.16, 6.50) — these are all tagged release/x.yz.

The code is still primarily developed under RCS. I'm still working out how I want to use sub-modules or a similar mechanism to handle common library files like stderr.c and stderr.h (which can also be found in https://github.com/jleffler/soq).

SCC version 6.60 attempts to understand C++11, C++14 and C++17 constructs such as binary constants, numeric punctuation, raw strings, and hexadecimal floats. It defaults to C11 mode operation. (Note that the meaning of the -C flag — mentioned above — flipped between version 4.0x described in the main body of the answer and version 6.60 which is currently the latest release.)
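
Raw strings matter for comment stripping because their contents can include `//`, `/*`, or unescaped quotes without being comments. As a rough sketch of how a stripper might skip one (this is an assumption about the general approach, not SCC's actual code; `skip_raw_string` is a hypothetical name), the closing sequence is `)` plus the delimiter plus `"`:

```c
#include <stddef.h>
#include <string.h>

/* Given p pointing at the opening double quote of a C++11 raw string
 * literal (the character right after the R prefix), return a pointer
 * just past the closing quote, or NULL if the literal is malformed or
 * unterminated. Hypothetical helper: the delimiter characters are not
 * validated beyond the 16-character limit. */
const char *skip_raw_string(const char *p)
{
    char closer[16 + 3];   /* ')' + up to 16 delimiter chars + '"' + NUL */
    const char *open, *end;
    size_t dlen;

    if (*p != '"')
        return NULL;
    open = strchr(p + 1, '(');
    if (open == NULL)
        return NULL;
    dlen = (size_t)(open - (p + 1));
    if (dlen > 16)
        return NULL;       /* the standard caps the delimiter at 16 chars */
    closer[0] = ')';
    memcpy(closer + 1, p + 1, dlen);
    closer[dlen + 1] = '"';
    closer[dlen + 2] = '\0';
    end = strstr(open + 1, closer);
    return end != NULL ? end + dlen + 2 : NULL;
}
```

For `R"x(a // b )" c)x"`, the function is called on the `"` after `R` and returns a pointer just past the final `"`, so the `//` inside the literal is never examined as a comment.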

evandrix
Jonathan Leffler
8

gcc -fpreprocessed -dD -E did not work for me but this program does it:

#include <stdio.h>

static void process(FILE *f)
{
 int c;
 while ( (c=getc(f)) != EOF )
 {
  if (c=='\'' || c=='"')            /* literal */
  {
   int q=c;
   do
   {
    putchar(c);
    if (c=='\\')                    /* copy the escaped character as-is */
    {
     if ((c=getc(f)) == EOF) break;
     putchar(c);
    }
    c=getc(f);
   } while (c!=q && c!=EOF);        /* stop at EOF in an unterminated literal */
   if (c!=EOF) putchar(c);
  }
  else if (c=='/')              /* opening comment ? */
  {
   c=getc(f);
   if (c!='*')                  /* no, recover */
   {
    putchar('/');
    ungetc(c,f);
   }
   else
   {
    int p;
    c = 0;
    putchar(' ');               /* replace comment with space */
    do
    {
     p=c;
     c=getc(f);
    } while (c!=EOF && (c!='/' || p!='*'));   /* stop at EOF in an unterminated comment */
   }
  }
  else
  {
   putchar(c);
  }
 }
}

int main(int argc, char *argv[])
{
 process(stdin);
 return 0;
}
librik
lhf
7

There is a stripcmt program that can do this:

StripCmt is a simple utility written in C to remove comments from C, C++, and Java source files. In the grand tradition of Unix text processing programs, it can function either as a FIFO (First In - First Out) filter or accept arguments on the command line.

(per hlovdal's answer to: question about Python code for this)

Community
che
  • 1
    The code still has some bugs. For example, it cannot handle code like `int /* comment // */ main()`. – pynexj Jul 16 '14 at 02:51
  • and have bugs when handling comments like `// comment out next line \ ` – sleepsort Sep 04 '14 at 15:53
  • [my answer](https://stackoverflow.com/a/53551634/3625404) handles these cases. It works perfectly, as long as `/*`,`//`,`*/` don't split in two lines. – qeatzy Nov 30 '18 at 10:31
4

This is a perl script to remove //one-line and /* multi-line */ comments

  #!/usr/bin/perl

  undef $/;
  $text = <>;

  $text =~ s/\/\/[^\n\r]*(\n\r)?//g;
  $text =~ s/\/\*+([^*]|\*(?!\/))*\*+\///g;

  print $text;

It requires your source file as a command-line argument. Save the script to a file, say remove_comments.pl, and call it using the following command: perl -w remove_comments.pl [your source file]

Hope it will be helpful

Vladimir
3

I had this problem as well. I found this tool (Cpp-Decomment), which worked for me. However, it does not recognize a comment that continues onto the next line via a trailing backslash. E.g.:

// this is my comment \
comment continues ...

In this case, I couldn't find a way to handle it in the program, so I just searched for the lines it had missed and fixed them manually. I believe there may be an option for that, or you could change the program's source to handle it.
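
For anyone patching a tool to cope with this case, the check itself is small. The following is only a sketch under the assumption that the whole file is in memory as a NUL-terminated string; `end_of_line_comment` is a hypothetical name and is not part of Cpp-Decomment. Only an exact backslash-newline pair is treated as a continuation (no `\r\n` line endings, no trailing spaces after the backslash).

```c
/* Given p pointing at the "//" that opens a line comment, return a
 * pointer to the newline (or terminating NUL) that ends the comment.
 * A backslash immediately before a newline splices the next line into
 * the comment. Hypothetical helper for illustration. */
const char *end_of_line_comment(const char *p)
{
    while (*p != '\0') {
        if (p[0] == '\\' && p[1] == '\n') {
            p += 2;                 /* continuation: the comment carries on */
        } else if (*p == '\n') {
            break;                  /* unescaped newline ends the comment */
        } else {
            p++;
        }
    }
    return p;
}
```

Everything from the opening `//` up to (but not including) the returned position can then be discarded.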

Halil
2

Because you use C, you might want to use something that's "natural" to C. You can use the C preprocessor to just remove comments. The examples given below work with the C preprocessor from GCC. They should work the same or in similar ways with other C preprocessors as well.

For C, use

cpp -dD -fpreprocessed -o output.c input.c

It also works for removing comments from JSON, for example like this:

cpp -P -o - - <input.json >output.json

In case your C preprocessor is not accessible directly, you can try to replace cpp with cc -E, which calls the C compiler telling it to stop after the preprocessor stage. In case your C compiler binary is not cc you can replace cc with the name of your C compiler binary, for example clang. Note that not all preprocessors support -fpreprocessed.

Christian Hujer
1

I wrote a C program of around 200 lines, using only the standard C library, which removes comments from a C source file: qeatzy/removeccomments

behavior

  1. A C-style comment that spans multiple lines or occupies an entire line gets zeroed out.
  2. A C-style comment in the middle of a line remains unchanged, e.g. void init(/* do initialization */) {...}
  3. A C++-style comment that occupies an entire line gets zeroed out.
  4. C string literals are respected, by checking for " and \".
  5. Line continuation is handled: if the previous line ends with \, the current line is part of the previous line.
  6. Line numbers remain the same. Zeroed-out lines or parts of lines become empty.
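
The idea of keeping line numbers intact can be illustrated with a simplified in-place variant. This is a sketch written for this explanation, not the removeccomments source: unlike the behavior listed above it also blanks mid-line comments, it overwrites comment characters with spaces instead of deleting them, and it ignores line continuation. The name `blank_comments` is hypothetical.

```c
/* Overwrite every comment character in buf with a space, but leave
 * newlines alone, so line numbers (and columns) are unchanged.
 * Hypothetical sketch: string and character literals are skipped;
 * line splices are not handled. */
void blank_comments(char *buf)
{
    char *p = buf;
    while (*p != '\0') {
        if (*p == '"' || *p == '\'') {            /* skip a literal */
            char quote = *p++;
            while (*p != '\0' && *p != quote) {
                if (*p == '\\' && p[1] != '\0')
                    p++;                          /* step over the backslash */
                p++;                              /* step over the character */
            }
            if (*p == quote)
                p++;
        } else if (p[0] == '/' && p[1] == '*') {  /* block comment */
            p[0] = p[1] = ' ';
            p += 2;
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/')) {
                if (*p != '\n')
                    *p = ' ';                     /* keep line breaks */
                p++;
            }
            if (*p != '\0') {                     /* blank the closing pair */
                p[0] = p[1] = ' ';
                p += 2;
            }
        } else if (p[0] == '/' && p[1] == '/') {  /* line comment */
            while (*p != '\0' && *p != '\n')
                *p++ = ' ';
        } else {
            p++;
        }
    }
}
```

Because the buffer length never changes, compiler diagnostics for the stripped file still point at the same line and column as in the original.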

testing & profiling

I tested with the largest CPython source file, which contains many comments. In this case it does the job correctly and fast, 2-5 times faster than gcc:

time gcc -fpreprocessed -dD -E Modules/unicodeobject.c > res.c 2>/dev/null
time ./removeccomments < Modules/unicodeobject.c > result.c

usage

/path/to/removeccomments < input_file > output_file
qeatzy
0

I believe you can easily remove comments from C with a single statement:

perl -i -pe 's/\/\*(.*)//g' file.c This command removes /* C-style comments.
perl -i -pe 's/\/\/(.*)//g' file.cpp This command removes // C++-style comments.

The only problem with these commands is that they can't remove comments that span more than one line, but using this regex you can easily implement the logic for removing multi-line comments.

0

Recently I wrote some Ruby code to solve this problem. I have considered the following special cases:

  • comments in strings
  • a multi-line comment (/* ... */) on a single line, fixing the greedy match
  • a multi-line comment spanning multiple lines

Here is the code:

It uses the following placeholders to preprocess each line in case comment markers appear in strings. If one of these placeholders appears in your code, uh, bad luck. You can replace it with a more complex string.

  • MUL_REPLACE_LEFT = "MUL_REPLACE_LEFT"
  • MUL_REPLACE_RIGHT = "MUL_REPLACE_RIGHT"
  • SIG_REPLACE = "SIG_REPLACE"

USAGE: ruby -w inputfile outputfile

evandrix
chunyang.wen
-1

I know it's late, but I thought I'd share my code and my first attempt at writing a compiler.

Note: this does not account for "*/" inside a multiline comment, e.g. /*...."*/"...*/. Then again, gcc 4.8.1 doesn't either.

void function_removeComments(char *pchar_sourceFile, long long_sourceFileSize)
{
    long long_sourceFileIndex = 0;
    long long_logIndex = 0;

    int int_EOF = 0;

    for (long_sourceFileIndex=0; long_sourceFileIndex < long_sourceFileSize;long_sourceFileIndex++)
    {
        if (pchar_sourceFile[long_sourceFileIndex] == '/' && int_EOF == 0)
        {
            long_logIndex = long_sourceFileIndex;  // log "possible" start of comment

            if (long_sourceFileIndex+1 < long_sourceFileSize)  // array bounds check given we want to peek at the next character
            {
                if (pchar_sourceFile[long_sourceFileIndex+1] == '*') // multiline comment
                {
                    for (long_sourceFileIndex+=2;long_sourceFileIndex < long_sourceFileSize; long_sourceFileIndex++)
                    {
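                        if (long_sourceFileIndex+1 < long_sourceFileSize && pchar_sourceFile[long_sourceFileIndex] == '*' && pchar_sourceFile[long_sourceFileIndex+1] == '/')  // bounds check before peeking at the next character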
                        {
                            // since we've found the end of multiline comment
                            // we want to increment the pointer position two characters
                            // accounting for "*" and "/"
                            long_sourceFileIndex+=2;  

                            break;  // terminating sequence found
                        }
                    }

                    // didn't find terminating sequence so it must be eof.
                    // set file pointer position to initial comment start position
                    // so we can display file contents.
                    if (long_sourceFileIndex >= long_sourceFileSize)
                    {
                        long_sourceFileIndex = long_logIndex;

                        int_EOF = 1;
                    }
                }
                else if (pchar_sourceFile[long_sourceFileIndex+1] == '/')  // single line comment
                {
                    // since we know its a single line comment, increment file pointer
                    // until we encounter a new line or its the eof 
                    for (long_sourceFileIndex++; pchar_sourceFile[long_sourceFileIndex] != '\n' && pchar_sourceFile[long_sourceFileIndex] != '\0'; long_sourceFileIndex++);
                }
            }
        }

        printf("%c",pchar_sourceFile[long_sourceFileIndex]);
     }
 }
Jonathan Leffler
johnny
  • I'm curious about your "doesn't handle" comment. I can't make out what you think it doesn't handle. Note that once `/*` has been processed, the next unspaced character sequence `*/` terminates the comment; there are no escape mechanisms inside a comment — which may be what you mean by GCC not handling it either. Your code has problems with `"/* Magritte notes: Ceci n'est pas une commentaire */"` (because it is a string literal, not a comment — but he was talking about pipes, not comments). – Jonathan Leffler Jan 28 '16 at 03:42
-3
#include<stdio.h>
{        
        int c;  // int rather than char, so that EOF can be told apart from valid characters
        char tmp = '\0';
        int inside_comment = 0;  // A flag to check whether we are inside comment
        while((c = getchar()) != EOF) {
                if(tmp) {
                        if(c == '/') {
                                while((c = getchar()) !='\n');
                                tmp = '\0';
                                putchar('\n');
                                continue;
                        }else if(c == '*') {
                                inside_comment = 1;
                                while(inside_comment) {
                                        while((c = getchar()) != '*');
                                        c = getchar();
                                        if(c == '/'){
                                                tmp = '\0';
                                                inside_comment = 0;
                                        }
                                }
                                continue;
                        }else {
                                putchar(c);
                                tmp = '\0';
                                continue;
                        }
                }
                if(c == '/') {
                        tmp = c;
                } else {
                        putchar(c);
                }
        }
        return 0;
}

This program handles both kinds of comments, i.e. // and /*...*/

  • 6
    Several problems. 1. You're missing `int main(void)`. 2. It doesn't handle comment delimiters inside string literals and character constants. 3. It deletes single `/` character (try running it on its own source code). – Keith Thompson Aug 19 '13 at 20:41