Reading a C source file and skipping /**/ comments

Question

I managed to write code to skip // comments in C source:

while (fgets(string, 10000, fin) != NULL)
{
    unsigned int i;
    for (i = 0; i < strlen(string); i++)
    {
        if ((string[i] == '/') && (string[i + 1] == '/'))
        {
            while (string[i += 1] != '\n')
                continue;
        } 
    //rest of the code...

I've tried to do a similar thing for /**/ comments:

if ((string[i] == '/') && (string[i + 1] == '*'))
{
    while (string[i += 1] != '/')
        continue;
}

if ((string[i] == '*') && (string[i + 1] == '/'))
{
    while (string[i -= 1])
        continue;
}

But it reads line by line and if I have, for example,

/*

text*/

then it counts the text.

How do I fix this?

Save the state to a variable and test for it in the following iterations. — Iharob Al Asimi, Jan 08 '15 at 19:12
The `string[i += 1]` notation is an unconventional way of writing `string[++i]`. Also, the test for newline is modestly pointless; `fgets()` read a line, but only one line, so the comment continues to the end of the string. I won't bore you with all the special cases your code doesn't handle (`"/* not a comment */"`, `"// not a comment"`, backslashes at the ends of lines, trigraphs, etc.). There are other (multiple other) questions on this topic. Finding a good one to duplicate this to will be harder. — Jonathan Leffler, Jan 08 '15 at 19:40
The C preprocessor will strip all comments correctly. I have a shell script that uses GCC's C preprocessor to remove comments, but it also reformats the program some. — yellowantphil, Jan 08 '15 at 19:44
Amongst other questions on this topic, see: [Remove comments from C/C++ code](http://stackoverflow.com/questions/2394017/) and [Python snippet to remove C and C++ comments](http://stackoverflow.com/questions/241327/python-snippet-to-remove-c-and-c-comments/242107#242107). The second outlines a number of issues that production strength code needs to deal with. — Jonathan Leffler, Jan 08 '15 at 20:24
Just for your amusement (or do I mean 'angst'), I've discovered a new horrid trick for 'this is not a comment even though it looks a bit like one'. `#include <./*some*/header.h>` includes a file `header.h` from a directory `./*some*` (at least with GCC 4.9.1 on Mac OS X 10.10.1). Worse would be `#include <./*some/header.h>` which would look in the directory `./*some` for `header.h`. Both are apt to send naïve C comment parsers off on the wrong track. You should also be wary of `#include ` which does not contain a C++-style comment. I've got some fixup work to do on my code! — Jonathan Leffler, Jan 10 '15 at 19:45

score 3 · Answer 1 · answered Jan 08 '15 at 19:32

Even your supposedly-working code has several problems:

It does not recognize any context, so it will treat // appearing within a string constant or within a /* ... */ comment as the beginning of a comment.
In the unlikely event that you happen to have very long lines, they will be truncated (including their terminating newlines).

In the end, C is a stream-oriented language, not a line-oriented language. It should be parsed that way (character by character). To do the job right, you really need to implement a much more sophisticated parser. If you're up for learning a new tool, then you could consider basing your program on the Flex lexical analyzer.
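
A minimal sketch of the character-by-character approach described above, assuming the whole source is already in memory as a NUL-terminated string. The function name `strip_comments` and its state names are made up for illustration. It tracks string and character literals so that delimiters inside them are left alone, and it replaces each /* ... */ comment with a single space. It does not attempt backslash-newline splices, trigraphs, or `#include <...>` header names.

```c
#include <stddef.h>

/* Hypothetical helper (not from the answer): copies src to dst with
 * comments removed and returns the number of characters written.
 * dst must have room for at least strlen(src) + 1 bytes. */
size_t strip_comments(const char *src, char *dst)
{
    enum { CODE, SLASH, BLOCK, BLOCK_STAR, LINE,
           STR, STR_ESC, CHR, CHR_ESC } st = CODE;
    size_t n = 0;

    for (; *src != '\0'; src++) {
        char c = *src;
        switch (st) {
        case CODE:
            if (c == '/') { st = SLASH; break; }   /* hold the '/' back */
            if (c == '"') st = STR;
            else if (c == '\'') st = CHR;
            dst[n++] = c;
            break;
        case SLASH:                                /* we have read "/" */
            if (c == '*') { st = BLOCK; break; }
            if (c == '/') { st = LINE; break; }
            dst[n++] = '/';                        /* it was ordinary code */
            st = CODE;
            src--;                                 /* reprocess c as code */
            break;
        case BLOCK:                                /* inside a block comment */
            if (c == '*') st = BLOCK_STAR;
            break;
        case BLOCK_STAR:                           /* block comment, just saw '*' */
            if (c == '/') { dst[n++] = ' '; st = CODE; }
            else if (c != '*') st = BLOCK;
            break;
        case LINE:                                 /* inside a line comment */
            if (c == '\n') { dst[n++] = c; st = CODE; }
            break;
        case STR:                                  /* inside "..." */
            dst[n++] = c;
            if (c == '\\') st = STR_ESC;
            else if (c == '"') st = CODE;
            break;
        case STR_ESC:                              /* character after a backslash */
            dst[n++] = c;
            st = STR;
            break;
        case CHR:                                  /* inside '...' */
            dst[n++] = c;
            if (c == '\\') st = CHR_ESC;
            else if (c == '\'') st = CODE;
            break;
        case CHR_ESC:
            dst[n++] = c;
            st = CHR;
            break;
        }
    }
    if (st == SLASH)
        dst[n++] = '/';                            /* lone '/' at end of input */
    dst[n] = '\0';
    return n;
}
```

Called on `puts("/* no */"); // c`, this should keep the string literal intact and drop only the trailing comment. Because the state lives in one variable for the whole input, a comment that spans several lines needs no special handling.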

John Bollinger


To strip only comments, he doesn't need a complete C parser. In fact, comments are commonly stripped during preprocessing. – Luis Colorado Jan 09 '15 at 05:45
@LuisColorado: no, he doesn't need a complete C parser. I didn't say he did. He certainly *does* need something sophisticated, though: it needs to be able to recognize enough C syntactic constructs to be able to tell when comment delimiters function as such, and when not. – John Bollinger Jan 09 '15 at 14:56

score 2 · Answer 2 · edited Jun 20 '20 at 09:12

A simple regular expression for a C comment is:

/\/\*([^*]|\*+[^*\/])*\*+\//

(Sorry for the escape characters) This allows any sequence inside a comment except */. It translates to the following DFA (four states):

state 0, input /, next state 1, output none
state 0, input other, next state 0, output read char
state 1, input *, next state 2, output none
state 1, input /, next state 1, output /
state 1, input other, next state 0, output / and read char
state 2, input *, next state 3, output none
state 2, input other, next state 2, output none
state 3, input /, next state 0, output none
state 3, input *, next state 3, output none
state 3, input other, next state 2, output none

The possible inputs are /, * and any other character. The possible outputs are none, the character just read, /, and / followed by the character just read.

This translates to the following code:

file uncomment.c:

#include <stdio.h>

int main()
{
    int c, st = 0;
    while ((c = getchar()) != EOF) {
        switch (st) {
        case 0: /* initial state */
            switch (c) {
            case '/': st = 1; break;
            default: putchar(c); break;
            } /* switch */
            break;
        case 1: /* we have read "/" */
            switch (c) {
            case '/': putchar('/'); break;
            case '*': st = 2; break;
            default: putchar('/'); putchar(c); st = 0; break;
            } /* switch */
            break;
        case 2: /* we have read "/*" */
            switch (c) {
            case '*': st = 3; break;
            default: break;
            } /* switch */
            break;
        case 3: /* we have read "/* ... *" */
            switch (c) {
            case '/': st = 0; break;
            case '*': break;
            default: st = 2; break;
            } /* switch */
            break;
        } /* switch */
    } /* while */
    if (st == 1) putchar('/'); /* flush a lone "/" pending at end of input */
    return 0;
} /* main */

In case you want to exclude both types of comments, we need to switch to a fifth state when receiving a second /, resulting in the following code:

file uncomment2.c:

#include <stdio.h>

int main()
{
    int c, st = 0;
    while ((c = getchar()) != EOF) {
        switch (st) {
        case 0: /* initial state */
            switch (c) {
            case '/': st = 1; break;
            default: putchar(c); break;
            } /* switch */
            break;
        case 1: /* we have read "/" */
            switch (c) {
            case '/': st = 4; break;
            case '*': st = 2; break;
            default: putchar('/'); putchar(c); st = 0; break;
            } /* switch */
            break;
        case 2: /* we have read "/*" */
            switch (c) {
            case '*': st = 3; break;
            default: break;
            } /* switch */
            break;
        case 3: /* we have read "/* ... *" */
            switch (c) {
            case '/': st = 0; break;
            case '*': break;
            default: st = 2; break;
            } /* switch */
            break;
        // in the next line we put // inside an `old' comment
        // to illustrate this special case.  The switch has been put
        // after the comment to show it is not being commented out.
        case 4: /* we have read "// ..." */ switch(c) {
            case '\n': st = 0; putchar('\n'); break;
            } // switch  (to illustrate this kind of comment).
        } /* switch */
    } /* while */
    if (st == 1) putchar('/'); /* flush a lone "/" pending at end of input */
    return 0;
} /* main */

Yes, very good. But what if the comment delimiters appear inside a string literal: `puts("/* ... */")`? Or inside a multi-character char literal? (Ew.) In any case, you've made the same points I did: the source needs to be parsed on a character-by-character basis, and the parsing needs to be more sophisticated than just scanning for the delimiters. — John Bollinger, Jan 09 '15 at 15:10
Your final listed state 'state 3, input other, next state 3, output none' should be 'state 3, input other, next state 2, output none', shouldn't it? Otherwise it prematurely terminates a comment such as `/* any * thing / goes */` (because it remembers that it found a `*` and then when it gets a `/`, it terminates the comment). And, indeed, your code implements the corrected version of the last state, so I've edited the specified DFA to match what was implemented. — Jonathan Leffler, Jan 10 '15 at 20:11
@JonathanLeffler, thank you for your edit. Fortunately, the code was OK. I checked the code just before posting, but couldn't do the same with the text. Sorry. — Luis Colorado, Jan 12 '15 at 13:20
@JohnBollinger, you are completely right, we have to check for "-delimited strings. In the case of constant character literals, I'm afraid none of the `/*`, `*/` and `//` sequences are allowed as character constants. The case of strings is complex, as we also have to deal with escaped `\"` inside them. In either case, the automaton is not too complex and can be derived from this one as an exercise for the reader :) — Luis Colorado, Jan 12 '15 at 13:27

Meninx - メネンックス · Answer 3 · 2015-01-08T20:07:26.883

0

This simple code skips /* */ comments. It doesn't handle every case; for instance, a /* inside a quoted string literal in the C code is still treated as the start of a comment.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *file = fopen("file", "r"); // open the file
    bool comment = false;            // true while we are inside a /* */ comment
    char str[1001];                  // holds one portion of the file at a time

    if (file == NULL)
    {
        perror("file");
        return 1;
    }

    while (fgets(str, sizeof(str), file) != NULL)
    {
        size_t len = strlen(str);
        for (size_t i = 0; i < len; i++)
        {
            if (!comment && str[i] == '/' && str[i + 1] == '*')
            {
                comment = true; // ignore everything until the end of the comment
                i++;            // skip the * character
            }
            else if (comment && str[i] == '*' && str[i + 1] == '/')
            {
                comment = false;
                i++; // skip the / character
            }
            else if (!comment)
            {
                putchar(str[i]); // not inside a comment: print the character
            }
        }
    }
    fclose(file);

    return 0;
}

edited Jan 08 '15 at 20:07

answered Jan 08 '15 at 19:36


*"doesn't treat all the cases"* - which cases? – Weather Vane Jan 08 '15 at 19:51
Note that you should use `sizeof(str)` as the argument to `fgets()`, and it already knows that if you specify 1001 as the size (via `sizeof(str)`), then it must use the last byte for a terminating null byte. – Jonathan Leffler Jan 08 '15 at 19:52
@WeatherVane: Amongst others, it doesn't handle comment start characters in a string literal (or a multi-character character literal). – Jonathan Leffler Jan 08 '15 at 19:53
@JonathanLeffler I was hoping Meninx would explain that. – Weather Vane Jan 08 '15 at 19:55
@WeatherVane Honestly, I wasn't aware of that case when I wrote the code, but after reading John Bollinger's answer I realized that there are too many cases that need to be handled, especially if the file contains complicated C code :) ! Thanks to both you and Jonathan Leffler! – Meninx - メネンックス Jan 08 '15 at 20:03

JJoao · Answer 4 · 2015-01-27T19:20:51.010

(It is not very clear what your program is trying to do.)

Using flex to count the number of characters outside comments:

%option noyywrap

%%
   int i = 0;

\"([^\\"]|\\.)*\"          { i += yyleng ; }       // treatment of strings
\/\/.*                     {               }       // C++ comments
\/\*([^*]|\*+[^*/])*\*+\/  {               }       // C  comments
.|\n                       { i += yyleng ; }       // normal chars

<<EOF>>                    { printf("%d\n",i); return 0;}
%%

int main(){ 
  yylex(); 
  return 0;}

and

$ flex count-non-com.fl
$ cc -o count-non-com lex.yy.c
$ ./count-non-com < input

One last example: flex code to remove comments (thanks @LuisColorado)

%option noyywrap 
%%

\"([^\\"]|\\.)*\"          { ECHO; }       // treatment of strings
\/\/.*                     {       }       // C++ comments
\/\*([^*]|\*+[^*/])*\*+\/  {       }       // C  comments
.|\n                       { ECHO; }       // normal chars

%%

int main(){ 
  yylex(); 
  return 0;}

@LuisColorado, thank you! If I understand correctly, you edited my code but the edit was rejected. I have seen it now and it has some good contributions. I tried to reconcile the two versions. — JJoao, Jan 27 '15 at 19:19

score -1 · Answer 5 · answered Jan 08 '15 at 20:12

Make an int variable. Scan the characters and store the index if you get /*. Continue scanning until you get */. If the variable !=0 at that time, then assume this is the closing comment token and ignore the characters in between.
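
One way to read the description above, as a sketch only. It assumes the whole file has been read into a single NUL-terminated buffer, so a comment may span several lines. The function name `remove_block_comments` is hypothetical, and it remembers the opening delimiter as a pointer (with NULL meaning "none found") rather than as an int index compared against 0, since a comment can legitimately start at index 0. Like the description, it ignores string literals and // comments.

```c
#include <string.h>

/* Hypothetical helper (not from the answer): deletes every block comment
 * from buf in place. buf is assumed to hold the whole file as one
 * NUL-terminated string. */
void remove_block_comments(char *buf)
{
    char *open = buf;

    /* Remember where the opening delimiter starts... */
    while ((open = strstr(open, "/*")) != NULL) {
        /* ...then scan forward for the closing delimiter. */
        char *close = strstr(open + 2, "*/");
        if (close == NULL)
            break;  /* assumption: leave an unterminated comment untouched */
        /* Slide the text after the comment down over it, NUL included. */
        memmove(open, close + 2, strlen(close + 2) + 1);
    }
}
```

Searching for the closing delimiter from `open + 2` keeps a sequence such as `/*/` from being mistaken for a complete comment.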

user279599


score -1 · Answer 6 · answered Jan 08 '15 at 21:04

As user279599 just said, use an integer variable as a flag: whenever you see '/' and '*' consecutively, set the flag (flag = 1); it stays 1 until you see '*' and '/' consecutively. Ignore every character while the flag is 1.

Kantajit

