4

I know this question has been asked before (e.g., see Remove comments from C/C++ code), but I haven't found a satisfactory answer.

I am parsing a body of complex C/C++ code that must first be normalized, which includes removing all comments from the input source.

All decommenting tools I have tried failed to a certain degree, and that includes:

  • decomment
  • stripcmt
  • cloc

Note: I have also tried "gcc -fpreprocessed -E", but it does not lead to a perfect result; the output has some weird macro annotations for keeping track of certain lines of code.

To illustrate the problem with a particular tool (cloc), removing comments from this header file removes non-comments as well, such as all the includes at the beginning of that file.

That said, is there any reliable tool that can strip comments from exceptionally complex code?

Much appreciated.

leco
  • 1
    As Dana Robinson noted in a comment to [this answer](http://stackoverflow.com/a/2394040/315052), you can add the `-P` flag as well to suppress the line number tracking annotations. – jxh Aug 12 '13 at 22:34
  • True... using -P seems to fix the problem :) – leco Aug 12 '13 at 23:00
  • If you are parsing C/C++, why is lexing/tossing out comments difficult? This should be a piece of cake at the lexical level. Are you *really* parsing C++? [Check my bio for prettyprinters that can eliminate comments] – Ira Baxter Aug 12 '13 at 23:27
  • Ira, I would be careful with that claim: all three tools I mentioned are pretty standard, yet all of them failed. Although it is not the hardest problem on Earth, building the automaton takes time (unless you use jflex). To answer your question, parsing is done after decommenting. – leco Aug 12 '13 at 23:53
  • Why don't you write a simple shell script that strips the comments out? It should be simple, since you are only looking for two things: from a double slash (//) to a line feed, and from (/*) until (*/). – Charles D Pantoga Aug 12 '13 at 23:56
  • Nope, these characters can exist inside strings, and thus need to be tracked as well. When handling strings, one also has to handle escape characters... – leco Aug 12 '13 at 23:59
  • Using gcc in the end turned out to be a simple and fast solution, instead of having to create my jflex spec to handle that. – leco Aug 13 '13 at 00:00
  • 1
    Why do you ask? Perhaps customizing GCC e.g. with [MELT](http://gcc-melt.org/) could be useful! – Basile Starynkevitch Aug 13 '13 at 06:11
  • @LeonardoPassos: Ira is one of the most knowledgeable persons when it comes to parsing. His (commercial) tools can do far more advanced transforms. – MSalters Aug 13 '13 at 08:19
  • @MSalters: please don't take this the wrong way... I have a good knowledge of parsing as well, and have even built a compiler-compiler tool myself. But that is not the point here... I was just deeply surprised that standard tools could not do the job, and wanted a quick solution to the problem – leco Aug 13 '13 at 15:43
  • @Basile: the point of asking was to see if there is any reliable tool available. Thanks for pointing out MELT; I did not know it, and it seems quite useful :) – leco Aug 13 '13 at 15:44
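
The exchange above about `//` appearing inside strings can be made concrete. The following is a minimal sketch of the naive approach (cut each line at the first `//`); the helper name `naive_strip_line` is hypothetical and exists only for illustration.

```c
#include <string.h>

/* Naive approach (hypothetical helper, for illustration only):
 * truncate the line at the first occurrence of "//", without
 * checking whether that occurrence is inside a string literal. */
void naive_strip_line(char *line)
{
    char *p = strstr(line, "//");
    if (p != NULL)
        *p = '\0';
}
```

Applied to the line `const char *url = "http://example.com";`, this cuts the text in the middle of the string literal and leaves `const char *url = "http:`, which no longer compiles. Any approach that does not track string and character literals has the same weakness.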

2 Answers

3
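#!/bin/bash
# Strip comments from a C/C++ file by running only GCC's preprocessor.

if [[ "$#" != 1 ]] ; then
  echo "Usage: stripcomments input-file" >&2
  exit 1
fi

# -fpreprocessed: treat the input as already preprocessed (no macro expansion)
# -dD: keep #define directives in the output
# -E:  stop after preprocessing
# -P:  do not emit linemarkers
gcc -fpreprocessed -dD -E -P "$1" 2> /dev/null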
leco
-1

You could remove everything from // to the end of the line, and everything from /* to */, with a couple of regexes if you wanted...

For single line comments, you could use: \/\/(.*)

For multi-line comments, this: \/\*(.*)\*\/

MarcusJ
  • And if the program has a string "//abc" your scheme damages the string. – Ira Baxter Aug 11 '16 at 02:47
  • What do you mean by that? single line comments continue till the end of the line, so what exactly is it damaging? – MarcusJ Aug 11 '16 at 06:00
  • 1
    A dumb regex (yours, sorry) will see the characters // in a C literal string, and think they start a comment when they are just part of the literal string. This is why you need to build a lexer for the language, carefully, and not just hack your way around with oversimplified regexes. – Ira Baxter Aug 11 '16 at 07:51
  • No need to be sorry, I wrote them in seconds, and I tried using \w at the end but it messed it up and I was tired of writing out the answer to be completely honest. I didn't think of that case, I was like "well C comments always end at EOL so it'll work" without thinking about comments, good catch. – MarcusJ Aug 11 '16 at 10:01
  • Without thinking about comments in strings* – MarcusJ Aug 11 '16 at 17:11
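
The objection raised in the comments above, that `//` and `/*` may appear inside string literals, is usually handled with a small state machine rather than a regex. The following is a minimal sketch in C, not a replacement for the `gcc` approach: the function name `strip_comments` is hypothetical, and the sketch assumes no trigraphs, no C++11 raw string literals, no C++14 digit separators, and no backslash-newline splices inside the comment delimiters themselves.

```c
#include <stddef.h>

/* Copies src into dst with C and C++ comments removed.
 * dst must have room for strlen(src) + 1 bytes; the output never
 * grows, because each comment is dropped or replaced by one space.
 * Sketch only: see the assumptions listed in the text above. */
void strip_comments(const char *src, char *dst)
{
    enum { CODE, LINE_COMMENT, BLOCK_COMMENT, STRING, CHAR_LIT } state = CODE;
    size_t i = 0, o = 0;

    while (src[i] != '\0') {
        char c = src[i];
        char next = src[i + 1];   /* safe: src[i] is not the terminator */

        switch (state) {
        case CODE:
            if (c == '/' && next == '/') { state = LINE_COMMENT; i += 2; continue; }
            if (c == '/' && next == '*') { state = BLOCK_COMMENT; i += 2; continue; }
            if (c == '"')  state = STRING;
            if (c == '\'') state = CHAR_LIT;
            dst[o++] = c;
            break;
        case LINE_COMMENT:
            /* A backslash-newline continues the comment on the next line. */
            if (c == '\\' && next == '\n') { i += 2; continue; }
            if (c == '\n') { state = CODE; dst[o++] = c; }  /* keep the newline */
            break;
        case BLOCK_COMMENT:
            if (c == '*' && next == '/') {
                state = CODE;
                dst[o++] = ' ';   /* a comment acts as one space in C */
                i += 2;
                continue;
            }
            break;
        case STRING:
        case CHAR_LIT:
            dst[o++] = c;
            /* Copy an escaped character verbatim so \" or \' does not end the literal. */
            if (c == '\\' && next != '\0') { dst[o++] = next; i += 2; continue; }
            if ((state == STRING && c == '"') || (state == CHAR_LIT && c == '\''))
                state = CODE;
            break;
        }
        i++;
    }
    dst[o] = '\0';
}
```

With this sketch, the input `char *s = "//abc"; // note` keeps the string literal intact and loses only the trailing comment. The literal states are the part a plain regex lacks: while inside a string or character literal, comment delimiters are copied through untouched.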