Distinguish commented code vs valid comments

Question

I have to work with a project that has tons of commented code everywhere. Before I introduce any changes I would like to do a basic clean-up and remove old unused code.

I could just use the solution from this accepted answer to remove all comments, but...

There are legitimate comments (not commented-out code) that explain stuff. I don't want to remove those. For example:

// Those parameters control foo and bar... <- valid comment
int t = 5;
// int t = 10;  <- commented code
int k = 2*t;

Only line 3 should be removed.

What are the possible ways of analyzing the code and distinguishing between comments in natural language and commented-out lines of code?

Maybe you can rig something with your version control system and remove any line to which `//` was prepended. — François Andrieux, Jan 10 '19 at 15:49
Can you give more details about the commented code's scope? Could it contain control expressions, or is it just statements? — ndrwnaguib, Jan 10 '19 at 15:54
@FrançoisAndrieux Well, this could be a good direction to go, but it requires some advanced mangling with git. I was rather thinking about checking whether each commented line is syntactically correct code; if so, it can be removed. — hans, Jan 10 '19 at 15:59
Sorry, tool recommendations questions are not for stackoverflow. — Max Langhof, Jan 10 '19 at 15:59
@hans Whether a given line is syntactically correct C++ code or not generally requires parsing the entire program. You could technically write some `clang` tool that does this but it's probably way faster to filter comments manually. — Max Langhof, Jan 10 '19 at 16:01
@hans Checking if a given line is proper c++ might not be so easy. You have to assume some of those comments contain complex lines, lines that rely on identifiers that don't exist anymore or even code that contains errors and could not be identified as valid c++. — François Andrieux, Jan 10 '19 at 16:02
@AndrewNaguib it can be anything, even preprocessor directives. — hans, Jan 10 '19 at 16:03
Man, I hate when people leave dead code lying around. That's what version control is for! Good on you for trying to clean it up. But, um, good luck ::( — Lightness Races in Orbit, Jan 10 '19 at 16:13
@LightnessRacesinOrbit It drives me crazy too. Programming is an art, not a sh**ing with whatever works. I will probably have to deal with more projects like this, so a handy tool would be useful. — hans, Jan 10 '19 at 16:46
@hans You absolutely will. However a better approach would be to catch things like this during peer review, rather than letting it get this bad (because you'll be at this forever and I doubt a tool will ever get the job done quite frankly). Of course you are not always part of a project from the start though. — Lightness Races in Orbit, Jan 10 '19 at 16:47
I occasionally use pseudocode in a comment, and sometimes that pseudocode looks very much like code. For example, what would be a one-liner in python or R might expand to many lines (sometimes dozens of lines) in C or C++. Those one-liner comments need to remain, even if they look exactly like code. — David Hammen, Jan 10 '19 at 17:33
@LightnessRacesinOrbit - Hate that too. But as you wrote, um, good luck. This is a very hard problem. Detecting code commented out with `#if 0 ... #endif` is almost trivial. (Almost trivial because of the `#if 1 ... #else ... #endif` problem.) OTOH, detecting code commented out with comments is highly nontrivial. — David Hammen, Jan 10 '19 at 17:43
@DavidHammen Indeed, to the extent that personally I wouldn't even try to do it programmatically. I'd just clean things up as I encounter them, over time. — Lightness Races in Orbit, Jan 10 '19 at 17:47
Do be aware that even real C++ (as opposed to pseudo-code) can be "valid" in a comment - it might not be dead code, just a documenting explanation that there may be another way to do something, with its own pros/cons, which could be considered some day (along with some prose to that effect). That's not automatically worthy of deletion. You're really going to struggle to get a computer to make those decisions for you, in general. — Lightness Races in Orbit, Jan 10 '19 at 17:48

ndrwnaguib · Accepted Answer · 2019-01-12T23:55:57.917

This is a basic approach, but it serves as a proof of concept of what might be done. I use Bash together with GCC's -fsyntax-only option.

Here is the bash script:

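#!/bin/bash
# Usage: ./script.sh file.cpp
n=0        # line number in the original file
removed=0  # lines deleted so far (later lines shift up by this much)
while IFS='' read -r line || [[ -n "$line" ]]; do
    n=$((n + 1))
    # Text after the first "//" on the line, if any
    LINE=$(printf '%s\n' "$line" | grep -oP '(?<=//).*')
    if [[ -n "$LINE" ]]; then
        # Exit status 0 means GCC parsed the text as valid C
        if printf '%s\n' "$LINE" | gcc -fsyntax-only -xc -; then
            # Delete by line number so regex characters in the text cannot break sed
            sed -i "$((n - removed))d" "$1"
            removed=$((removed + 1))
        fi
    fi
done < "$1"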

The approach I followed here was to read the code file line by line, grep the text after the // delimiter (if there is one) with the regex (?<=//).*, and pass that text to gcc -fsyntax-only to check whether it is a correct C/C++ statement. Notice that I've used the arguments -xc - to make GCC read its input from stdin (see my answer here to understand more). An important note: the c in -xc specifies the language, which is C in this case; if you want C++, change it to -xc++.

Then, if GCC was able to successfully parse the statement (i.e., it's a legitimate C/C++ statement), I directly remove it using sed -i from the file passed.

Running it on your example (but after removing <- commented code from the third line to make it a legitimate statement):

// Those parameters control foo and bar... <- valid comment
int t = 5;
// int t = 10;
int k = 2*t;

Output (in the same file):

// Those parameters control foo and bar... <- valid comment
int t = 5;
int k = 2*t;

(sed -i edits the file in place, so run the script on a copy if you want to keep the original untouched)

The script can be called like this: ./script.sh file.cpp. It may print several GCC errors; those come from the valid comments, which fail the syntax check.
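
Since the script deletes lines as it goes, it may be worth previewing what it would remove first. The following is only a sketch of such a dry run, assuming GCC is on the PATH; syntax_ok and list_candidates are hypothetical helper names that do not appear in the script above.

```shell
#!/bin/bash
# Dry run: print commented-out lines that pass a syntax check, without editing anything.

# Hypothetical helper: succeeds when GCC accepts the text as C.
# Change -xc to -xc++ to check C++ instead.
syntax_ok() {
    printf '%s\n' "$1" | gcc -fsyntax-only -xc - 2>/dev/null
}

# Hypothetical helper: prints file:line:text for every removal candidate.
list_candidates() {
    local file=$1 n=0 line code
    while IFS='' read -r line || [[ -n "$line" ]]; do
        n=$((n + 1))
        [[ "$line" == *//* ]] || continue        # no comment on this line
        code=${line#*//}                         # text after the first "//"
        [[ "$code" =~ [^[:space:]] ]] || continue  # skip empty comments
        if syntax_ok "$code"; then
            printf '%s:%d:%s\n' "$file" "$n" "$line"
        fi
    done < "$file"
}

if [[ $# -gt 0 ]]; then
    list_candidates "$1"
fi
```

Reviewing this list before running the destructive version makes it easier to spot comments that happen to compile but should stay.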

Update.

A simpler version of the same logic is:

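#!/bin/bash
n=0; removed=0   # current line number; lines deleted so far
while IFS='' read -r line || [[ -n "$line" ]]; do
    n=$((n + 1))
    if [[ "$line" == *//* ]]; then
        LINE=${line#*//}   # text after the first "//"
        # Delete the line (by its shifted number) only if GCC accepts the text
        printf '%s\n' "$LINE" | gcc -fsyntax-only -xc - &&
            sed -i "$((n - removed))d" "$1" && removed=$((removed + 1))
    fi
done < "$1"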

Max Langhof · Answer 2 · 2019-01-10T16:13:01.023

You could get most of the way there with some simple regular expression stuff. Basically, a line is most likely not code if:

It starts with some or no whitespace,
followed by //,
followed by some text that contains only whitespace, letters, numbers and basic punctuation,
and does not end with a ;.

You can write a regex for the above combination (or its inverse) and get an overview of how many actual candidates for removal there are. In 100k lines there are probably fewer than 1k lines that match this simple filter, and that's definitely in the "can go through it manually" range.
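
One possible way to express the four criteria, as a rough sketch only: the exact set of "basic punctuation" is an assumption made here, and is_probably_prose and list_suspects are hypothetical helper names.

```shell
#!/bin/bash
# Optional whitespace, then "//", then only letters, digits, whitespace
# and a small (assumed) set of basic punctuation up to the end of the line.
prose_re="^[[:space:]]*//[[:alnum:][:space:].,:;?'\"()-]*\$"
ends_semicolon_re=';[[:space:]]*$'
comment_re='^[[:space:]]*//'

# Succeeds when the line is most likely a natural-language comment.
is_probably_prose() {
    [[ "$1" =~ $prose_re ]] && ! [[ "$1" =~ $ends_semicolon_re ]]
}

# The inverse view: print line:text for comment-only lines that fail the filter.
list_suspects() {
    local n=0 line
    while IFS='' read -r line || [[ -n "$line" ]]; do
        n=$((n + 1))
        if [[ "$line" =~ $comment_re ]] && ! is_probably_prose "$line"; then
            printf '%d:%s\n' "$n" "$line"
        fi
    done < "$1"
}

if [[ $# -gt 0 ]]; then
    list_suspects "$1"
fi
```

The punctuation set is the part most worth tuning per code base: a narrow set flags more comments for review, a wide one lets more commented-out code slip through.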

I'd most likely start by grepping for lines matching ^\s*//.*;\s*$ (a comment-only line that ends in a semicolon), look through the results, and confirm that all of them can be deleted. The number of false positives here should be extremely low. Note that this won't catch multi-line statements that were commented out.
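
The grep step might look like the sketch below. It anchors the pattern at the start of the line and uses the POSIX class [[:space:]] for whitespace; find_commented_statements is a hypothetical helper name.

```shell
#!/bin/bash
# Print line-numbered comment-only lines that end in a semicolon.
# With several files, grep also prefixes each hit with the file name.
find_commented_statements() {
    grep -nE '^[[:space:]]*//.*;[[:space:]]*$' -- "$@"
}

if [[ $# -gt 0 ]]; then
    find_commented_statements "$@"
fi
```

Called as ./find.sh file.cpp (the script name is only an example), it lists candidates without touching the file, so the manual review comes before any deletion.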
