3

I'm looking for a way to search for a given term in a project's C/C++ code, while ignoring any occurrences in comments and strings.

As the code base is rather large, I am searching for a way to automatically identify the lines of code that match my search term, as they need manual inspection.

If possible, I'd like to perform the search on my Linux system.

background

The code base in question is a realtime signal processing engine with a large number of 3rd-party plugins. The plugins are implemented in a variety of languages (mostly C, but also C++ and others; currently I only care about those two), and no coding standards have been enforced.

Our code base currently uses the built-in type float for floating-point numbers, and we would like to replace it with a typedef that would allow us to use doubles. We would like to find all occurrences of float in the actual code (ignoring legitimate uses in comments and printouts).

What complicates things further is that there are some (albeit few) legitimate uses of float in the code payload, so we are really looking for a way to identify all places that require manual inspection, rather than to run some automatic search-and-replace.

The code also contains C-style static casts to (float), so relying on compiler warnings to identify type mismatches is often not an option.

The code base consists of more than 3000 (C and C++) files, totalling about 750000 lines of code.

The code is cross-platform (Linux, OS X and W32 being the main targets, but also FreeBSD and similar), and is compiled with the various native compilers (gcc/g++, clang/clang++, VisualStudio, ...).

so far...

So far I'm using something ugly like:

 grep "\bfloat\b" | sed -e 's|//.*||' -e 's|"[^"]*"||g' | grep "\bfloat\b"

But I'm thinking that there must be a better way to search only the payload code.

umläute
  • 1
    What is your code doing? What is its size? What compiler & platform? Please **edit your question** to improve it. – Basile Starynkevitch Feb 29 '16 at 13:30
  • Depending on how large your codebase is, I might just do it manually with emacs. Otherwise I might just replace all of them and fix the spurious comments later. I'm lazy, though. :) – erip Feb 29 '16 at 13:37
  • 1
    BTW C/C++ does not exist. A given translation unit is coded for C++ (then choose at least the C++11 standard) or for C (choose C11 if possible, or at least C99) – Basile Starynkevitch Feb 29 '16 at 13:40
  • @BasileStarynkevitch while i don't understand why it is of use to know the compilers used for compiling my C and C++ files, i have added that information. i am at a complete loss why the C-standard (or C++-standard) should be of any relevance. – umläute Feb 29 '16 at 14:46
  • 1
    Because if you use [GCC](http://gcc.gnu.org/) you can customize it with [MELT](http://gcc-melt.org/) (but if you use some other compiler you cannot use MELT). Also, there is no such language as C/C++; that was my point. Finally, if coding in C++ in 2016, I strongly suggest coding for C++11 at least (which is a very different language from its predecessors) – Basile Starynkevitch Feb 29 '16 at 14:48
  • @BasileStarynkevitch updated the question to hopefully answer those points; hopefully i could also clarify what i meant with "C/C++" – umläute Feb 29 '16 at 14:51
  • My feeling is that MELT is definitely worthwhile using in your case. But being the author of MELT, I am biased. – Basile Starynkevitch Feb 29 '16 at 14:56
  • Depends on your editor/IDE. GCC and VIM support "ctags", as do several linux IDEs. in VIM you can use `ctrl-]` or `g ctrl-]` to go to the symbol under the cursor. http://vim.wikia.com/wiki/Browsing_programs_with_tags – kfsone Feb 29 '16 at 22:09
  • Also, several IDEs have functionality to do exactly this, e.g. Visual Studio's Refactor>Rename. Personally I use Whole Tomato's Visual AssistX and I've done something like this a number of times on much larger code bases. – kfsone Feb 29 '16 at 22:11

3 Answers

5

IMHO there is a good answer to a similar question on "Unix & Linux":

grep works on pure text and does not know anything about the underlying syntax of your C program. Therefore, in order not to search inside comments, you have several options:

  1. Strip C comments before the search. You can do this using `gcc -fpreprocessed -dD -E yourfile.c`. For details, please see "Remove comments from C/C++ code".

  2. Write/use some hacky half-working scripts like you have already found (e.g. they work by skipping lines starting with // or /*) in order to handle the details of all possible C/C++ comments (again, see the previous link for some scary testcases). Then you still may have false positives, but you do not have to preprocess anything.

  3. Use more advanced tools for doing "semantic search" in the code. I have found "coccigrep": http://home.regit.org/software/coccigrep/ This kind of tool allows searching for specific language statements (e.g. an update of a structure with a given name), and it certainly drops the comments.

https://unix.stackexchange.com/a/33136/158220

Although it doesn't completely cover your "not in strings" requirement.
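For illustration, option 1 could be sketched roughly as follows. This is only a sketch under assumptions: it presumes gcc and GNU grep are installed, and `find_outside_comments` is a made-up helper name, not part of any tool. String literals are left untouched, so matches inside them are still reported.

```shell
#!/bin/sh
# Sketch only: find_outside_comments is a hypothetical helper name.
# Assumes gcc and GNU grep (for --label) are available.
#
# Usage: find_outside_comments TERM FILE...
find_outside_comments() {
    term=$1
    shift
    for file in "$@"; do
        # -fpreprocessed: treat the input as already preprocessed, which
        #   suppresses macro expansion but still removes comments.
        # -dD keeps #define lines; -P omits line markers in the output.
        gcc -fpreprocessed -dD -E -P "$file" |
        # -w matches TERM as a whole word only; -H with --label prefixes
        # each hit with the original file name instead of "(standard input)".
        grep -w -H --label="$file" -e "$term"
    done
}
```

It could be called as `find_outside_comments float src/*.c` (the path is just an example). Because blank lines may be dropped from the stripped output, no line numbers are printed; each hit is shown as `file:matching line`.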

g0hl1n
3

In practice it might depend on the size of your code base, and perhaps also on the editor you usually use. I suggest using GNU Emacs (if possible on Linux with a recent GCC compiler...).

For a small to medium sized code base (e.g. less than 300KLOC), I would suggest using the grep mode of Emacs. Then (assuming you have bound the next-error Emacs function to some key, perhaps with `(global-set-key [f10] 'next-error)` in your ~/.emacs...) you can quickly scan every occurrence of float (even inside strings or comments, but you'll skip such occurrences very quickly...). In a few hours you'll be done with a medium sized source code (and that is quicker than learning how to use a new tool).
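The command handed to grep mode might look roughly like the following. This is a sketch assuming GNU grep (`-r` and `--include` are GNU extensions); the helper name `grep_float` and the list of file extensions are illustrative choices only.

```shell
#!/bin/sh
# Sketch only: grep_float is a hypothetical helper name.
# Whole-word search for "float" in C and C++ sources below a directory
# (default: the current directory). Assumes GNU grep.
grep_float() {
    # -r recurse, -n print line numbers, -w match whole words only
    grep -rnw \
        --include='*.c' --include='*.h' \
        --include='*.cc' --include='*.cpp' --include='*.hpp' \
        -e float "${1:-.}"
}
```

The hits come out as `file:line:text`, which is the format that grep mode and next-error step through. Comments and strings are not filtered here; they are skipped by eye.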

For a large code base (millions of lines), it might be worthwhile to customize some static analysis tool or compiler. You could use GCC MELT to customize your GCC compiler on Linux. Its findgimple mode could be inspirational, and perhaps even useful (you probably want to find all Gimple assignments targeting a float).

BTW, you probably don't want to replace all occurrences of the float type with double (probably suitably typedef-ed...), but only most of them, because very probably you are using some external (or standard) functions requiring a float.

The CADNA tool might also be useful to help you estimate the precision of results (and so help you decide when using double is sensible).

Using semantic tools like GCC MELT, CADNA, Coccinelle, Frama-C (or perhaps Fluctuat, or Coccigrep mentioned in g0hl1n's answer) would give more precise or relevant results, at the expense of having to spend more time (perhaps days!) learning and customizing the tool.

Basile Starynkevitch
1

The robust way to do this should be with cscope (http://cscope.sourceforge.net/) in line-oriented mode, using the "Find this C symbol" option. I haven't used that on a variety of C standards, though, so if it doesn't work for you, or if you can't get cscope, then do this:

find . -type f -print |
while IFS= read -r file
do
    sed 's/a/aA/g; s/__/aB/g; s/#/aC/g' "$file" |
    gcc -P -E - |
    sed 's/aC/#/g; s/aB/__/g; s/aA/a/g' |
    awk -v file="$file" -v OFS=': ' '/\<float\>/{print file, $0}'
done

The first sed replaces every a, __ and hash (#) with a unique identifier string (a is escaped first so that the aB and aC markers stay unambiguous). That way the preprocessor doesn't do any expansion of #include, etc., but we can restore the originals after preprocessing.

The gcc step preprocesses the input to strip out comments.

The second sed reverses those substitutions, restoring the original #, __ and a characters.

The awk actually searches for float within word-boundaries and if found prints the file name plus the line it was found on. This uses GNU awk for word-boundaries \< and \>.

The 2nd sed's job COULD be done as part of the awk command but I like the symmetry of the 2 seds.

Unlike cscope, this sed/gcc/sed/awk approach will NOT avoid finding false matches within strings, but hopefully there are very few of those and you can weed them out while post-processing manually anyway.
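If matches inside strings turn out to be a nuisance, one possible addition (not part of the original pipeline) is an extra filter that empties string literals before the search. A minimal sketch, assuming GNU sed (for `\|`); `blank_strings` is a made-up name:

```shell
#!/bin/sh
# Sketch only: blank_strings is a hypothetical helper name. Assumes GNU sed.
# Reads stdin and replaces every double-quoted literal on a line with "",
# so words inside it can no longer match. Escaped quotes (\") inside a
# literal are handled. Literals that span several lines and character
# literals such as '"' are NOT handled and can confuse the filter.
blank_strings() {
    sed 's/"\([^"\\]\|\\.\)*"/""/g'
}
```

It would slot in just before the awk step, e.g. `... | sed 's/aC/#/g; s/aB/__/g; s/aA/a/g' | blank_strings | awk ...`. It has to run after comments are stripped, since an apostrophe or quote inside a comment would otherwise throw it off.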

It will not work for file names that contain newlines - if you have those you can put the body in a script and execute it as find .. -print0 | xargs -0 script.

Modify the gcc command line by adding whatever C or C++ version you are using, e.g. -ansi.

Ed Morton
  • One caveat is that this does not prevent expansion of builtin macros with a single underscore (e.g. `#define _LP64`) or that don't have underscores (e.g. `#define linux`) – hugomg May 01 '22 at 11:52
  • @hugomg yes it does, see the `s/#/aC/g` in the first sed script that would convert `#define anything` to `aCdefine anything` before `gcc` sees it (or just try it). – Ed Morton May 01 '22 at 11:54
  • Not in this case because `_LP64` and `linux` are builtin macros that the preprocessor always expands. Sorry, it would probably have been clearer if I wrote the comment without the `#define` bit, because there is no `#define` here. – hugomg May 01 '22 at 12:01
  • 1
    @hugomg Now I understand. I just found a reference to such apparently system-specific macros at https://gcc.gnu.org/onlinedocs/cpp/System-specific-Predefined-Macros.html#System-specific-Predefined-Macros which appears to be saying don't use those macros that don't start with 2 underscores (`We are slowly phasing out all predefined macros which are outside the reserved namespace. You should never use them in new programs`) and use the `__` equivalents instead, i.e. `__LP64__` and `__linux__` in this case. – Ed Morton May 01 '22 at 12:09