
Background

I have inherited a legacy 60kloc g++ project which I would like to refactor to enforce a consistent naming convention throughout the project.

Question

Is there a free/open-source static analysis tool which can generate a list of:

  • global symbols
  • class names
  • member methods (public/protected/private, if possible)
  • member variables
  • static methods
  • local symbols (will probably ignore these)
  • any other symbols I may have missed, but may impact a reader of code

Approach

My intention is to use vim to edit the generated list of symbols and then use a Ruby script to do a very rough search-and-replace mapping on the symbols so that, at the very least, naming conventions are consistent.

The procedure is a little ugly and I expect the initial compile to fail, but I don't mind going through and fixing problems by hand if I end up with a more readable set of code.
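For illustration, the rough search-and-replace mapping described above might look like the sketch below. It is written in Python rather than Ruby, and the function name `apply_rename_map` and the sample identifiers are made up for this example. It only does whole-word matching, so it will also rename matches inside comments and string literals, which is part of why the first compile is expected to fail.

```python
import re


def apply_rename_map(source, rename_map):
    """Replace whole-word occurrences of each old name with its new name.

    This is deliberately naive: it knows nothing about scopes, comments,
    or string literals.
    """
    if not rename_map:
        return source
    # \b on both sides keeps `count` from matching inside `account`.
    alternatives = "|".join(re.escape(old) for old in rename_map)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    # A single pass means swaps such as a -> b, b -> a do not clobber each other.
    return pattern.sub(lambda m: rename_map[m.group(1)], source)
```

Doing all renames in one pass, rather than one `sub` per symbol, avoids renaming a symbol twice when one new name happens to equal another old name.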

What kinds of tools do developers of large C++ code bases use to do this kind of refactoring?

kfmfe04
  • Hackish workaround: compile, then `nm -gC`? –  Jun 16 '13 at 06:06
  • @H2CO3 +1 interesting - unfortunately, in my case, that generates 21klines of raw result. I will revisit that approach if nothing better comes up... (with filtering of the raw results - nm spits out much more info than I need prolly) - use of templates + a C++ reflection library probably explains some of that symbol explosion... – kfmfe04 Jun 16 '13 at 06:09
  • If that's 21klines, and the overall project is 60klines, it means that you have a class or function definition on every 3rd line. Weird, but not impossible, in my opinion. –  Jun 16 '13 at 06:10
  • @H2CO3 if I use nm --demangle and grep on my_namespace, I'm down to 10klines - will see if I can filter some more. fwiw, in that other 11klines, there was a lot of boost and other 3rd party libraries... – kfmfe04 Jun 16 '13 at 06:25
  • If you were using VS2012, there is a built-in refactoring tool available with VS. – shivakumar Jun 16 '13 at 06:59
  • If what you want is a way to generate a list of all symbols, and a way to rename any or all of the symbols consistently, I have an answer, but it is a bit surprising, and it is not free or open source but I think it will do the job. Do you want me to answer with this solution? – Ira Baxter Jun 16 '13 at 08:20
  • @IraBaxter sure: if perchance, it turns out to be too expensive or not useful for me, it might still be useful for others – kfmfe04 Jun 16 '13 at 08:32
  • @kfmfe04 Although I haven't tried it myself, [it is claimed](http://stackoverflow.com/a/13840863/341970) that you can do this with Clang and the tools built on it. – Ali Jun 16 '13 at 08:53
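The `nm --demangle` plus `grep` filtering discussed in the comments could be scripted roughly as follows. This is only a sketch: it assumes the usual `nm` line layout (optional hex address, one-letter symbol type, demangled name), and `symbols_in_namespace` and the namespace name are hypothetical. Cutting at the first `(` collapses overloads into one entry, which is usually what a rename list wants, but it will mangle names such as `operator()`.

```python
import re

# One line of nm output: optional hex address, one-character type, symbol name.
NM_LINE = re.compile(r"^\s*(?:[0-9a-fA-F]{8,16}\s+)?(\S)\s+(.+)$")


def symbols_in_namespace(nm_output, namespace):
    """Return the sorted, de-duplicated symbols that live in `namespace`."""
    prefix = namespace + "::"
    found = set()
    for line in nm_output.splitlines():
        match = NM_LINE.match(line)
        if not match:
            continue
        name = match.group(2)
        if name.startswith(prefix):
            # Drop the parameter list so overloads collapse to one entry.
            found.add(name.split("(", 1)[0])
    return sorted(found)
```

The input would be the captured text of something like `nm -gC` run on the built binary; third-party symbols (boost and friends) fall away because they do not start with the project namespace.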

2 Answers


Automatic refactoring of C++ is extremely hard, partly because of the preprocessor (macros and file inclusion), but mostly because of the interdependency between parsing, name lookup and the rest of the semantic analysis phase (template instantiation, constant expressions, overload resolution, etc.). On the very large C++ codebases I have worked on, automatic refactoring is simply not done, and because of the inherent difficulty, the quality of refactoring tools is poor.

Since the emergence of clang though, which specifically has a modular front-end so you can access the AST in a nicer way than other tools, there may be some better refactoring tools based on it - but I wouldn't hold my breath.

Take a look at the AST dump from clang. Perhaps you can write a script on the XML to give you a dump that might form a starting point for refactoring it by hand.
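As a rough idea of what such a script could look like: the sketch below assumes a JSON dump rather than XML, as I believe recent clang releases can emit with something like `clang++ -Xclang -ast-dump=json -fsyntax-only file.cpp` (check your clang version; this is an assumption, not something from the answer above). It relies only on the `kind`, `name` and `inner` fields of each node, and `collect_declarations` is a made-up helper name.

```python
import json

# Declaration kinds worth listing for a renaming pass (an illustrative subset).
INTERESTING_KINDS = {
    "NamespaceDecl",
    "CXXRecordDecl",
    "CXXMethodDecl",
    "FieldDecl",
    "FunctionDecl",
    "VarDecl",
}


def collect_declarations(root):
    """Walk an AST dict and return (kind, name) pairs in document order."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        kind = node.get("kind")
        name = node.get("name")
        if kind in INTERESTING_KINDS and name:
            found.append((kind, name))
        # Push children reversed so they are popped in their original order.
        stack.extend(reversed(node.get("inner", [])))
    return found


def load_ast(path):
    """Read a JSON AST dump that was saved to a file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

The dump covers every included header as well, so in practice the result would still need filtering down to the project's own files before it is usable as a rename list.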

user1741137
Andrew Tomazos

OP wants to do mass renaming, i.e., generate a list of names, and then rename many of them across a big source code base.

A refactoring tool that is good at this would be one choice, if OP can find one.

A strange but perhaps effective alternative: a C++ source code obfuscation tool.

Our company offers one of these that does the following (yes, this will seem wrong for the task!):

  • strips comments
  • damages formatting
  • replaces identifiers consistently with scrambled names (seed of the answer!)
  • builds an identifier map (a list of "identifier -> scrambled_identifier" entries) covering all identifiers.

This process is applied to files without preprocessing.

So, in effect, it is a mass renaming tool. And renaming to bad names is its purpose, but it can be abused to rename to good names.

In fact, what it accepts as an input is an identifier map (possibly empty, certainly on the first run, usually taken from successive obfuscation runs), and it renames identifiers it finds in that map according to the map, and identifiers it doesn't find with new scrambled names.

If you give it a full map, you have full control over the names it renames-to.

So, to use it for mass renaming, the following process should work:

  1. Run the obfuscator. Get the identifier map. Throw the resulting source text away.
  2. Revise the identifier map to be "identifier -> identifier". This is a 30 second task with a decent editor like Emacs. If one uses this revised map unchanged, the obfuscator renames every symbol to itself, i.e., nothing gets renamed. Replacing "identifier -> foo" with just "identifier" is treated as "identifier -> identifier" by the tool.
  3. (Sort then) review the identifier list. Choose new names for some of the identifiers. Revise the list accordingly: "bad_identifier_1 -> better_identifier_1"
  4. Re-run the obfuscator, using the revised map. Your bad_identifiers will get replaced.
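Steps 2 and 3 can also be scripted instead of done in an editor. The sketch below assumes the map is a plain text file with one `old -> new` entry per line; that layout is a guess based on the description above, not the tool's documented format, and all function names are hypothetical.

```python
def parse_map(text):
    """Parse 'old -> new' lines; a bare identifier maps to itself."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        old, arrow, new = line.partition("->")
        old = old.strip()
        mapping[old] = new.strip() if arrow else old
    return mapping


def to_identity(mapping):
    """Step 2: make every identifier rename to itself."""
    return {old: old for old in mapping}


def format_map(mapping):
    """Serialize the map, sorted by identifier so it is easy to review."""
    return "\n".join(f"{old} -> {new}" for old, new in sorted(mapping.items()))
```

Step 3 is then just a matter of overwriting selected entries, e.g. `renames["bad_identifier_1"] = "better_identifier_1"`, before writing the map back out for the second run.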

Oops, what about comments and formatting :-?

Well, there's a command line switch that in essence says "don't throw the comments away". As far as formatting is concerned, the obfuscator remarkably includes a source code formatter. Just run it a second time as a formatter. Voila, renamed-code with pretty format.

Caveats:

  • the formatter cannot handle some badly-placed preprocessor conditionals; most C++ code doesn't have these, and what there is can usually be fixed with a one-line edit.
  • the obfuscator does not distinguish scopes. Given I -> J, it will rename all I instances to J.
  • the obfuscator won't detect stupid renamings. If you rename I -> J and also rename K -> J, and that renaming damages your program, the obfuscator won't tell you. (That renaming may work; it depends on your code and where I and K are used.) This is easily avoided: don't produce a map in which the same renamed-to name appears more than once. This means you should not rename identifiers which appear in system include files; you can rename identifiers that appear in your application's include files.
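Since the tool won't flag the I -> J, K -> J case, a small check on the map before the second run can catch it. A minimal sketch, assuming the map has already been loaded into a dict of old name to new name (`find_collisions` is a hypothetical helper, not part of the tool):

```python
from collections import defaultdict


def find_collisions(mapping):
    """Return {new_name: [old names...]} for every target used more than once."""
    sources_by_target = defaultdict(list)
    for old, new in mapping.items():
        sources_by_target[new].append(old)
    return {
        new: sorted(olds)
        for new, olds in sources_by_target.items()
        if len(olds) > 1
    }
```

Because the full map includes the identity entries, this also reports a rename onto a name that already exists: with `I -> J` and `J -> J` both present, `J` shows up as a collision.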

If there was enough interest, minor changes on our part could preserve formatting and comments directly.

The nice thing about this klunky process is that you can experiment with getting the set of renames right; you only need to keep the final "obfuscated/formatted" result. You can of course rename sets of things in groups by running this process once per stage. I highly recommend recompiling after each cycle :-}

You can use this process to rename one identifier at a time, but I think a regular editor would serve you pretty well for that.

If OP just wanted the list of names, he could obviously stop after the first obfuscation pass and run away with the identifier map.

No, it isn't a regexp-replace-string hack; it uses a full C++11 lexer so it is not confused by contents of string literals or comments. The formatter part actually uses a full C++(11) parser.
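To show why lexing matters, here is a toy version of the idea. It is not how the tool is implemented and is nowhere near a full C++11 lexer (raw string literals and preprocessor oddities are not handled, for instance); it only recognizes comments, string and character literals, numbers, and identifiers, and renames the identifiers. `rename_identifiers` is a made-up name.

```python
import re

# Comments, literals and numbers are matched first so that identifier-like
# text inside them is skipped instead of renamed.
TOKEN = re.compile(
    r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<string>"(?:\\.|[^"\\\n])*")
  | (?P<char>'(?:\\.|[^'\\\n])*')
  | (?P<number>\.?\d(?:[eEpP][+-]|[\w.'])*)
  | (?P<ident>[A-Za-z_]\w*)
    """,
    re.VERBOSE | re.DOTALL,
)


def rename_identifiers(source, rename_map):
    """Rename identifiers per `rename_map`, leaving comments and literals alone."""

    def replace(match):
        text = match.group()
        if match.lastgroup == "ident":
            return rename_map.get(text, text)
        return text

    return TOKEN.sub(replace, source)
```

Like the obfuscator, this does not distinguish scopes: every identifier token spelled `I` becomes `J`, wherever it appears.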

Ira Baxter