
I'm looking for a tool to statically generate a call graph of the Linux kernel (for a given kernel configuration). The generated call graph should be "complete", in the sense that all calls are included, including potential indirect ones, which in the case of the Linux kernel we can assume are made only through function pointers.

For instance, this could be done by analyzing the function pointer types: this approach would lead to superfluous edges in the graph, but that's ok for me.
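To make the over-approximation concrete, here is a minimal, hypothetical C sketch (all names are made up): a call through a pointer of type handler_t gets an edge to every function whose prototype matches handler_t, even functions that are never actually stored in such a pointer.

    #include <stdio.h>

    typedef int (*handler_t)(int);

    static int handle_read(int fd)  { return fd; }
    static int handle_write(int fd) { return fd + 1; }   /* never stored in a handler_t */

    static int dispatch(handler_t h, int fd)
    {
        /* A type-based analysis adds edges from here to BOTH handlers,
         * because both have the signature int (*)(int). */
        return h(fd);
    }

    int main(void)
    {
        printf("%d\n", dispatch(handle_read, 3));   /* only handle_read is ever passed */
        (void)handle_write;                         /* silence the unused-function warning */
        return 0;
    }

The edge dispatch -> handle_write is the superfluous kind mentioned above; an analysis that tracked actual assignments would not add it.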

ncc seems to implement this idea; however, I didn't succeed in making it work on the 3.0 kernel. Any other suggestions?

I'm guessing this approach could also lead to missing edges in cases where function pointer casts are used, so I'd also be interested in knowing whether this is likely in the Linux kernel.
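For illustration, here is a small hypothetical example (names made up) of how a cast can hide the real target from a purely type-based matcher: the called function never appears in the program with the pointer's declared type, so matching on declared types alone would drop the edge. (Calling through an incompatible pointer type is technically undefined behavior, but the pattern does occur in real C code.)

    #include <stdio.h>

    typedef void (*cb_t)(void *);

    static void real_handler(int *p) { printf("%d\n", *p); }

    static void run(cb_t cb, void *arg)
    {
        /* Actually calls real_handler, whose prototype never matches cb_t,
         * so a matcher keyed on the declared pointer type misses this edge. */
        cb(arg);
    }

    int main(void)
    {
        int x = 42;
        run((cb_t)real_handler, &x);   /* the cast hides the true signature */
        return 0;
    }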

As a side note, there seem to be other tools that are able to do semantic analysis of the source to infer potential pointer values, but AFAICT none of them is designed to be used on a project such as the Linux kernel.

Any help would be much appreciated.

addalbx
  • You didn't succeed in making it work? How much effort did you invest? In my answer below, we analyzed a huge system. It took us several man-months to reliably capture the right sources, the compilation data, and to process it. Do you think doing this for Linux should be a lot faster the first time you try? – Ira Baxter Feb 27 '12 at 23:24
  • Well, I simply followed to the letter the use case example for the Linux kernel that is provided in the ncc documentation. This resulted in an error that I am still working on trying to figure out. – addalbx Feb 27 '12 at 23:27
  • OK, so NCC supposedly comes set up to process Linux. Then your time should be a lot shorter since somebody else has done all this work. Of course, such work is not likely to be stable; the Linux kernel moves. Have you contacted the NCC builders and asked them about this? – Ira Baxter Feb 28 '12 at 00:16
  • Yes, exactly. No reply so far from them, that's why I am looking into other possibilities as well. – addalbx Feb 28 '12 at 08:38
  • This appears to be an academic project, and you appear to want real data from a real system. Academics aren't rewarded for building such tools and have a pretty bad track record for this (for good reasons). Maybe this is a great tool, and it handles all the complexities of real systems. It wouldn't be a bet I'd make. (Yes, I'm biased; I built a commercial company precisely to avoid those issues with real tools.) – Ira Baxter Feb 28 '12 at 17:37

1 Answer


We've done global points-to analysis (with indirect function pointers) and full call graph construction of monolithic C systems of 26 million lines (18,000 compilation units).

We did it using our DMS Software Reengineering Toolkit, its C Front End and its associated flow analysis machinery. The points-to analysis machinery (and the other analyses) are conservative; yes, you get some bogus points-to and therefore call edges as a consequence. These are pretty hard to avoid. You can help such analyzers by providing certain crucial facts about key functions, and by harnessing knowledge such as "embedded systems [and OSes] tend not to have cycles in the call graph", which means you can eliminate some of these. Of course, you have to allow for exceptions; my moral: "in big systems, everything happens."

That particular system also included dynamically loaded(!) C modules using a loading scheme specific to that software, which only added to the difficulty.

Casts on function pointers shouldn't lose edges; a conservative analysis should simply assume that the cast pointer matches any function in the system with a signature corresponding to the cast result. More problematic are casts which produce sort-of-compatible signatures: if you cast a function pointer to void* foo(uint) when the actual function being called accepts an int, the points-to analysis will necessarily, conservatively, choose the wrong functions. You can't blame the analyzer for that; the cast lies in that case. Yes, we saw this kind of trash in the 26 million line system.
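A hypothetical sketch (made-up names) of that "lying cast": the call site's declared type is void *(*)(unsigned int), so a conservative analyzer resolves the call to functions with that signature rather than to takes_int, which is what actually runs.

    #include <stdio.h>

    typedef void *(*uint_fn)(unsigned int);

    static void *takes_int(int v)           { printf("int %d\n", v);  return NULL; }
    static void *takes_uint(unsigned int v) { printf("uint %u\n", v); return NULL; }

    int main(void)
    {
        uint_fn fp = (uint_fn)takes_int;   /* the cast lies: int vs. unsigned int */
        fp(7u);                            /* analyzer pairs this call with takes_uint-style functions */
        (void)takes_uint;
        return 0;
    }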

This is certainly the right scale for analyzing Linux (which I think is a mere 8 million lines or so :-). But we haven't tried it specifically on Linux.

Setting up this tool is complicated because you have to capture all the details about the compilations themselves, and in particular the configuration of the Linux kernel you want to analyze. So you pretty much have to intercept the compiler calls to get the command-line switches, etc.
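One low-tech way to do that interception, shown purely as a hypothetical sketch (not how DMS necessarily does it; the REAL_CC and CAPTURE_LOG environment variables are invented for this example), is to put a small wrapper named gcc ahead of the real compiler in PATH, log each command line, and then exec the real compiler unchanged:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *real_cc = getenv("REAL_CC");       /* e.g. /usr/bin/gcc */
        const char *logfile = getenv("CAPTURE_LOG");   /* e.g. /tmp/cc-commands.log */

        if (logfile) {
            FILE *f = fopen(logfile, "a");
            if (f) {
                for (int i = 0; i < argc; i++)
                    fprintf(f, i ? " %s" : "%s", argv[i]);   /* record the full command line */
                fputc('\n', f);
                fclose(f);
            }
        }

        /* Forward the compilation, unmodified, to the real compiler. */
        argv[0] = (char *)(real_cc ? real_cc : "/usr/bin/gcc");
        execv(argv[0], argv);
        perror("execv");
        return 1;
    }

Run the normal kernel make for the configuration of interest with this wrapper first in PATH, and the log ends up containing the per-file switches, include paths, and -D defines that an analysis front end needs to replay.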

Ira Baxter
  • Unfortunately, I would like to refrain from using commercial tools. Thank you for your help. – addalbx Mar 30 '12 at 08:53
  • @addalbx: Did you ever find an answer? (I see you tried LLVM, too). – Ira Baxter May 31 '12 at 03:53
  • I gave up, although I had mild success. Things actually tend to roughly work, even with ncc; the main obstacle is the kernel's gcc-isms, and it requires a lot of fiddling around... Correct me if I'm wrong, but it wouldn't be too different with your software, would it? – addalbx Sep 12 '12 at 15:07
  • It is a PITA in general to extract flow analysis from a complex C application, and Linux is well towards the top of the scale. Mostly you have to capture the command-line build data, often hiding in Make scripts, and that often takes some effort. GCC-isms we handle pretty well; we have C dialects for ANSI89, GCC2/3/4, and C99 that each understand the differences between the dialects. Yes, it's fiddly to get it set up. I think we do better than "roughly works", but that's an opinion you have to validate on your own :-{ – Ira Baxter Sep 12 '12 at 16:58