
I am trying to generate call graphs for CUDA code using egypt, but the usual approach doesn't work, since nvcc has no flag equivalent to gcc's -fdump-rtl-expand.

More details:

I have a very large code base (of which I am not the author) spanning multiple .cu files, and a call graph would make it much easier for me to understand what it does.

I bet that an answer to this question would be of use to other people as well.

Any ideas on how this can be done with CUDA (.cu) files?

einpoklum
Kostis
  • Do you mean call graphs for the host code or the device code? – talonmies Jul 11 '12 at 05:36
  • This is just a thought, but maybe you could place your kernels in a separate source file, with your C code calling wrapper functions instead. This would allow you to compile most of your code with GCC, and then use NVCC only for compiling your kernels. Unless you're trying to look at a call graph for the runtime or driver API, in which case I'm not sure. – sj755 Jul 11 '12 at 06:31
  • I am more interested in the device part of the code. @sj755, thanks, I will see what I can do. – Kostis Jul 11 '12 at 15:46
  • @Konstantinos Sorry if my advice wasn't really relevant to what you're doing. However, if you're trying to figure out the internal mechanisms of the standard CUDA functions and constructs (e.g. cudaMalloc, events, kernel calls), I would take a look at the driver API. The CUDA C Programming Guide's description of it isn't too extensive, but it does roughly cover what is happening at a lower level. – sj755 Jul 12 '12 at 05:43

1 Answer


You can do this with the CUDA support of clang 3.8.

First, compile your CUDA code to emit LLVM IR (example on Windows with CUDA 7.5 installed):

clang++ -c -emit-llvm main.cu --cuda-gpu-arch=sm_35 -o main.ll -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include"

Then, use the generated ll file to build the call graph with opt:

opt.exe main.ll -analyze -dot-callgraph

Note that opt is not part of the default binary distribution, so you may need to build it yourself (I had a 3.7.1 build, and it was able to handle the ll generated by 3.8).

Example main.cu file:

#include <cuda_runtime.h>

// Simple device functions to exercise the call graph.
__device__ int f() { return 1; }
__device__ float g(float* a) { return a[f()]; }
__device__ float h() { return 42.0f; }

__global__ void kernel(int a, float* b)
{
    int c = a + f();
    g(b);
    b[c] = h();
}

Generated dot file:

digraph "Call graph" {
        label="Call graph";

        Node0x1e3d438 [shape=record,label="{external node}"];
        Node0x1e3d438 -> Node0x1e3cfb0;
        Node0x1e3d438 -> Node0x1e3ce48;
        Node0x1e3d438 -> Node0x1e3d0a0;
        Node0x1e3d438 -> Node0x1e3d258;
        Node0x1e3d438 -> Node0x1e3cfd8;
        Node0x1e3d438 -> Node0x1e3ce98;
        Node0x1e3d438 -> Node0x1e3d000;
        Node0x1e3d438 -> Node0x1e3cee8;
        Node0x1e3d438 -> Node0x1e3d078;
        Node0x1e3d000 [shape=record,label="{__cuda_module_ctor}"];
        Node0x1e3d000 -> Node0x1e3ce98;
        Node0x1e3d000 -> Node0x1e3d168;
        Node0x1e3d078 [shape=record,label="{__cuda_module_dtor}"];
        Node0x1e3d078 -> Node0x1e3cee8;
        Node0x1e3cfb0 [shape=record,label="{^A?f@@YAHXZ}"];
        Node0x1e3d0a0 [shape=record,label="{^A?h@@YAMXZ}"];
        Node0x1e3ce48 [shape=record,label="{^A?g@@YAMPEAM@Z}"];
        Node0x1e3ce48 -> Node0x1e3cfb0;
        Node0x1e3d258 [shape=record,label="{^A?kernel@@YAXHPEAM@Z}"];
        Node0x1e3d258 -> Node0x1e3cfb0;
        Node0x1e3d258 -> Node0x1e3ce48;
        Node0x1e3d258 -> Node0x1e3d0a0;
        Node0x1e3d168 [shape=record,label="{__cuda_register_kernels}"];
        Node0x1e3cee8 [shape=record,label="{__cudaUnregisterFatBinary}"];
        Node0x1e3cee8 -> Node0x1e3d528;
        Node0x1e3cfd8 [shape=record,label="{__cudaRegisterFunction}"];
        Node0x1e3cfd8 -> Node0x1e3d528;
        Node0x1e3ce98 [shape=record,label="{__cudaRegisterFatBinary}"];
        Node0x1e3ce98 -> Node0x1e3d528;
}
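The dot file refers to functions through opaque Node ids and, on Windows, mangled labels, which makes larger graphs hard to read. As a post-processing step, a short script can resolve each id to its label and print a plain caller -> callee list. This is a minimal sketch in Python (not part of the original answer; the regexes assume opt's output format shown above, and the embedded sample is a trimmed version of that dot file with labels shortened to the demangled names for readability):

```python
import re

# Trimmed sample of opt -dot-callgraph output (labels simplified;
# real output contains mangled names such as ?f@@YAHXZ).
DOT = """
digraph "Call graph" {
        Node0x1e3cfb0 [shape=record,label="{f}"];
        Node0x1e3ce48 [shape=record,label="{g}"];
        Node0x1e3d0a0 [shape=record,label="{h}"];
        Node0x1e3d258 [shape=record,label="{kernel}"];
        Node0x1e3ce48 -> Node0x1e3cfb0;
        Node0x1e3d258 -> Node0x1e3cfb0;
        Node0x1e3d258 -> Node0x1e3ce48;
        Node0x1e3d258 -> Node0x1e3d0a0;
}
"""

def readable_edges(dot_text):
    # Map each Node id to the label inside {...}.
    labels = dict(re.findall(
        r'(Node0x[0-9a-f]+) \[shape=record,label="\{([^}]*)\}"\]', dot_text))
    # Rewrite each edge using the labels; ids without a label are kept as-is.
    edges = re.findall(r'(Node0x[0-9a-f]+) -> (Node0x[0-9a-f]+)', dot_text)
    return [(labels.get(a, a), labels.get(b, b)) for a, b in edges]

for caller, callee in readable_edges(DOT):
    print(f"{caller} -> {callee}")
# Prints:
# g -> f
# kernel -> f
# kernel -> g
# kernel -> h
```

On real output the labels will be mangled; piping them through a demangler (undname on Windows, c++filt for the Itanium ABI) makes them readable. The original dot file can also be rendered directly with Graphviz, e.g. dot -Tpng callgraph.dot -o callgraph.png.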
Florent DUGUET