As the question states, I would like to know whether there is any way to obtain more detailed profiling data for a particular CUDA function inside a kernel. I have developed a rather complicated implementation of the Miller-Rabin algorithm with arbitrary large integers that needs to be fine-tuned, and I have already found some "suspect code":
/* Iterate over bits of t from bit-position D-2 to s (both inclusive).
   Note that t is effectively just n with special treatment, namely we skip the
   s least-significant bits as stated previously (refer to the initialization section) */
for (int j = SIZ(n) * CUMP_LIMB_BITS - 2; j >= s; j--) {
    /* b_helper = b**2 mod n */
    cumpz_mulmod(b_helper, b, b, n);
    /* Swap b with b_helper */
    cumpz_swap(b, b_helper);
    if (cump_tstbit(PTR(n), j)) {
        /* b_helper = b * a mod n */
        cumpz_mulmod(b_helper, b, a, n);
        /* Swap b with b_helper */
        cumpz_swap(b, b_helper);
    }
}
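So far the only workaround I have tried is crude manual instrumentation with clock64() and atomic counters, roughly like the sketch below (d_mulmod_cycles and d_mulmod_calls are hypothetical counters I add myself, they are not part of my code or the library):

/* Hypothetical device-side counters, zeroed from the host before the kernel launch */
__device__ unsigned long long d_mulmod_cycles = 0;
__device__ unsigned long long d_mulmod_calls  = 0;

/* Inside the loop body, wrap the call I want to measure: */
long long t0 = clock64();
/* b_helper = b**2 mod n */
cumpz_mulmod(b_helper, b, b, n);
atomicAdd(&d_mulmod_cycles, (unsigned long long)(clock64() - t0));
atomicAdd(&d_mulmod_calls, 1ULL);

This is intrusive, counts per-thread SM cycles aggregated over all threads, and has to be repeated for every call site, so it is far from what I actually want.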
How can I get things like the total number of calls, the overall time spent in cumpz_mulmod, the time per call, etc.? Note that cumpz_mulmod calls other functions inside, e.g. cumpz_add, which in turn calls other functions, and so on. Ideally I would like to get the same information as from gprof, such as a call graph, but from what I have researched so far this does not seem to be possible at all, since all __device__ functions are inlined. I found that nvprof is only useful for profiling the kernel as a whole.
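With the crude counters from the sketch above I can at least read totals back on the host, along these lines (again hypothetical names, and converting cycles to time also depends on the SM clock rate):

/* Host side, after the kernel has finished (e.g. after cudaDeviceSynchronize()) */
unsigned long long cycles = 0, calls = 0;
cudaMemcpyFromSymbol(&cycles, d_mulmod_cycles, sizeof(cycles));
cudaMemcpyFromSymbol(&calls,  d_mulmod_calls,  sizeof(calls));
printf("cumpz_mulmod: %llu calls, %.1f cycles per call\n",
       calls, calls ? (double)cycles / (double)calls : 0.0);

But this perturbs the timing, gives no call graph, and tells me nothing about the functions cumpz_mulmod calls internally, which is why I am asking whether a profiler can do better.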