As the question states, I would like to know whether there is any way to obtain more detailed profiling data for a particular CUDA function inside a kernel. I have developed a rather complicated implementation of the Miller-Rabin algorithm with arbitrary large integers that needs to be fine-tuned, and I have already found some "suspect code":
/* Iterate over bits of t from bit-position D-2 to s (both inclusive).
   Note that t is effectively just n with special treatment, namely we skip the
   s least-significant bits as stated previously (refer to the initialization section) */
for (int j = SIZ(n) * CUMP_LIMB_BITS - 2; j >= s; j--) {
    /* b_helper = b**2 mod n */
    cumpz_mulmod(b_helper, b, b, n);
    /* Swap b with b_helper */
    cumpz_swap(b, b_helper);
    if (cump_tstbit(PTR(n), j)) {
        /* b_helper = b * a mod n */
        cumpz_mulmod(b_helper, b, a, n);
        /* Swap b with b_helper */
        cumpz_swap(b, b_helper);
    }
}
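So far the only workaround I have tried is crude manual instrumentation with clock64() and atomic counters, roughly like the sketch below (d_mulmod_cycles and d_mulmod_calls are hypothetical counters I add myself, they are not part of my code or the library):

/* Hypothetical device-side counters, zeroed from the host before the kernel launch */
__device__ unsigned long long d_mulmod_cycles = 0;
__device__ unsigned long long d_mulmod_calls  = 0;

/* Inside the loop body, wrap the call I want to measure: */
long long t0 = clock64();
/* b_helper = b**2 mod n */
cumpz_mulmod(b_helper, b, b, n);
atomicAdd(&d_mulmod_cycles, (unsigned long long)(clock64() - t0));
atomicAdd(&d_mulmod_calls, 1ULL);

This is intrusive, counts per-thread SM cycles aggregated over all threads, and has to be repeated for every call site, so it is far from what I actually want.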
How can I get things like the total number of calls, the overall time spent in cumpz_mulmod, the time per call, etc.? Note that cumpz_mulmod calls other functions inside, e.g. cumpz_add, which in turn calls other functions, and so on. Ideally I would like to get the same information as from gprof, such as a call graph, but from what I have researched so far this does not seem to be possible at all, since all __device__ functions are inlined. I found that nvprof is only useful for profiling the kernel as a whole.
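With the crude counters from the sketch above I can at least read totals back on the host, along these lines (again hypothetical names, and converting cycles to time also depends on the SM clock rate):

/* Host side, after the kernel has finished (e.g. after cudaDeviceSynchronize()) */
unsigned long long cycles = 0, calls = 0;
cudaMemcpyFromSymbol(&cycles, d_mulmod_cycles, sizeof(cycles));
cudaMemcpyFromSymbol(&calls,  d_mulmod_calls,  sizeof(calls));
printf("cumpz_mulmod: %llu calls, %.1f cycles per call\n",
       calls, calls ? (double)cycles / (double)calls : 0.0);

But this perturbs the timing, gives no call graph, and tells me nothing about the functions cumpz_mulmod calls internally, which is why I am asking whether a profiler can do better.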