0

I implemented some CUDA code. It runs fine, but the algorithm inherently produces strong thread divergence. This is expected.

I will try to reduce the divergence later. But for the moment I would be happy just to be able to measure it.

Is there an easy way (preferably a runtime API call or a CLI tool) to check how many of my initially scheduled warps and/or threads are still active?

Silicomancer
  • Besides the official tools (for which others may answer soon), you can find out the multiprocessor (and even its sub-partition) your warps are running on, or which threads are currently active inside a warp. With a few selected atomic accesses you can log the order in which they are scheduled and end. There is a (not fully synchronized) cycle clock on each multiprocessor. If you distribute your kernel calls over several streams, you can add CPU callbacks that fire when a specific kernel call has finished. (A sketch of this logging idea follows after these comments.) – Sebastian Jan 26 '22 at 16:56
  • Look here and read the special registers with small assembly blocks (there are examples online): https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers A CTA (Cooperative Thread Array) is effectively the same as a block. – Sebastian Jan 26 '22 at 16:56
  • There is no runtime API call or CLI tool to "check how many of my initially scheduled warps and/or threads are still active". This isn't really the domain of any of the profilers either. (They can give statistical measures of warp divergence, after the fact, not real-time.) You could build your own harness, perhaps, as indicated above. – Robert Crovella Jan 26 '22 at 17:16
  • @RobertCrovella If no tools are available to provide real-time divergence statistics, what alternative tools are available? – Silicomancer Jan 26 '22 at 18:18
  • Can clock_t measurement within the kernel be used to build a database for visualization of start-stop of all warps? – huseyin tugrul buyukisik Jan 27 '22 at 09:33
  • The clocks are not well synchronized between the multiprocessors (in my tests with a mobile RTX 2060). Apart from that, you can let thread 0 of each warp store the start/stop time into memory. This works. You probably also want to store the SM number for each block and the SM sub-partition number (0..3) for each warp to better interpret the values. – Sebastian Jan 27 '22 at 10:36
  • I think you can get better "synchronized" results using PTX [globaltimer](https://forums.developer.nvidia.com/t/question-about-ptxs-globaltimer/188506). You'd have to get pretty inventive to discover anything about divergence using any of this, from what I can see. – Robert Crovella Jan 27 '22 at 15:27
  • Sounds like a lot of effort, and a job for a CUDA expert, not a total beginner. Anyway, I am very surprised that there is no easy way to analyze divergence, since it is such a common and serious issue in GPU programming. I guess I will retreat to static analysis and a bit of trial and error. – Silicomancer Jan 27 '22 at 15:57
  • Divergence is mostly an issue within warps. The problem also isn't how many threads are still active. Even if all 32 are still active until the end, if they diverged it could take up to 32 times as long for them to get there. You could compare running your program with 1 active thread per warp (by returning the other 31 at once) to running it with 32 active threads per warp. If both results are the same, you have a lot of divergence. You could also check Nsight Compute for divergence and uncoalesced memory accesses. – Sebastian Jan 27 '22 at 16:20
  • @RobertCrovella I see. I ported the most simple of my algorithms. It has only one branch and I guess I will be able to code it in a non-divergent way. I will have a look at profilers for the more complex algorithms. But there is a huge divergence, because the number of iterations differs a lot per thread. Some threads exit after a few thousand iterations, others after a few million. A spread-sheet calculation showed that this reduces the efficiency of an average warp to ~25%. I have no idea how to fix this, and the web doesn't mention this kind of divergence a lot. – Silicomancer Jan 27 '22 at 16:20
  • Can you save the state and redistribute the remaining work across threads, so that each warp is fully occupied again? – Sebastian Jan 27 '22 at 16:23
  • @Sebastian Hm, using the thread count that way is a good idea. However it turns out my current issue isn't branch-induced divergence but the differing algorithmic run time of the threads (at least for the very basic loop I ported). – Silicomancer Jan 27 '22 at 16:25
  • @Sebastian Good idea. Saving and restoring the computation state is something I will need anyway (so the computations can be paused/resumed by the user). I would only need to save/restore an accumulator variable and the intermediate result. I will think about that. I also speculated about creating a global task buffer with an atomic read-index variable where threads could fetch new input instead of exiting (a sketch of this follows after these comments). The latter idea seems elegant since threads would keep running, but it also introduces a branch divergence that could be worse than the exit divergence :( – Silicomancer Jan 27 '22 at 16:48
  • @Sebastian About that save/restore idea... interrupting the threads is something that needs to be done cooperatively by the threads (e.g. by exiting the loops after counted intervals), not something that can be done in a preemptive way, e.g. by some API call, right? – Silicomancer Jan 27 '22 at 16:58
  • 1
    Yes, you could let all warps in a block stop every 1000 iterations, store the state, sync the block, reread the state and continue. (That would probably be even faster than just doing 1000 iterations per kernel call and restarting the kernels from the host or from the device with Dynamic Parallelism.) You would have to work out which indices to store to and which to read back from, depending on which threads are still running, but this is solvable. – Sebastian Jan 27 '22 at 18:16
  • 1
    @Sebastian Thanks for your help. I gave up on measuring divergence directly. Using my spread-sheet estimation and your proposal I was able to reduce the runtime of my reference algorithm from 313s (for my first naive implementation) to 7s. This is pretty impressive considering that a powerful 16-core HT CPU needed 286s for the original SISD algorithm. Your proposal (pipelining of jobs and computing in small servings) resulted in a speed-up by a factor of 2.3, which is pretty close to the loss estimated from divergence. Yay! – Silicomancer Jan 29 '22 at 14:26
  • Cool, good luck going forward! – Sebastian Jan 29 '22 at 15:51
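
A minimal sketch of the per-warp logging idea from the comments above. The WarpLog buffer, its layout and the helper names are illustrative assumptions, not part of the question: lane 0 of each warp reads the %smid, %warpid and %globaltimer special registers via inline PTX and stores them together with start/stop timestamps in global memory.

// Illustrative per-warp record; the layout is an assumption.
struct WarpLog
{
   unsigned int smid;        // multiprocessor the warp is running on
   unsigned int warpid;      // hardware warp id (volatile, may change on rescheduling; see PTX docs)
   unsigned long long start; // %globaltimer (ns) at warp start
   unsigned long long stop;  // %globaltimer (ns) at warp end
};

__device__ unsigned int readSmid()
{
   unsigned int smid;
   asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
   return smid;
}

__device__ unsigned int readWarpid()
{
   unsigned int warpid;
   asm volatile("mov.u32 %0, %%warpid;" : "=r"(warpid));
   return warpid;
}

__device__ unsigned long long readGlobaltimer()
{
   unsigned long long t;
   asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
   return t;
}

__global__ void instrumentedKernel(WarpLog* logs /*, real kernel arguments */)
{
   const int globalThread = blockIdx.x * blockDim.x + threadIdx.x;
   const int globalWarp = globalThread / warpSize;
   const bool isLane0 = (threadIdx.x % warpSize) == 0;

   if (isLane0)
   {
      logs[globalWarp].smid = readSmid();
      logs[globalWarp].warpid = readWarpid();
      logs[globalWarp].start = readGlobaltimer();
   }

   // ... actual (divergent) work of the kernel ...

   if (isLane0)
   {
      logs[globalWarp].stop = readGlobaltimer();
   }
}

Copying the log buffer back after the kernel has finished gives a per-warp timeline that can be grouped by SM.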
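
And a rough sketch of the global task buffer with an atomic read-index mentioned in the comments (processTask, nextTask and the one-item task granularity are assumptions for illustration): instead of exiting after its own input, each thread keeps fetching fresh inputs, so warps stay fully populated and only the tail of the queue diverges.

// Hypothetical per-task work; stands in for the real iteration loop.
__device__ float processTask(int task)
{
   return static_cast<float>(task);
}

__global__ void persistentKernel(int numTasks, float* results, unsigned int* nextTask)
{
   while (true)
   {
      // Fetch the next unprocessed input instead of exiting.
      const unsigned int task = atomicAdd(nextTask, 1u);
      if (task >= static_cast<unsigned int>(numTasks))
      {
         break; // queue exhausted; only the last few warps diverge here
      }
      results[task] = processTask(task);
   }
}

The chunked variant suggested above (stop every ~1000 iterations, store the state, sync the block, redistribute) can be combined with this so that long-running tasks do not monopolize a warp either.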

2 Answers

2

Besides the solutions given in the comments, you can use Nsight Compute to profile your kernels. You can run its CLI and then view the results in its GUI, e.g.:

ncu --export output --force-overwrite --target-processes application-only \
  --replay-mode kernel --kernel-regex-base function --launch-skip-before-match 0 \
  --section InstructionStats \
  --section Occupancy \
  --section SchedulerStats \
  --section SourceCounters \
  --section WarpStateStats \
  --sampling-interval auto \
  --sampling-max-passes 5 \
  --profile-from-start 1 --cache-control all --clock-control base \
  --apply-rules yes --import-source no --check-exit-code yes \
  your-application [arguments]

Then, in its GUI you can see some useful information. For example, in the Source Counters section you can see something like this:

(Screenshot: the Source Counters section in the Nsight Compute GUI.)

AmirSojoodi
  • 1
    Unfortunately ncu says "Profiling is not supported on this device". I use a Tesla K80 with CC3.7 which is not supported anymore. – Silicomancer Feb 14 '22 at 19:57
  • 1
    It turned out that nvvp works with my card. Unfortunately I couldn't find those source counters there. Did I overlook the right stats? Do you know whether nvvp supports similar numbers about divergence and branches? – Silicomancer Mar 29 '22 at 22:09
1

I found a solution that gives me pretty nice results. Calling the following function from selected lines of a kernel (adapted with a proper filter condition) prints the number of active threads in the current warp:

__device__ void printConvergentThreadCount(int line) // Pass __LINE__
{
   const int count = __popc(__activemask()); // Count of lanes in this warp that are currently active
   const int threadId = blockIdx.x * blockDim.x + threadIdx.x;
   if (threadId == 0) // Filter
   {
      printf("Line %i: %i\n", line, count);
   }
}
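
For illustration, a hypothetical call site (the kernel, its arguments and the per-thread loop are made up; only printConvergentThreadCount is from above):

__global__ void myKernel(const int* iterations, float* out, int n)
{
   const int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i >= n)
      return;

   printConvergentThreadCount(__LINE__); // typically 32 here

   float acc = 0.0f;
   for (int k = 0; k < iterations[i]; ++k) // per-thread trip count -> exit divergence
   {
      acc += 1.0f;
      printConvergentThreadCount(__LINE__); // count shrinks as lanes finish their loops
   }
   out[i] = acc;
}

Note that with the filter above only global thread 0 prints, once per iteration of its own loop, so in practice the filter condition would be adapted (e.g. print only every N-th iteration, or from one thread per warp).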

Still, this doesn't give numbers while the kernels are running.

Silicomancer
  • 1
    For live results during the kernel run you can write directly into host memory mapped into CUDA. – Sebastian Mar 30 '22 at 04:52
  • Directly into host RAM? How can I do that? I thought this was only possible by writing into global device memory and then copying it to host memory. – Silicomancer Mar 31 '22 at 23:22
  • 1
    It is also called zero-copy memory, here is an example: https://developer.ridgerun.com/wiki/index.php?title=NVIDIA_CUDA_Memory_Management With 64-bit addresses the graphics card knows whether the data is on the GPU or in system RAM, and for the latter it performs the memory transactions over PCIe during execution of the kernels. (A sketch follows below.) – Sebastian Apr 01 '22 at 01:49
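
A minimal sketch of that zero-copy approach (the kernel, the single progress counter and the launch configuration are illustrative assumptions): the host allocates mapped pinned memory with cudaHostAlloc, the kernel writes live values into it over PCIe, and the host polls the value while the kernel is still running.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void longRunningKernel(volatile unsigned int* liveCounter)
{
   for (int iter = 0; iter < 1000000; ++iter)
   {
      // ... real per-iteration work here ...
      if (blockIdx.x == 0 && threadIdx.x == 0)
      {
         *liveCounter = iter; // written over PCIe into pinned host RAM
      }
   }
}

int main()
{
   unsigned int* hostCounter = nullptr;
   unsigned int* devPtr = nullptr;

   // Pinned, mapped ("zero-copy") host allocation that the GPU can write to.
   cudaHostAlloc(&hostCounter, sizeof(unsigned int), cudaHostAllocMapped);
   cudaHostGetDevicePointer(&devPtr, hostCounter, 0);
   *hostCounter = 0;

   longRunningKernel<<<128, 256>>>(devPtr);

   // Poll the live value while the kernel is still running
   // (a real tool would throttle this loop instead of busy-printing).
   while (cudaStreamQuery(0) == cudaErrorNotReady)
   {
      printf("live progress: %u\n", *hostCounter);
   }

   cudaDeviceSynchronize();
   cudaFreeHost(hostCounter);
   return 0;
}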