
I have read on many forums that the NVIDIA Visual Profiler serializes the program in order to collect timing information.

However, the Visual Profiler's context tab offers advice such as "There is no time overlap between memory copies and kernels on GPU", or, if memory copies and kernel execution do overlap, it displays the overlap time. Also, slide 6 of the following webinar shows an output trace of overlapping kernels.

I want to know whether the profiler can display information about concurrent kernel execution (i.e., if we run 3 kernels in parallel using 3 different streams, can the profiler show whether this is actually happening on the GPU?). If so, where in the Visual Profiler can I find this information?
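For reference, the scenario in question can be sketched roughly as follows. This is a minimal illustration rather than the actual program being profiled: `spin_kernel`, `launch_in_streams`, and the `CHECK` macro are hypothetical names, and the block size and iteration count are arbitrary values chosen only so that each kernel leaves free resources on the device and runs long enough to show up on a timeline.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: report a failing CUDA call and bail out of the function.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical busy-work kernel: each thread loops over some arithmetic so
// the kernel lasts long enough to be visible in a profiler timeline.
__global__ void spin_kernel(float *out, int iters)
{
    float v = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i) {
        v = v * 1.000001f + 0.5f;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

// Launches one small kernel per stream. Returns 0 on success, 1 on a CUDA error.
int launch_in_streams(int num_streams)
{
    const int threads = 64;      // one small block per kernel (arbitrary)
    const int iters = 1 << 20;   // arbitrary amount of busy work

    // concurrentKernels reports whether the device can overlap kernels at all.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    std::printf("concurrentKernels = %d\n", prop.concurrentKernels);

    std::vector<cudaStream_t> streams(num_streams);
    std::vector<float *> bufs(num_streams, nullptr);
    for (int i = 0; i < num_streams; ++i) {
        CHECK(cudaStreamCreate(&streams[i]));
        CHECK(cudaMalloc(&bufs[i], threads * sizeof(float)));
    }

    // Kernels in different non-default streams are allowed (not guaranteed)
    // to run concurrently.
    for (int i = 0; i < num_streams; ++i) {
        spin_kernel<<<1, threads, 0, streams[i]>>>(bufs[i], iters);
        CHECK(cudaGetLastError());
    }
    CHECK(cudaDeviceSynchronize());

    for (int i = 0; i < num_streams; ++i) {
        CHECK(cudaFree(bufs[i]));
        CHECK(cudaStreamDestroy(streams[i]));
    }
    return 0;
}

int main()
{
    return launch_in_streams(3);
}
```

Whether the three launches actually overlap depends on the hardware and on the profiler in use, so a timeline showing them back to back does not by itself mean the code is wrong.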

tomix86
shadow

1 Answer


Yes.

Both nvprof and Visual Profiler (nvvp) in CUDA Toolkit 5.0 (available as a preview release to registered CUDA developers) support concurrent kernel execution.

einpoklum
Eugene
  • What about CUDA Toolkit 4.0? Does it not allow concurrent kernels to be viewed in the Visual Profiler? (Note: I mean concurrent kernel execution, not memcpy and kernel execution overlap.) – shadow Aug 07 '12 at 16:22
  • As far as I remember, kernels were run synchronously in the pre-5.0 profiler. – Eugene Aug 07 '12 at 16:23
  • By the 5.0 profiler I guess you mean CUDA 5.0, which is new and only for registered users. Concurrent kernel execution has been available since well before that, from the introduction of the Fermi architecture (that is, all devices with CUDA compute capability 2.x, if I am not mistaken). Anyhow, do you know how to display this kernel concurrency in your profiler (is it the GPU time width plot)? – shadow Aug 07 '12 at 16:38
  • True, you can run kernels concurrently on pre-5.0 CUDA toolkits provided you have Fermi or Kepler hardware. But the kernels were run serially when the application was executed in profiling mode. The 5.0 profiler no longer has that restriction, and its timeline will properly display kernels running concurrently (e.g. their runs will overlap). – Eugene Aug 07 '12 at 18:27
  • Nsight Visual Studio Edition has supported concurrent kernel trace since 2.0 (Fermi launch). Nsight 2.1 added concurrent trace of device to device memory copies and memory set operations (implemented as kernels in most cases). Visual Profiler 5.0 uses the same solution as Nsight Visual Studio Edition. – Greg Smith Aug 08 '12 at 00:59
  • Thanks :) I guess I'll have to get my hands on CUDA 5.0. – shadow Aug 08 '12 at 15:42
  • I managed to get CUDA 5, but I am constantly getting segmentation faults when running programs compiled with CUDA 5.0 (these are all programs that run concurrent kernels using streams). Did any of you face this problem? – shadow Aug 09 '12 at 14:34
  • Try to compile and run some simple CUDA application (e.g. the vectorAdd SDK sample) to see if the problem is with the CUDA toolkit install or in your code. I have never seen the issue you describe. – Eugene Aug 09 '12 at 15:56
  • When trying to build the CUDA samples, I get the error message [g++: error: ../../common/lib/linux/i386/libGLEW.a: No such file or directory] partway through compilation. Then, when trying to run deviceQuery (which obviously does compile), it throws the error message [CUDA driver version is insufficient for CUDA runtime version]. My CUDA version is [nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2012 NVIDIA Corporation Built on Wed_May__2_12:51:20_PDT_2012 Cuda compilation tools, release 5.0, V0.2.1221] – shadow Aug 10 '12 at 12:00
  • And my driver version is [NVRM version: NVIDIA UNIX x86 Kernel Module 295.71 Thu Aug 2 19:30:55 PDT 2012 GCC version: gcc version 4.6.2 (Ubuntu/Linaro 4.6.2-10ubuntu1~10.04.2)]. According to the CUDA 5 release notes, my device, a GeForce 9200M GS, is compatible with CUDA 5, and this is the latest driver I can find. Do you know anything that can help me out here? Note that the only thing about my OS that doesn't match is the GLIBC version, but I don't think that should matter here. – shadow Aug 10 '12 at 12:01
  • Could you also tell me your driver version, so I can hopefully find a beta driver for my device that solves the problem? – shadow Aug 10 '12 at 12:25
  • OK, I figured it out. If anyone has problems getting correct results, or sees possible stack overflows with CUDA 5, it is possibly due to a mismatch between the driver version and the CUDA toolkit version. You can verify this by running deviceQuery. You will most likely have to install a beta driver for your device with version 3xx.xx. – shadow Aug 10 '12 at 15:08
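The driver/toolkit mismatch described in the last comment can also be checked from code instead of through deviceQuery. The sketch below is an illustration under assumptions: `driver_supports_runtime` and `check_versions` are hypothetical helper names, while `cudaDriverGetVersion` and `cudaRuntimeGetVersion` are the CUDA runtime calls that report the two versions (encoded as 1000 * major + 10 * minor).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: the installed driver must support a CUDA version at
// least as new as the runtime the program was built against.
bool driver_supports_runtime(int driver_version, int runtime_version)
{
    return driver_version >= runtime_version;
}

// Hypothetical helper: returns 0 if the versions are compatible, 1 if the
// driver is too old (or missing), 2 if a version query itself failed.
int check_versions()
{
    int driver_version = 0;
    int runtime_version = 0;

    // cudaDriverGetVersion reports 0 when no driver is installed.
    if (cudaDriverGetVersion(&driver_version) != cudaSuccess) {
        return 2;
    }
    if (cudaRuntimeGetVersion(&runtime_version) != cudaSuccess) {
        return 2;
    }

    std::printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
                driver_version / 1000, (driver_version % 1000) / 10,
                runtime_version / 1000, (runtime_version % 1000) / 10);

    return driver_supports_runtime(driver_version, runtime_version) ? 0 : 1;
}

int main()
{
    int status = check_versions();
    if (status == 1) {
        std::fprintf(stderr, "driver is older than the CUDA runtime\n");
    }
    return status;
}
```

A nonzero result here corresponds to the "CUDA driver version is insufficient for CUDA runtime version" message quoted above; the fix is still to install a newer driver.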