
I would like to know how I can find out exactly where my application spends most of its time. It is C++ code with CUDA calls, so from the C++ code I have created wrappers that call the CUDA code. Timing the C++ code gives 5 seconds of execution, yet if I profile the code in Nsight, the kernel takes 8 ms. How can that be possible?

From the C++ code:

// Host-side wall-clock timing around the wrapper call
double start_divide = get_host_current_time();
callDivideKernel( keep, d_a, d_A_N );
double end_divide = get_host_current_time();
printf("divideKernel : %g\n", end_divide - start_divide);

.cu file:

__global__ void DivideKernel(int N, float* A, int* A_N);  // forward declaration, since the definition comes later in this file

void callDivideKernel(int N, float* A, int* A_N){

  cudaEvent_t start, stop;
  float time;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  dim3 dimGrid(618,128);
  dim3 dimBlock(512);

  cudaEventRecord(start, 0);
  DivideKernel<<< dimGrid,dimBlock >>>(N, A, A_N);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);      // blocks until the kernel has finished
  cudaEventElapsedTime(&time, start, stop);
  printf("callDividekernel = %f ms\n",time);
  cudaThreadSynchronize();         // deprecated; cudaDeviceSynchronize() is the modern equivalent

  cudaEventDestroy(start);         // release the timing events
  cudaEventDestroy(stop);
}

__global__ void DivideKernel(int N, float* A, int* A_N){

  // Flatten the 2D grid into a single linear thread index k.
  int k = blockIdx.x * blockDim.x + threadIdx.x +
          blockDim.x * gridDim.x * blockIdx.y;

  // One thread per element of the lower triangle: kmax = N(N+1)/2.
  int kmax = (N*(N+1))/2;
  int row, col;

  if(k < kmax){
    // Invert the triangular mapping k = row*(row+1)/2 + col to recover (row, col).
    row = (int)(sqrt(0.25 + 2.0*k) - 0.5);
    col = k - (row*(row+1))/2;
    int val = max(1, A_N[row*N + col]);  // clamp to 1 to avoid division by zero
    A[row*N + col] /= (float)val;
  }
}

Results:

callDividekernel = 7.111040 ms
divideKernel : 5.66533

Comments:

  • You can use the CUDA event API to break your code up into pieces (both CUDA portions and non-CUDA portions) to see where the overall execution time is spent. It's possible that the kernel is only taking 8 ms while other portions (e.g. data copy, and/or non-CUDA code) are using up the remainder of the execution time. – Robert Crovella Oct 15 '12 at 19:28 (a sectioned-timing sketch follows this thread)
  • Please provide the relevant part of your code if you really want a useful answer. – Thomas Berger Oct 15 '12 at 20:27
  • Why won't you use the Visual Profiler? – lashgar Oct 15 '12 at 21:14
  • If you are running CUDA 4.0 or higher on a platform that supports UVA (Unified Virtual Addressing), CUDA takes a long time to perform the huge virtual memory allocations at initialization time. I suspect that's what you are seeing. – ArchaeaSoftware Oct 16 '12 at 03:25
  • @ahmad: Visual Profiler (Nsight) gives an average of 8ms for this kernel – Manolete Oct 16 '12 at 08:32
  • @ArchaeaSoftware: I am currently using CUDA 5, does it make any difference? – Manolete Oct 16 '12 at 08:33
  • @ThomasBerger: Please have a look at my code. It seems like it spends a lot of time coming back from the device without any transfer. It does not make sense to me... – Manolete Oct 16 '12 at 08:34
  • @Manolete I assume `float* A, int* A_N` are already device-allocated memory segments? – Thomas Berger Oct 16 '12 at 08:47
  • @Manolete: Is this on a Windows platform? – talonmies Oct 16 '12 at 09:04
  • It is a Linux platform, and everything has been previously allocated and checked for errors. The kernel works, but there is this time difference that I can't understand where it is coming from. – Manolete Oct 16 '12 at 09:16
  • Insert a `cudaDeviceSynchronize()` before taking the start time. You may be waiting for some other asynchronous call to finish. – tera Oct 16 '12 at 10:59
  • @tera: I've done it, but same result – Manolete Oct 16 '12 at 11:23
  • OK, next try: your sample code has no error checking. I assume you do actually check in the code you are testing. Are any error codes returned, particularly from the `cudaThreadSynchronize()` (which by the way is deprecated and should be replaced by `cudaDeviceSynchronize()`)? Five seconds looks suspiciously like a timeout. Is a display connected to your GPU? Does it freeze during those 5 s? Are you performing any other CUDA calls as well? Do they succeed (again, I assume you check every return code, or you [should not even ask here](http://stackoverflow.com/tags/cuda/info#how-to))? – tera Oct 16 '12 at 15:06 (an error-checking sketch follows this thread)
  • What provides get_host_current_time()? Do you have persistence mode enabled in the GPU driver? Can you provide a complete, compilable code example? – Robert Crovella Oct 16 '12 at 15:26 (a possible implementation is sketched below)
  • I built a complete, compilable example out of your sample code, substituting clock() from time.h for your get_host_current_time(), and the results I got were callDividekernel = 7.078560 ms and divideKernel : 0.01 – Robert Crovella Oct 16 '12 at 16:03
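
A minimal sketch of the sectioned timing Robert Crovella suggests, with tera's synchronize-before-timing advice and a warm-up call for the context-initialization cost ArchaeaSoftware mentions. The kernel, buffer names, and sizes below are placeholders invented for illustration, not taken from the question:

  // sections.cu - bracket each phase (H2D copy, kernel, D2H copy) with
  // CUDA events to see where the wall-clock time actually goes.
  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  __global__ void Scale(float* a, int n){            // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
  }

  int main(){
    const int n = 1 << 20;
    float* h = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaFree(0);                                     // pay context creation up front
    float* d;
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaEvent_t ev[4];
    for (int i = 0; i < 4; ++i) cudaEventCreate(&ev[i]);

    cudaDeviceSynchronize();                         // drain earlier async work first
    cudaEventRecord(ev[0], 0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1], 0);
    Scale<<< (n + 255) / 256, 256 >>>(d, n);
    cudaEventRecord(ev[2], 0);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[3], 0);
    cudaEventSynchronize(ev[3]);

    float h2d, krn, d2h;
    cudaEventElapsedTime(&h2d, ev[0], ev[1]);
    cudaEventElapsedTime(&krn, ev[1], ev[2]);
    cudaEventElapsedTime(&d2h, ev[2], ev[3]);
    printf("H2D %f ms, kernel %f ms, D2H %f ms\n", h2d, krn, d2h);

    for (int i = 0; i < 4; ++i) cudaEventDestroy(ev[i]);
    cudaFree(d); free(h);
    return 0;
  }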

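tera's point about checking every return code is usually handled with a small macro; the one below is a common community idiom, not code from the question:

  // check.cuh - generic CUDA error-checking idiom
  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  #define CUDA_CHECK(call)                                               \
    do {                                                                 \
      cudaError_t err = (call);                                          \
      if (err != cudaSuccess) {                                          \
        fprintf(stderr, "CUDA error %s at %s:%d\n",                      \
                cudaGetErrorString(err), __FILE__, __LINE__);            \
        exit(EXIT_FAILURE);                                              \
      }                                                                  \
    } while (0)

  // Usage (kernel launches return no status, so query it explicitly):
  //   DivideKernel<<< dimGrid, dimBlock >>>(N, A, A_N);
  //   CUDA_CHECK(cudaGetLastError());       // catches launch failures
  //   CUDA_CHECK(cudaDeviceSynchronize());  // catches async/timeout errors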
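
get_host_current_time() is never shown in the question, so the following is purely an assumed implementation for a Linux platform. Note that clock(), which Robert Crovella substituted, counts CPU time rather than wall time, so a clock()-based timer and a wall-clock timer can legitimately disagree:

  // Hypothetical stand-in for the question's get_host_current_time().
  #include <time.h>

  double get_host_current_time(void){
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);  // monotonic wall clock
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
  }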