
I have a CUDA program that calls a kernel repeatedly within a for loop. The code computes each row of a matrix from the values computed in the previous row until the entire matrix is done; it is basically a dynamic programming algorithm. The code below fills the (i,j) entry of many separate matrices in parallel with the kernel.

for(i = 1; i <= xdim; i++){
  for(j = 1; j <= ydim; j++){
    start3time = clock();
    assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
    end3time = clock();
    diff = static_cast<double>(end3time - start3time)/(CLOCKS_PER_SEC / 1000);
    printf("Time for i=%d j=%d is %f\n", i, j, diff);
  }
}

The kernel assign5 is straightforward:

__global__ void assign5(float* Z, int i, int j, int x, int y, int z) {
  int id = threadIdx.x + blockIdx.x * blockDim.x;

  // Z and database are device global arrays; dev_matrix and index are device constant arrays
  char ch = database[j + id];
  Z[i+id] = (Z[x+id] + Z[y+id] + Z[z+id]) * dev_matrix[i][index[ch - 'A']];
}

My problem is that when I run this program, the time for each i and j is 0 most of the time, but sometimes it is 10 milliseconds. So the output looks like this:

Time for i=0 j=0 is 0
Time for i=0 j=1 is 0
.
.
Time for i=15 j=21 is 10
Time for i=15 j=22 is 0
.

I don't understand why this is happening. I don't see a thread race condition. If I add

if(i % 20 == 0) cudaThreadSynchronize();

right after the first loop, then the time for each i and j is mostly 0, but the synchronize itself sometimes takes 10 or even 20 ms. It seems like CUDA performs many operations at low cost and then charges a lot for later ones. Any help would be appreciated.

  • Surely that is just aliasing because the precision of the time source is too low? – talonmies Jan 05 '12 at 15:49
  • I forgot to add that the variables Z and database are device global arrays, and dev_matrix and index are device constant arrays. The memory access is coalesced. – Ross Jan 05 '12 at 15:50
  • The results are inconsistent because your clock resolution is about 10ms, like talonmies said. Better to just measure the time of the whole loop. If you need high-res, per-iteration measurements, then [this answer](http://stackoverflow.com/a/588377/324625) should help. – Steve Blackwell Jan 05 '12 at 16:16
  • I've measured the time now in microseconds using the timer indicated above. Each early iteration (all i < 20) takes 3 or 4 microseconds. It then spikes to about 120 for the next three iterations and 8 or 9 microseconds from then on. In between there are more spikes but most iterations take 8 or 9. Interestingly, if I add a cudaThreadSynchronize() right after the first loop then all iterations take 3 or 4 microseconds without spikes. The cudaThreadSynchronize() takes about 700 microseconds each time. I don't see a race condition in my program. Could you say why this is happening? Thanks. – Ross Jan 05 '12 at 19:55
  • Is this on a WDDM windows platform with a recent driver? If so, the driver does batching to amortise the higher latency of that platform. – talonmies Jan 06 '12 at 00:52
  • No, it's on Linux. I believe I may be measuring the runtime incorrectly. Someone on a different forum said I should be using cudaEventRecord and cudaEventSynchronize. I'm getting consistent results now. – Ross Jan 06 '12 at 04:06

1 Answer


I think you have a misconception about what a kernel call in CUDA actually does on the host. A kernel launch is non-blocking: it is only added to the device's queue, and the host continues immediately. If you measure the time before and after the kernel call, the difference has nothing to do with how long the kernel takes to run; it only measures how long it takes to add the launch to the queue.

You should add a cudaThreadSynchronize() after every kernel call and before you measure end3time. cudaThreadSynchronize() blocks and returns only once all kernels in the queue have finished their work.
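
With your loop, that could look roughly like this (a sketch, not tested; error checking omitted):

for(i = 1; i <= xdim; i++){
  for(j = 1; j <= ydim; j++){
    start3time = clock();
    assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
    cudaThreadSynchronize();  // wait until the kernel has actually finished
                              // (cudaDeviceSynchronize() on newer toolkits)
    end3time = clock();
    diff = static_cast<double>(end3time - start3time)/(CLOCKS_PER_SEC / 1000);
    printf("Time for i=%d j=%d is %f\n", i, j, diff);
  }
}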

This is why

if(i % 20 == 0) cudaThreadSynchronize();

caused the spikes in your measurements: each synchronize waits for every kernel queued since the previous one, so their combined cost shows up at that point.
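
If you want per-launch timings that do not depend on the host clock's resolution, the cudaEventRecord/cudaEventSynchronize approach mentioned in the comments is the usual way; a minimal sketch (error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                      // marker enqueued before the kernel
assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
cudaEventRecord(stop, 0);                       // marker enqueued after the kernel
cudaEventSynchronize(stop);                     // block until the kernel and marker complete

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);  // elapsed GPU time in milliseconds
printf("Time for i=%d j=%d is %f ms\n", i, j, elapsedMs);

cudaEventDestroy(start);
cudaEventDestroy(stop);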

  • Thanks, this makes sense now. Also, it appears that cudaMemcpy will block until all threads are finished. – Ross Jan 08 '12 at 19:52
  • Yes, calls to cudaMemcpy block on the host until all threads are finished. This is the standard memcpy function because that's what you want most of the time. There is also cudaMemcpyAsync(), which is the non-blocking variant. – shapecatcher Jan 09 '12 at 00:52
  • 1
    Just as reference for future readers: cudaThreadSyncronize is deprecated and cudaDeviceSynchronize should be used instead. – AkiRoss Sep 28 '12 at 09:16