
I have a GTX560 and today I bought a GTX760. Why is the kernel below slower (~0.031 ms) on the GTX760 than on the GTX560 (~0.0232 ms)? When I increase n to 1000000 it is faster (~25%), but for small n it is not. I have two computers. The first one (GTX560 inside) has an Intel(R) Core(TM) i5 CPU, a P7P55D-E LX mainboard, CUDA 5.0 and Kubuntu 12.04. The second one (GTX760 inside) has an AMD FX(tm)-6300, a 760GA-P43(FX) mainboard, CUDA 6.5 and Kubuntu 14.04. But I still think the reason is not the different CPUs, etc.

GTX560: nvcc -arch=sm_20 -fmad=false -O3 -o vecc vecc.cu -lm
GTX760: nvcc -arch=sm_30 -fmad=false -O3 -o vecc vecc.cu -lm

I tried changing the block size too, but it had no significant effect.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
 // Get our global thread ID
 int id = blockIdx.x*blockDim.x+threadIdx.x;

 // Make sure we do not go out of bounds
 if (id < n)
 c[id] = sqrt(a[id]*b[id]);
}

int main( int argc, char* argv[] )
 {
  cudaEvent_t start, stop;
  float elapsedTime;

  // Size of vectors
  int n = 512;

  // Host input vectors
  double *h_a;
  double *h_b;
  //Host output vector
  double *h_c;

  // Device input vectors
  double *d_a;
  double *d_b;
  //Device output vector
  double *d_c;

  // Size, in bytes, of each vector
  size_t bytes = n*sizeof(double);

  // Allocate memory for each vector on host
  h_a = (double*)malloc(bytes);
  h_b = (double*)malloc(bytes);
  h_c = (double*)malloc(bytes);

  // Allocate memory for each vector on GPU
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMalloc(&d_c, bytes);

  int i;
  // Initialize vectors on host
  for( i = 0; i < n; i++ ) {
   h_a[i] = sin(i)*sin(i);
   h_b[i] = cos(i)*cos(i);
  }

 // Copy host vectors to device
 cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
 cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

 int blockSize, gridSize;

 // Number of threads in each thread block
 blockSize = 256;

 // Number of thread blocks in grid
 gridSize = (int)ceil((float)n/blockSize);

 // Execute the kernel
 cudaEventCreate(&start);
 cudaEventRecord(start,0);

 vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

 cudaEventCreate(&stop);
 cudaEventRecord(stop,0);
 cudaEventSynchronize(stop);

 cudaEventElapsedTime(&elapsedTime, start,stop);
 printf("Elapsed time : %f ms\n" ,elapsedTime);

 // Copy array back to host
 cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

 // Sum up vector c and print result divided by n, this should equal 1 within error
 double sum = 0;
 for(i=0; i<n; i++)
  sum += h_c[i];
 printf("final result: %f\n", sum/n);

 // Release device memory
 cudaFree(d_a);
 cudaFree(d_b);
 cudaFree(d_c);

 // Release host memory
 free(h_a);
 free(h_b);
 free(h_c);

 return 0;
}
  • GPUs are meant for massively parallel computation. You're launching 512 threads, across two blocks. This doesn't get close to saturating either of your GPUs. What you're actually measuring is probably almost all due to launch overheads. Launch overheads are dependent on your entire system, not just your GPU. – Jez Nov 18 '14 at 00:04
  • And just to further support the comment by @Jez, when you make the problem large enough to saturate the GPU, then the 760 is faster. Furthermore, you should not put your `cudaEventCreate(&stop);` call in the timing region. Do that event creation before you start timing. Jez if you provide an answer I would upvote. – Robert Crovella Nov 18 '14 at 00:23

1 Answer


This is pretty much an extension of the comments, so they deserve the credit.

Here there are two different situations to study:

  • A: 512 threads launched
  • B: 1000000 threads launched

In case A you are not giving enough work to your GPUs; you are basically measuring the overhead of launching the kernel. As the comments point out, that overhead depends on your entire system. The time spent on the GPU side is negligible.
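As a rough sketch of how the timing region could be restructured (this is not from the original question; it follows the comment about creating both events before the timed region, and adds a warm-up launch so one-time initialization costs are not counted):

 cudaEvent_t start, stop;
 cudaEventCreate(&start);
 cudaEventCreate(&stop);

 // Warm-up launch: absorbs one-time initialization costs
 vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
 cudaDeviceSynchronize();

 // Timed launch
 cudaEventRecord(start, 0);
 vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
 cudaEventRecord(stop, 0);
 cudaEventSynchronize(stop);

 float elapsedTime;
 cudaEventElapsedTime(&elapsedTime, start, stop);
 printf("Elapsed time : %f ms\n", elapsedTime);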

Here and here you can find some information and timings, such as this chart illustrating the launch overhead for different GPUs:

[Chart: kernel launch overhead measured on different GPUs]

In case B the time spent on the GPU side is larger because you increased the number of threads. Here the 760 has better hardware and simply does the job faster, overcoming the kernel launch overhead.
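A quick back-of-the-envelope check makes the difference concrete (the SM counts below are assumptions about these particular card models, not taken from the question):

 // Sketch: how many blocks each launch produces with blockSize = 256.
 // Assumed SM counts: roughly 7 SMs on a GTX 560, 6 SMX on a GTX 760.
 int blocksSmall = (512     + 255) / 256;  // = 2 blocks
 int blocksLarge = (1000000 + 255) / 256;  // = 3907 blocks
 // Two blocks cannot even occupy every SM, so most of the GPU sits idle and
 // the measurement is dominated by launch overhead; ~3907 blocks keep all
 // SMs busy, and then the faster card wins.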

There are some other factors here related to the CUDA programming model itself; feeding more work to the GPU has positive consequences on performance, but I believe that discussion is beyond the scope of this answer. Check these posts (1, 2) to get an idea of the topic.

Your kernel is basically limited by memory bandwidth: the 760's peak bandwidth is over 192 GB/s while the 560's is around 128 GB/s, so your kernel should run faster on the 760 even if both cards had the same number of cores.
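One way to sanity-check this is to convert the measured time into an effective bandwidth. A minimal sketch, assuming the elapsedTime and n variables from the code above (each element reads a[id] and b[id] and writes c[id], i.e. 3 doubles = 24 bytes of traffic):

 double seconds   = elapsedTime / 1000.0;            // cudaEventElapsedTime reports milliseconds
 double gigabytes = 3.0 * n * sizeof(double) / 1e9;  // total bytes moved by the kernel
 printf("Effective bandwidth: %f GB/s\n", gigabytes / seconds);

For n = 1000000 that is 24 MB of traffic; at ~192 GB/s (760) the memory traffic alone takes roughly 0.125 ms, and at ~128 GB/s (560) roughly 0.19 ms, which is why the 760 pulls ahead once the launch overhead stops dominating.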

A note about memory transfers

Your timing is not affected by memory-transfer overhead, since the transfers are outside the measured region, but I'll still leave a note here about it, since it is useful for explaining the performance difference of the complete code.

The overhead and the transfer time also depend on your entire system, including both the hardware and the software side. Think about the RAM bandwidth your machine has: it depends on the motherboard chipset, the clock frequency of the modules, the number of channels and modules, the peak bandwidth your CPU can handle, etc. Some of these parameters also affect the speed of memory transfers over the PCIe bus.

I encourage you to measure the available bandwidth on your system. You can use the STREAM benchmark to measure the RAM bandwidth, and the bandwidthTest utility provided in the CUDA samples (utilities directory) to measure the CPU-GPU bandwidth over the PCIe bus. That will give you insight into your machines and a starting point for further comparisons.
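If you prefer to measure the PCIe bandwidth directly in code rather than through bandwidthTest, a minimal sketch could look like this (pinned host memory is assumed so the figure is close to the achievable peak):

 #include <stdio.h>
 #include <cuda_runtime.h>

 int main(void)
 {
  size_t bytes = 64 << 20;                      // 64 MB transfer
  float *h, *d;
  cudaMallocHost((void**)&h, bytes);            // pinned host memory
  cudaMalloc((void**)&d, bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);
  cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  float ms;
  cudaEventElapsedTime(&ms, start, stop);
  printf("Host->Device: %f GB/s\n", (bytes / 1e9) / (ms / 1000.0));

  cudaFreeHost(h);
  cudaFree(d);
  return 0;
 }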
