
A Fermi-generation GPU's single-precision calculation should be 2 times faster than double precision. However, although I rewrote every 'double' declaration as 'float', I got no speedup. Is there some mistake, e.g. a compile option?

GPU: Tesla C2075, OS: Windows 7 Pro, Compiler: VS2013 (nvcc), CUDA: v7.5, Command line: nvcc test.cu

I wrote this test code:

#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<time.h>
#include<conio.h>

#include<cuda_runtime.h>
#include<cuda_profiler_api.h> 
#include<device_functions.h>
#include<device_launch_parameters.h>

#define DOUBLE 1

#define MAXI 10

__global__ void Kernel_double(double*a,int nthreadx)
{
    double b=1.e0;
    int i;
    i = blockIdx.x * nthreadx + threadIdx.x + 0;
    a[i] *= b;
}
__global__ void Kernel_float(float*a,int nthreadx)
{
    float b=1.0F;
    int i;
    i = blockIdx.x * nthreadx + threadIdx.x + 0;
    a[i] *= b;
}

int main()
{
#if DOUBLE
    double a[10];
    for(int i=0;i<MAXI;++i){
        a[i]=1.e0;
    }
    double*d_a;
    cudaMalloc((void**)&d_a, sizeof(double)*(MAXI));
    cudaMemcpy(d_a, a, sizeof(double)*(MAXI), cudaMemcpyHostToDevice);
#else
    float a[10];
    for(int i=0;i<MAXI;++i){
        a[i]=1.0F;
    }
    float*d_a;
    cudaMalloc((void**)&d_a, sizeof(float)*(MAXI));
    cudaMemcpy(d_a, a, sizeof(float)*(MAXI), cudaMemcpyHostToDevice);
#endif

    dim3 grid(2, 2, 1);
    dim3 block(2, 2, 1);

    clock_t start_clock, end_clock;
    double sec_clock;

    printf("[%d] start\n", __LINE__);
    start_clock = clock();

    for (int i = 1; i <= 100000; ++i){
#if DOUBLE
        Kernel_double << < grid, block >> > (d_a, 2);
        cudaMemcpy(a, d_a, sizeof(double)*(MAXI), cudaMemcpyDeviceToHost);
#else
        Kernel_float << < grid, block >> > (d_a, 2);
        cudaMemcpy(a, d_a, sizeof(float)*(MAXI), cudaMemcpyDeviceToHost);
#endif
    }

    end_clock = clock();
    sec_clock = (end_clock - start_clock) / (double)CLOCKS_PER_SEC;
    printf("[%d] %f[s]\n", __LINE__, sec_clock);
    printf("[%d] end\n", __LINE__);

    return 0;
}
stg
    If you say double is faster than single and then wonder why no speedup when using single, I don't really get the question. Also don't tag c code as c++ – Sami Kuhmonen May 04 '16 at 04:46
  • Oh, my mistake, I swapped double and single. It should read: "Fermi generation GPU's single precision calculation should be 2 times faster than double precision." I'm very sorry. – stg May 04 '16 at 05:12
    You count the memcpy (massive overhead), and your array is too small (10), meaning execution time must be hidden behind kernel spawn time – Regis Portalez May 04 '16 at 06:09
  • Thank you for your suggestion. But if the cudaMemcpy is moved out of the loop, there is still no difference between double and float, so I don't think the problem is rooted in the memcpy overhead. Also, the array size is much larger in the original code (over 100000). – stg May 04 '16 at 06:25
    @stg: 100000 floating point multiplies is a trivial amount of work for any GPU. Your example is just measuring latency, not floating point performance, because there is far too little work in the kernels you are using. – talonmies May 04 '16 at 07:02
  • @talonmies, should this example not be simplified-out by the optimizer, it would be a memory-bound problem. Hence, nothing related to "floating point multiplies". Traversing 100000 floats read-write would take about 5 microseconds (if done well), way below kernel launch and mem copy. – Florent DUGUET May 04 '16 at 08:16
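
As the comments above point out, the original test mostly measures kernel launch latency and the cudaMemcpy inside the loop, not arithmetic throughput. Below is a minimal sketch (not from the original post; the kernel, sizes, and names are illustrative) of timing only the kernel executions with CUDA events, with the copies kept out of the measured region:

#include<stdio.h>
#include<cuda_runtime.h>

// Illustrative kernel: enough independent work per launch to be measurable.
__global__ void scale(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;                    // ~1M elements
    float* d_a;
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));    // contents don't matter for timing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                   // timed region: kernels only, no memcpy
    for (int it = 0; it < 1000; ++it)
        scale<<<(n + 255) / 256, 256>>>(d_a, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);               // wait until all queued kernels finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // GPU time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    return 0;
}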

1 Answer


Well, after some investigation, that's because you just perform a multiplication by the constant 1, which the compiler optimizes to "do nothing" in the binary:

[Screenshot of the disassembly: the multiplication by 1 is compiled away, leaving an effectively empty kernel]

If instead you square the array (to prevent this trivial optimization), you get the following assembly:

[Screenshot of the disassembly: squaring the array produces an actual multiply instruction]

and the performance gains are restored with the (simplified) piece of code below, in which I changed a few things:

  • a much larger array (100M elements)
  • using blockDim.x instead of passing the block width as a kernel argument
  • a kernel configuration better suited to my machine (GTX 980)
  • allocating the input array on the heap instead of the stack (to allow more than 1M elements)

Here is the code:

#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<time.h>
#include<conio.h>

#include<cuda_runtime.h>
#include<cuda_profiler_api.h> 
#include<device_functions.h>
#include<device_launch_parameters.h>

#define DOUBLE float    // change to double to time double precision

#define ITER 10
#define MAXI 100000000

// Grid-stride loop: each thread steps through the array in strides of the
// total thread count, so a small fixed grid still covers all MAXI elements.
__global__ void kernel(DOUBLE*a)
{
    for(int i = blockIdx.x * blockDim.x + threadIdx.x ; i < MAXI; i += blockDim.x * gridDim.x) 
    {
        a[i] *= a[i];   // squaring prevents the compiler from removing the multiply
    }
}

int main()
{
    DOUBLE* a = (DOUBLE*) malloc(MAXI*sizeof(DOUBLE));
    for(int i=0;i<MAXI;++i)
    {
        a[i]=(DOUBLE)1.0;
    }
    DOUBLE* d_a;
    cudaMalloc((void**)&d_a, sizeof(DOUBLE)*(MAXI));
    cudaMemcpy(d_a, a, sizeof(DOUBLE)*(MAXI), cudaMemcpyHostToDevice);

    clock_t start_clock, end_clock;
    double sec_clock;

    printf("[%d] start\n", __LINE__);
    start_clock = clock();

    for (int i = 1; i <= ITER; ++i){
        kernel <<< 32, 256>>> (d_a);
    }
    cudaDeviceSynchronize();

    end_clock = clock();
    cudaMemcpy(a, d_a, sizeof(DOUBLE)*(MAXI), cudaMemcpyDeviceToHost);
    sec_clock = (end_clock - start_clock) / (double)CLOCKS_PER_SEC;
    printf("[%d] %f/%d[s]\n", __LINE__, sec_clock, CLOCKS_PER_SEC);
    printf("[%d] end\n", __LINE__);

    cudaFree(d_a);
    free(a);
    return 0;
}

(You'll notice I allocate an array of length 100M to get measurable performance.)
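
As a side note (not part of the original answer; names like kernel_t and run are made up), the same comparison can be written with a template parameter instead of the #define, so both precisions are timed in a single run:

#include<stdio.h>
#include<time.h>
#include<cuda_runtime.h>

#define ITER 10
#define MAXI 100000000

// Same grid-stride squaring kernel as above, templated on the element type.
template<typename T>
__global__ void kernel_t(T* a)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < MAXI; i += blockDim.x * gridDim.x)
        a[i] *= a[i];
}

// Times ITER launches for one element type and returns the elapsed seconds.
template<typename T>
double run()
{
    T* d_a;
    cudaMalloc((void**)&d_a, sizeof(T) * MAXI);
    cudaMemset(d_a, 0, sizeof(T) * MAXI);   // contents don't matter for timing

    clock_t start = clock();
    for (int i = 0; i < ITER; ++i)
        kernel_t<T><<<32, 256>>>(d_a);
    cudaDeviceSynchronize();                // wait for all queued kernels
    clock_t end = clock();

    cudaFree(d_a);
    return (end - start) / (double)CLOCKS_PER_SEC;
}

int main()
{
    printf("float : %f s\n", run<float>());
    printf("double: %f s\n", run<double>());
    return 0;
}

On a memory-bound kernel like this one, the roughly 2x gap between float and double comes mostly from moving half as many bytes, as the comment below explains, rather than from the ratio of single- to double-precision units.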

Regis Portalez
    This code is memory-bound. In double precision, you have 64 DP units on your GTX 980, 1 FMA per cycle @ 1.216 GHz => 156 GFLOPS. The bandwidth reads 224 GB/s. This leads to 5.57 FLOPS per memory operation (with theoretical peak). Expressing anything on DP-flops for this card requires at least 6 FLOPS (well, 3 multiplies should do as FLOPS account for FMA) per memory operation, you have 1 here. You are measuring bandwidth. (Moreover, getting peak bandwidth is hard, hence you might need even more FLOPS). – Florent DUGUET May 04 '16 at 08:12
  • But the first part of the answer points out what really goes wrong: the optimization during compilation that removes the useless computation. :) – Taro May 04 '16 at 08:15
  • Thanks all, now I understand the optimization issue. With @Regis' code, float is indeed 2x faster than double, but what is the meaning of this loop? If I rewrite the kernel without the loop (keeping `int i = blockIdx.x * blockDim.x + threadIdx.x;`) and change ITER to 100000, there is no difference between double and float. A memory-bound problem..? Umm, I'm new to CUDA, I have to study more.. thanks. – stg May 04 '16 at 09:24
  • If you don't loop, you won't iterate through all the elements. The meaning of the loop is to iterate through all elements in a [coalesced way](http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory). – Regis Portalez May 04 '16 at 10:06
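
For illustration (not part of the original thread; the kernel name and launch numbers are made up), the no-loop variant stg describes looks like this, and it only works if the launch supplies enough blocks to cover the whole array:

#include<stdio.h>
#include<cuda_runtime.h>

#define MAXI 100000000

// One element per thread: no grid-stride loop, so the grid itself
// must be large enough to cover all MAXI elements.
__global__ void square_one_per_thread(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the last block may run past the end
        a[i] *= a[i];
}

int main()
{
    float* d_a;
    cudaMalloc((void**)&d_a, sizeof(float) * MAXI);
    cudaMemset(d_a, 0, sizeof(float) * MAXI);

    int threads = 256;
    int blocks = (MAXI + threads - 1) / threads;   // ~390625 blocks for 100M elements
    // Note: on Fermi (e.g. the Tesla C2075 in the question) gridDim.x is limited
    // to 65535 blocks, so a launch this big fails there; a grid-stride loop with
    // a small fixed grid such as <<<32, 256>>> sidesteps that limit.
    square_one_per_thread<<<blocks, threads>>>(d_a, MAXI);
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();

    cudaFree(d_a);
    return 0;
}

Either way, consecutive threads touch consecutive elements, so both versions access memory in the coalesced way the linked guide describes; the grid-stride loop simply decouples the grid size from the array size.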