
A Fermi-generation GPU's single-precision calculation should be 2 times faster than double precision. However, although I rewrote every 'double' declaration as 'float', I got no speedup. Is there some mistake, e.g. a compile option?

GPU: Tesla C2075, OS: Windows 7 Pro, Compiler: VS2013 (nvcc), CUDA: v7.5, Command line: nvcc test.cu

I wrote this test code:

#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<time.h>
#include<conio.h>

#include<cuda_runtime.h>
#include<cuda_profiler_api.h> 
#include<device_functions.h>
#include<device_launch_parameters.h>

#define DOUBLE 1

#define MAXI 10

__global__ void Kernel_double(double*a,int nthreadx)
{
    double b=1.e0;
    int i;
    i = blockIdx.x * nthreadx + threadIdx.x + 0;
    a[i] *= b;
}
__global__ void Kernel_float(float*a,int nthreadx)
{
    float b=1.0F;
    int i;
    i = blockIdx.x * nthreadx + threadIdx.x + 0;
    a[i] *= b;
}

int main()
{
#if DOUBLE
    double a[10];
    for(int i=0;i<MAXI;++i){
        a[i]=1.e0;
    }
    double*d_a;
    cudaMalloc((void**)&d_a, sizeof(double)*(MAXI));
    cudaMemcpy(d_a, a, sizeof(double)*(MAXI), cudaMemcpyHostToDevice);
#else
    float a[10];
    for(int i=0;i<MAXI;++i){
        a[i]=1.0F;
    }
    float*d_a;
    cudaMalloc((void**)&d_a, sizeof(float)*(MAXI));
    cudaMemcpy(d_a, a, sizeof(float)*(MAXI), cudaMemcpyHostToDevice);
#endif

    dim3 grid(2, 2, 1);
    dim3 block(2, 2, 1);

    clock_t start_clock, end_clock;
    double sec_clock;

    printf("[%d] start\n", __LINE__);
    start_clock = clock();

    for (int i = 1; i <= 100000; ++i){
#if DOUBLE
        Kernel_double << < grid, block >> > (d_a, 2);
        cudaMemcpy(a, d_a, sizeof(double)*(MAXI), cudaMemcpyDeviceToHost);
#else
        Kernel_float << < grid, block >> > (d_a, 2);
        cudaMemcpy(a, d_a, sizeof(float)*(MAXI), cudaMemcpyDeviceToHost);
#endif
    }

    end_clock = clock();
    sec_clock = (end_clock - start_clock) / (double)CLOCKS_PER_SEC;
    printf("[%d] %f[s]\n", __LINE__, sec_clock);
    printf("[%d] end\n", __LINE__);

    return 0;
}
stg
    If you say double is faster than single and then wonder why no speedup when using single, I don't really get the question. Also don't tag c code as c++ – Sami Kuhmonen May 04 '16 at 04:46
  • Oh, my mistake, I swapped double and single. It should read: "Fermi generation GPU's single precision calculation should be 2 times faster than double precision." I'm very sorry. – stg May 04 '16 at 05:12
    You count the memcpy (massive overhead), and your array is too small (10), meaning execution time must be hidden behind kernel spawn time – Regis Portalez May 04 '16 at 06:09
  • Thank you for your suggestion. But if the cudaMemcpy is moved out of the loop, there is still no difference between double and float, so I don't think the problem is rooted in the memcpy overhead. Also, the array size is much larger in the original code (over 100000). – stg May 04 '16 at 06:25
    @stg: 100000 floating point multiplies is a trivial amount of work for any GPU. Your example is just measuring latency, not floating point performance, because there is far too little work in the kernels you are using. – talonmies May 04 '16 at 07:02
  • @talonmies, should this example not be simplified-out by the optimizer, it would be a memory-bound problem. Hence, nothing related to "floating point multiplies". Traversing 100000 floats read-write would take about 5 microseconds (if done well), way below kernel launch and mem copy. – Florent DUGUET May 04 '16 at 08:16
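
As the comments above point out, the original test mostly measures kernel launch latency and the cudaMemcpy inside the loop, not arithmetic throughput. Below is a minimal sketch (not from the original post; the kernel, sizes, and names are illustrative) of timing only the kernel executions with CUDA events, with the copies kept out of the measured region:

#include<stdio.h>
#include<cuda_runtime.h>

// Illustrative kernel: enough independent work per launch to be measurable.
__global__ void scale(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;                    // ~1M elements
    float* d_a;
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));    // contents don't matter for timing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                   // timed region: kernels only, no memcpy
    for (int it = 0; it < 1000; ++it)
        scale<<<(n + 255) / 256, 256>>>(d_a, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);               // wait until all queued kernels finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // GPU time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    return 0;
}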

1 Answer


Well, after some investigation, that's because you just perform a multiplication by the constant 1, which the compiler optimizes to "do nothing" in the binary:

[Screenshot of the disassembly: the multiplication by 1 is compiled away, leaving an effectively empty kernel]

If instead you square the array (to prevent this trivial optimization), you get the following assembly:

[Screenshot of the disassembly: squaring the array produces an actual multiply instruction]

and the performance gains are restored with the (simplified) piece of code below, in which I changed a few things:

  • a much larger array (100M elements)
  • using blockDim.x instead of passing the block width as a kernel argument
  • a kernel configuration better suited to my machine (GTX 980)
  • allocating the input array on the heap instead of the stack (to allow more than 1M elements)

Here is the code:

#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<time.h>
#include<conio.h>

#include<cuda_runtime.h>
#include<cuda_profiler_api.h> 
#include<device_functions.h>
#include<device_launch_parameters.h>

#define DOUBLE float    // change to double to time double precision

#define ITER 10
#define MAXI 100000000

// Grid-stride loop: each thread steps through the array in strides of the
// total thread count, so a small fixed grid still covers all MAXI elements.
__global__ void kernel(DOUBLE*a)
{
    for(int i = blockIdx.x * blockDim.x + threadIdx.x ; i < MAXI; i += blockDim.x * gridDim.x) 
    {
        a[i] *= a[i];   // squaring prevents the compiler from removing the multiply
    }
}

int main()
{
    DOUBLE* a = (DOUBLE*) malloc(MAXI*sizeof(DOUBLE));
    for(int i=0;i<MAXI;++i)
    {
        a[i]=(DOUBLE)1.0;
    }
    DOUBLE* d_a;
    cudaMalloc((void**)&d_a, sizeof(DOUBLE)*(MAXI));
    cudaMemcpy(d_a, a, sizeof(DOUBLE)*(MAXI), cudaMemcpyHostToDevice);

    clock_t start_clock, end_clock;
    double sec_clock;

    printf("[%d] start\n", __LINE__);
    start_clock = clock();

    for (int i = 1; i <= ITER; ++i){
        kernel <<< 32, 256>>> (d_a);
    }
    cudaDeviceSynchronize();

    end_clock = clock();
    cudaMemcpy(a, d_a, sizeof(DOUBLE)*(MAXI), cudaMemcpyDeviceToHost);
    sec_clock = (end_clock - start_clock) / (double)CLOCKS_PER_SEC;
    printf("[%d] %f/%d[s]\n", __LINE__, sec_clock, CLOCKS_PER_SEC);
    printf("[%d] end\n", __LINE__);

    cudaFree(d_a);
    free(a);
    return 0;
}

(You'll notice I allocate an array of length 100M to get measurable performance.)
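
As a side note (not part of the original answer; names like kernel_t and run are made up), the same comparison can be written with a template parameter instead of the #define, so both precisions are timed in a single run:

#include<stdio.h>
#include<time.h>
#include<cuda_runtime.h>

#define ITER 10
#define MAXI 100000000

// Same grid-stride squaring kernel as above, templated on the element type.
template<typename T>
__global__ void kernel_t(T* a)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < MAXI; i += blockDim.x * gridDim.x)
        a[i] *= a[i];
}

// Times ITER launches for one element type and returns the elapsed seconds.
template<typename T>
double run()
{
    T* d_a;
    cudaMalloc((void**)&d_a, sizeof(T) * MAXI);
    cudaMemset(d_a, 0, sizeof(T) * MAXI);   // contents don't matter for timing

    clock_t start = clock();
    for (int i = 0; i < ITER; ++i)
        kernel_t<T><<<32, 256>>>(d_a);
    cudaDeviceSynchronize();                // wait for all queued kernels
    clock_t end = clock();

    cudaFree(d_a);
    return (end - start) / (double)CLOCKS_PER_SEC;
}

int main()
{
    printf("float : %f s\n", run<float>());
    printf("double: %f s\n", run<double>());
    return 0;
}

On a memory-bound kernel like this one, the roughly 2x gap between float and double comes mostly from moving half as many bytes, as the comment below explains, rather than from the ratio of single- to double-precision units.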

Regis Portalez
    This code is memory-bound. In double precision, you have 64 DP units on your GTX 980, 1 FMA per cycle @ 1.216 GHz => 156 GFLOPS. The bandwidth reads 224 GB/s. This leads to 5.57 FLOPS per memory operation (with theoretical peak). Expressing anything on DP-flops for this card requires at least 6 FLOPS (well, 3 multiplies should do as FLOPS account for FMA) per memory operation, you have 1 here. You are measuring bandwidth. (Moreover, getting peak bandwidth is hard, hence you might need even more FLOPS). – Florent DUGUET May 04 '16 at 08:12
  • But the first part of the answer points out what really goes wrong: the optimization during compilation that removes the useless computation. :) – Taro May 04 '16 at 08:15
  • Thanks all, now I understand the optimization issue. With @Regis' code, float is indeed 2x faster than double, but what is the meaning of this loop? If I rewrite the kernel without the loop (keeping `int i = blockIdx.x * blockDim.x + threadIdx.x;`) and change ITER to 100000, there is no difference between double and float. A memory-bound problem..? Umm, I'm new to CUDA, I have to study more.. thanks. – stg May 04 '16 at 09:24
  • If you don't loop, you won't iterate through all the elements. The meaning of the loop is to iterate through all elements in a [coalesced way](http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory). – Regis Portalez May 04 '16 at 10:06
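
For illustration (not part of the original thread; the kernel name and launch numbers are made up), the no-loop variant stg describes looks like this, and it only works if the launch supplies enough blocks to cover the whole array:

#include<stdio.h>
#include<cuda_runtime.h>

#define MAXI 100000000

// One element per thread: no grid-stride loop, so the grid itself
// must be large enough to cover all MAXI elements.
__global__ void square_one_per_thread(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the last block may run past the end
        a[i] *= a[i];
}

int main()
{
    float* d_a;
    cudaMalloc((void**)&d_a, sizeof(float) * MAXI);
    cudaMemset(d_a, 0, sizeof(float) * MAXI);

    int threads = 256;
    int blocks = (MAXI + threads - 1) / threads;   // ~390625 blocks for 100M elements
    // Note: on Fermi (e.g. the Tesla C2075 in the question) gridDim.x is limited
    // to 65535 blocks, so a launch this big fails there; a grid-stride loop with
    // a small fixed grid such as <<<32, 256>>> sidesteps that limit.
    square_one_per_thread<<<blocks, threads>>>(d_a, MAXI);
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();

    cudaFree(d_a);
    return 0;
}

Either way, consecutive threads touch consecutive elements, so both versions access memory in the coalesced way the linked guide describes; the grid-stride loop simply decouples the grid size from the array size.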