Cuda programming cannot have the same computation accuracy compared with CPU program in terms of float type

Question

I try to use the GPU to accelerate my program which computes L2 distance between two float array. In order to check the computation accuracy, I write both CUDA program and CPU program. However, I found that the total error is more than 200, which I do not understand. I use float type in both cases and I believe that I should get the same result. My code is as the following.

#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>
#include <math.h>
// #include <helper_functions.h>

#define VECTORDIM 3


double cpuSecond()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double) tp.tv_sec + (double)tp.tv_usec*1e-6);
}

void DistanceCPU(float* array1, float* array2, int narray1, int narray2, float* output)
{
    float temp;
    for (int i = 0; i < narray1; i++)
    {
        for (int j = 0; j < narray2; j++)
        {
            temp = 0;
            for (int l = 0; l < VECTORDIM; l++)
            {
                temp += powf(array1[i + l * narray1] - array2[j + l * narray2], 2); 
            }
            output[i * narray2 + j] = temp;
        }
    }
}
__global__ void DistGPU(float* array1, float* array2, int narray1, int narray2, float* output)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    float temp;

    if (i < narray1)
    {
        for (int j = 0; j < narray2; j++)
        {
            temp = 0;
            temp += powf(array1[i] - array2[j], 2);
            temp += powf(array1[i + narray1] - array2[j + narray2], 2);
            temp += powf(array1[i + 2 * narray1] - array2[j + 2 * narray2], 2);
            output[i * narray2 + j] = temp;
        }
    }
}

int main()
{
    int narray1 = 7000;
    int narray2 = 60000;

    float* array1 = new float[narray1 * VECTORDIM];
    float* array2 = new float[narray2 * VECTORDIM];
    float* outputGPU = new float[narray1 * narray2];
    float* outputCPU = new float[narray1 * narray2];
    float* outputCPUTest = new float[narray1 * narray2];

    float* d_array1;
    float* d_array2;
    float* d_output;

    for (int i = 0; i < narray1 * VECTORDIM; i++)
    {
        array1[i] = static_cast<float> (rand() / (static_cast<float> (RAND_MAX / 10)));
        // std::cout << "Element " << i << " " << array1[i] << std::endl;
    }

    for (int i = 0; i < narray2 * VECTORDIM; i++)
    {
        array2[i] = static_cast<float> (rand() / (static_cast<float> (RAND_MAX / 10)));
    }

    cudaError_t err;

    err = cudaMalloc((void**)&d_array1, narray1 * VECTORDIM * sizeof(float));
    err = cudaMalloc((void**)&d_array2, narray2 * VECTORDIM * sizeof(float));
    err = cudaMalloc((void**)&d_output, narray1 * narray2 * sizeof(float));

    err = cudaMemcpy(d_array1, array1, narray1 * VECTORDIM * sizeof(float), cudaMemcpyHostToDevice);
    err = cudaMemcpy(d_array2, array2, narray2 * VECTORDIM * sizeof(float), cudaMemcpyHostToDevice);

    int threadsPerBlock = 512;
    int blocksPerGrid = (narray1 + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    double iStart = cpuSecond();
    DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
    double iElaps = cpuSecond() - iStart;

    err = cudaMemcpy(outputGPU, d_output, narray1 * narray2 * sizeof(float), cudaMemcpyDeviceToHost);

    printf("Total computation time is %lf \n" , iElaps);

    DistanceCPU(array1, array2, narray1, narray2, outputCPU);

    float error = 0;
    for (long i = 0; i < narray1 * narray2; i++)
    {
        error += abs(outputCPU[i] - outputGPU[i]);
    }
    error /= (narray2 * narray1);

    for (int i = 0; i < 20; i++)
    {
        printf("CPU result %f \n", outputCPU[i]);
        printf("GPU result %f \n", outputGPU[i]);
    }

    printf("Error is %f \n", error);
    delete [] array1;
    delete [] array2;
    delete [] outputCPU;
    delete [] outputGPU;
    return 0;
}

I try to print some computation results from both CPU and GPU. I get the following output.

CPU result 84.315201 
GPU result 84.315193 
CPU result 48.804039 
GPU result 48.804039 
CPU result 26.388403 
GPU result 26.388403 
CPU result 150.009735 
GPU result 150.009750

I believe that the float accuracy is enough and I do not know what the real problem is.

Possible duplicate of [Is floating point math broken?](https://stackoverflow.com/questions/588004/is-floating-point-math-broken) — Matthieu Brucher, Mar 14 '19 at 11:10
Different platforms, different compilers, different precisions (conformance to IEEE), different optimizations... — Matthieu Brucher, Mar 14 '19 at 11:11
@MatthieuBrucher I do not fully understand that. I use the same computer and the same compiler. — Sean, Mar 14 '19 at 11:16
@Sean Not the same compiler. Cuda code is compiled with `nvcc` while C is with gcc or visual studio. You are playing withing accuracy though, `eps(150.0f)` is around 1.5e-5. — Ander Biguri, Mar 14 '19 at 11:18
@AnderBiguri I use the nvcc to compile my code. I am not sure what you mean. I write the C code into cu file. — Sean, Mar 14 '19 at 11:19
@Sean That is not how it works. `nvcc` uses whichever C++ compiler you have for the CPU code, thats how it works. You are also running it in different hardware. Its the same computer, but the CPU code is running in a CPU with its hardware, and the CUDA code is running in a graphics card. You are executing the code physically in a different location — Ander Biguri, Mar 14 '19 at 11:20
@AnderBiguri I agree with you that I run the code based on different hardware. So what you mean is that the difference floating point standard with respect to CPU and GPU causes the problem. Is that right? — Sean, Mar 14 '19 at 11:24
@Sean Not really. Its not the standard, but the optimization the compiler may do. Ultimately all your code gets translated to assembly instructions, and as floating point arithmetic is not commutative, unless the final version of the CPU and GPU code execute instruction in the same order, they will be different. Different withing acceptable values, as I said, the precission of a float at 150 value is in the 5th decimal, lower than that is just "noise". The difference you see is just 1 bit in the binary representation, its as small as it can get. — Ander Biguri, Mar 14 '19 at 11:27
@AnderBiguri: Floating-point arithmetic is commutative, for operations where the corresponding mathematical operation is commutative. Perhaps you mean it is not associative. — Eric Postpischil, Mar 14 '19 at 12:35
@EricPostpischil I did indeed, just used the wrong word. Thanks for the correction ;) — Ander Biguri, Mar 14 '19 at 13:26
Is a relative error of 1e-7 actually significant enough to worry about? And anyone squaring numbers with `powf` has a *lot* to learn about how floating point arithmetic works — talonmies, Mar 14 '19 at 15:00
@Sean The reproducer code demonstrates that the results from the CPU and the GPU are *different*. I don't see anything that establishes which of those results is *more accurate*. — njuffa, Mar 14 '19 at 18:06
@talonmies Is getting an incorrect result enough to worry about? By definition, yes. — curiousguy, Jun 23 '19 at 18:21

Robert Crovella · Accepted Answer · 2019-03-14T18:37:47.640

I would say the principal contributor here is the use of the powf function. A particular math function definition on the GPU is not guaranteed to give the same accuracy as the same math function in CPU code. Whether that is a sufficient or even applicable description here, I cannot say, because we'd probably have to have a discussion around what CPU compiler you are using as well as compile switches/settings. The error possibilities for the GPU math functions are documented in the CUDA programming guide.

However, there's really not much point in my opinion to use pow or powf to square things, if what you're interested in is performance. I assume since you're asking a question about GPUs, you are interested in performance.

If we replace the use of the powf function with an ordinary squaring operation, the GPU results become much closer to the CPU results, from what I can see.

Results from running the code as is, on CUDA 10.0, Tesla P100, CentOS 7, gcc 4.8.5:

$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000038
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970345
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145151
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315193
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009750
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092339
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000001

Modified code, replacing powf with ordinary squaring:

$ cat t415.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>
#include <math.h>
// #include <helper_functions.h>

#define VECTORDIM 3
typedef float mt;

double cpuSecond()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double) tp.tv_sec + (double)tp.tv_usec*1e-6);
}

void DistanceCPU(mt* array1, mt* array2, int narray1, int narray2, mt* output)
{
    mt temp;
    for (int i = 0; i < narray1; i++)
    {
        for (int j = 0; j < narray2; j++)
        {
            temp = 0;
            for (int l = 0; l < VECTORDIM; l++)
            {
#ifndef USE_POW
                temp += (array1[i + l * narray1] - array2[j + l * narray2])*(array1[i + l * narray1] - array2[j + l * narray2]);
#else
                temp += powf(array1[i + l * narray1] - array2[j + l * narray2], 2);
#endif
            }
            output[i * narray2 + j] = temp;
        }
    }
}
__global__ void DistGPU(mt* array1, mt* array2, int narray1, int narray2, mt* output)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    mt temp;

    if (i < narray1)
    {
        for (int j = 0; j < narray2; j++)
        {
            temp = 0;
#ifndef USE_POW
            temp += (array1[i] - array2[j])*(array1[i] - array2[j]);
            temp += (array1[i + narray1] - array2[j + narray2])*(array1[i + narray1] - array2[j + narray2]);
            temp += (array1[i + 2 * narray1] - array2[j + 2 * narray2])*(array1[i + 2 * narray1] - array2[j + 2 * narray2]);
#else
            temp += powf(array1[i] - array2[j], 2);
            temp += powf(array1[i + narray1] - array2[j + narray2], 2);
            temp += powf(array1[i + 2 * narray1] - array2[j + 2 * narray2], 2);
#endif
            output[i * narray2 + j] = temp;
        }
    }
}

int main()
{
    int narray1 = 7000;
    int narray2 = 60000;

    mt* array1 = new mt[narray1 * VECTORDIM];
    mt* array2 = new mt[narray2 * VECTORDIM];
    mt* outputGPU = new mt[narray1 * narray2];
    mt* outputCPU = new mt[narray1 * narray2];
    mt* outputCPUTest = new mt[narray1 * narray2];

    mt* d_array1;
    mt* d_array2;
    mt* d_output;

    for (int i = 0; i < narray1 * VECTORDIM; i++)
    {
        array1[i] = static_cast<mt> (rand() / (static_cast<mt> (RAND_MAX / 10)));
        // std::cout << "Element " << i << " " << array1[i] << std::endl;
    }

    for (int i = 0; i < narray2 * VECTORDIM; i++)
    {
        array2[i] = static_cast<mt> (rand() / (static_cast<mt> (RAND_MAX / 10)));
    }

    cudaError_t err;

    err = cudaMalloc((void**)&d_array1, narray1 * VECTORDIM * sizeof(mt));
    err = cudaMalloc((void**)&d_array2, narray2 * VECTORDIM * sizeof(mt));
    err = cudaMalloc((void**)&d_output, narray1 * narray2 * sizeof(mt));

    err = cudaMemcpy(d_array1, array1, narray1 * VECTORDIM * sizeof(mt), cudaMemcpyHostToDevice);
    err = cudaMemcpy(d_array2, array2, narray2 * VECTORDIM * sizeof(mt), cudaMemcpyHostToDevice);

    int threadsPerBlock = 512;
    int blocksPerGrid = (narray1 + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    double iStart = cpuSecond();
    DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
    double iElaps = cpuSecond() - iStart;

    err = cudaMemcpy(outputGPU, d_output, narray1 * narray2 * sizeof(mt), cudaMemcpyDeviceToHost);

    printf("Total computation time is %lf \n" , iElaps);

    DistanceCPU(array1, array2, narray1, narray2, outputCPU);

    mt error = 0;
    for (long i = 0; i < narray1 * narray2; i++)
    {
        error += abs(outputCPU[i] - outputGPU[i]);
    }
    error /= (narray2 * narray1);

    for (int i = 0; i < 20; i++)
    {
        printf("CPU result %f \n", outputCPU[i]);
        printf("GPU result %f \n", outputGPU[i]);
    }

    printf("Error is %f \n", error);
    delete [] array1;
    delete [] array2;
    delete [] outputCPU;
    delete [] outputGPU;
    return 0;
}
$ nvcc -o t415 t415.cu
t415.cu(87): warning: variable "err" was set but never used

$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000042
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970348
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145149
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315201
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009735
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092331
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000000

Some notes:

There are still some differences which I haven't investigated. The GPU may do FMA contraction differently than CPU code. The next step in the analysis process would be to compare float vs. double computations, to establish a baseline of what number is closer to the correct result. There are situations where the GPU produces a number that is closer to the correct result than the corresponding CPU code, so simply making the assumption that the CPU code is correct and then asking for an explanation of why the GPU code is different is not always the correct approach. Here is an example of this kind of mistake.
If we consider the ordinary-squaring version, its not obvious to me that this code does or would need to have floating-point calculation ordering differences between CPU and GPU versions, therefore I don't think floating-point (lack of) associativity is a primary consideration here. However, I don't have a conclusive explanation to explain the remaining differences; more work would be needed (see previous item).
At least on the GPU, ordinary squaring is likely to be faster than powf( ,2)
Your timing measurement on the GPU code is only capturing the kernel launch overhead. Kernel launches are asynchronous. To capture the full kernel execution duration, add a cudaDeviceSynchronize(); call in the timing region, immediately after the kernel call.

EDIT: Courtesy of @njuffa, who reminded me it's easy to check the FMA contraction hypothesis, if we recompile the previously modified code with -fmad=false, then we observe (at least as far as the print-out goes) identical results between GPU and CPU. So this means FMA contraction (on the GPU side) is likely the final contributor to the few differences remaining in the previous section. As mentioned in the comment by njuffa, FMA contraction is likely to produce higher accuracy results, and so a possible explanation here is that the GPU results (with FMA contraction, as displayed previously) are probably more accurate than the CPU results. Again, switching to double-precision would help to confirm this. The code is already set up to make that easily possible with a typedef change. In any event, here is the output of the previous code (float, with use of ordinary squaring) with -fmad=false:

$ nvcc -o t415 t415.cu -fmad=false
t415.cu(87): warning: variable "err" was set but never used

$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000039
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970348
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145151
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315201
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009735
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092339
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000000

To check the FMA (fused multiply-add) contraction aspect, one can turn that off with the compiler switch `-fmad=false`. Note: This will usually make results *less* accurate, and it will often make code slower. — njuffa, Mar 14 '19 at 18:04
With -fmad=false the results are actually bitequal. Interestingly, the results are also bitequal when compiled with `-fmad=true -Xcompiler '-mfma -ffp-contract=fast -O3'`, though this'll depend on compiler inlining decisions. Is there even a way to enable non-assoc math for NVCC? There are [seemingly no flags](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#use-fast-math-use-fast-math) for it unlike for [clang](https://clang.llvm.org/docs/UsersManual.html#cmdoption-ffast-math). This [IEEE doc](https://docs.nvidia.com/cuda/floating-point/index.html) doesn't mention it either. — Simon Boehm, Jul 10 '23 at 17:23
Answering my own question: I asked an Nvidia engineer and read through the [nvcc](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html) docs: nvcc does not re-associate floating point math ops and there isn't even a flag to tell it to do so. — Simon Boehm, Jul 11 '23 at 20:52

Cuda programming cannot have the same computation accuracy compared with CPU program in terms of float type

1 Answers1