
Update:

I've run this example on other systems. On an Intel i7-3630QM, Intel HD 4000 and Radeon HD 7630M, all results are the same. With an i7-4700MQ / 4800MQ, the CPU results differ depending on whether OpenCL or a 64-bit gcc is used rather than a 32-bit gcc. The cause is that the 64-bit gcc and OpenCL use SSE by default, while the 32-bit gcc uses 387 math; at least, the 64-bit gcc produces the same results as the 32-bit one when -mfpmath=387 is set. So I have to read a lot more and experiment with x86 floating point. Thank you all for your answers.


I've run the Lorenz system example from "Programming CUDA and OpenCL: A case study using modern C++ libraries" with an ensemble of ten systems on several OpenCL devices, and I am getting different results:

  1. Quadro K1100M (NVIDIA CUDA)

    R => x y z
    0.100000 => -0.000000 -0.000000 0.000000
    5.644444 => -3.519254 -3.519250 4.644452
    11.188890 => 5.212534 5.212530 10.188904
    16.733334 => 6.477303 6.477297 15.733333

    22.277779 => 3.178553 2.579687 17.946903
    27.822224 => 5.008720 7.753564 16.377680
    33.366669 => -13.381100 -15.252210 36.107887
    38.911114 => 4.256534 6.813675 23.838787
    44.455555 => -11.083726 0.691549 53.632290
    50.000000 => -8.624105 -15.728293 32.516193

  2. Intel(R) HD Graphics 4600 (Intel(R) OpenCL)

    R => x y z
    0.100000 => -0.000000 -0.000000 0.000000
    5.644444 => -3.519253 -3.519250 4.644451
    11.188890 => 5.212531 5.212538 10.188890
    16.733334 => 6.477320 6.477326 15.733339

    22.277779 => 7.246771 7.398651 20.735369
    27.822224 => -6.295782 -10.615027 14.646572
    33.366669 => -4.132523 -7.773201 14.292910
    38.911114 => 14.183139 19.582197 37.943520
    44.455555 => -3.129006 7.564254 45.736408
    50.000000 => -9.146419 -17.006729 32.976696

  3. Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz (Intel(R) OpenCL)

    R => x y z
    0.100000 => -0.000000 -0.000000 0.000000
    5.644444 => -3.519254 -3.519251 4.644453
    11.188890 => 5.212513 5.212507 10.188900
    16.733334 => 6.477303 6.477296 15.733332

    22.277779 => -8.295195 -8.198518 22.271002
    27.822224 => -4.329878 -4.022876 22.573458
    33.366669 => 9.702943 3.997370 38.659538
    38.911114 => 16.105495 14.401397 48.537579
    44.455555 => -12.551083 -9.239071 49.378693
    50.000000 => 7.377638 3.447747 47.542763

As you can see, the three devices agree on the values up to R=16.733334 and then start to diverge.

I have run the same region with plain odeint (without VexCL) and get results close to those of the OpenCL-on-CPU run:

Vanilla odeint:

R => x y z
16.733334 => 6.47731 6.47731 15.7333
22.277779 =>  -8.55303 -6.72512 24.7049
27.822224 => 3.88874 3.72254 21.8227

The example code can be found here: https://github.com/ddemidov/gpgpu_with_modern_cpp/blob/master/src/lorenz_ensemble/vexcl_lorenz_ensemble.cpp

I'm not sure what I am seeing here. Since the CPU results are so close to each other, it looks like an issue with the GPUs, but since I am an OpenCL newbie I need some pointers on how to find the underlying cause.

KindDragon
ergo
  • Looks like a rounding/precision issue. Have you verified that the data width of all used types (float, double etc.) is the same? – Stefan May 22 '14 at 11:42
  • I've run this with everything in 'float' for the Quadro, HD4600 and the CPU and with everything in 'double' for the Quadro and the CPU (the HD4600 has no double precision support). The outcome is the same. Shouldn't OpenCL FP types be IEEE754 compliant anyway? – ergo May 22 '14 at 12:23

2 Answers


You have to understand that GPUs have lower accuracy than CPUs. This is usual, since a GPU is designed for gaming, where exact values are not the design target.

Usually GPU accuracy is 32 bits, while CPUs internally compute with higher precision (64-bit, or 80-bit extended on x87), even if the result is then truncated to 32-bit storage.


The operation you are running is heavily dependent on these small differences, so each device produces a different result. For example, this operation will also give very different results depending on accuracy:

a = 1/(b-c); // b = 1.00001,   c = 1.000020 -> a = -100000
a = 1/(b-c); // b = 1.0000098, c = 1.000021 -> a = -89285.71428

In your own results you can see the difference for each device, even for low R values:

5.644444 => -3.519254 -3.519250 4.644452
5.644444 => -3.519253 -3.519250 4.644451
5.644444 => -3.519254 -3.519251 4.644453

However, you state "for low values the results agree up to R=16, then start to diverge". Well, that depends: they are not exactly equal even at R=5.64.

DarkZeros
  • The GPUs are also running with double precision (64 bit) in this example. However, there is some truth in this answer, because FP operations on GPUs can give different results than on a CPU (e.g. http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf). And as the trajectory computed in this example is chaotic, small errors blow up exponentially and eventually lead to completely different results. – mariomulansky May 22 '14 at 12:31
  • The system should not be chaotic for R < 23 and you should have a fixed point. So, the results still look quite strange. At least it is not clear why they start to diverge at R=16.* – headmyshoulder May 22 '14 at 15:34
  • You can play around with this little demo to see what I mean: https://gist.github.com/headmyshoulder/8a97787d63da8b8ffa32 . So, the result should only diverge for R values larger than 24. – headmyshoulder May 22 '14 at 15:42
  • I fear that this answer is not correct. As stated above, the different values can only be explained for R>24. For smaller values the system has a fixed point and this point should be nearly independent of the floating point unit. – headmyshoulder May 22 '14 at 15:54
  • In your example code you run 100,000 iterations, each of them with multiple sums, multiplications, etc. I don't see why you state that it should give the same results. Even a 0.00001 difference can produce completely different results after so many operations. – DarkZeros May 22 '14 at 16:54
  • What you state is absolutely correct if your ODE is in a chaotic state. Then small deviations (for example due to different floating point units (FPUs)) will lead to completely different results. However, in the Lorenz system for R<24 you have a fixed point. This means that whenever the values of x are away from the fixed point, the system will slowly converge to that fixed point. The fixed point is independent of the FPU. Hence, one will observe the fixed point for all devices. The results of @ddemidov look consistent, maybe there is another small problem with the code of the OP. – headmyshoulder May 23 '14 at 06:00

I've created a stackoverflow-23805423 branch to test this. Below is the output for different devices. Note that the CPUs and the AMD GPU produce consistent results. The Nvidia GPUs are also consistent among themselves, but their results differ from the rest. This question seems to be related: IEEE-754 standard on NVIDIA GPU (sm_13)

```

1. Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz (Intel(R) OpenCL)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  9.392907e-01  1.679711e+00  1.455276e+01) (  5.351486e+00  1.051580e+01  9.403333e+00)
     6: ( -1.287673e+01 -2.096754e+01  2.790419e+01) ( -6.555650e-01 -2.142401e+00  2.721632e+01)
     8: (  2.711249e+00  2.540842e+00  3.259012e+01) ( -4.936437e+00  8.534876e-02  4.604861e+01)
}

1. Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz (AMD Accelerated Parallel Processing)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  9.392907e-01  1.679711e+00  1.455276e+01) (  5.351486e+00  1.051580e+01  9.403333e+00)
     6: ( -1.287673e+01 -2.096754e+01  2.790419e+01) ( -6.555650e-01 -2.142401e+00  2.721632e+01)
     8: (  2.711249e+00  2.540842e+00  3.259012e+01) ( -4.936437e+00  8.534876e-02  4.604861e+01)
}

1. Capeverde (AMD Accelerated Parallel Processing)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  9.392907e-01  1.679711e+00  1.455276e+01) (  5.351486e+00  1.051580e+01  9.403333e+00)
     6: ( -1.287673e+01 -2.096754e+01  2.790419e+01) ( -6.555650e-01 -2.142401e+00  2.721632e+01)
     8: (  2.711249e+00  2.540842e+00  3.259012e+01) ( -4.936437e+00  8.534876e-02  4.604861e+01)
}

1. Tesla C1060 (NVIDIA CUDA)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  7.636878e+00  2.252859e+00  2.964935e+01) (  1.373357e+01  8.995382e+00  3.998563e+01)
     6: (  7.163476e+00  8.802735e+00  2.839662e+01) ( -5.536365e+00 -5.997181e+00  3.191463e+01)
     8: ( -2.762679e+00 -5.167883e+00  2.324565e+01) (  2.776211e+00  4.734162e+00  2.949507e+01)
}

1. Tesla K20c (NVIDIA CUDA)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  7.636878e+00  2.252859e+00  2.964935e+01) (  1.373357e+01  8.995382e+00  3.998563e+01)
     6: (  7.163476e+00  8.802735e+00  2.839662e+01) ( -5.536365e+00 -5.997181e+00  3.191463e+01)
     8: ( -2.762679e+00 -5.167883e+00  2.324565e+01) (  2.776211e+00  4.734162e+00  2.949507e+01)
}

1. Tesla K40c (NVIDIA CUDA)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  7.636878e+00  2.252859e+00  2.964935e+01) (  1.373357e+01  8.995382e+00  3.998563e+01)
     6: (  7.163476e+00  8.802735e+00  2.839662e+01) ( -5.536365e+00 -5.997181e+00  3.191463e+01)
     8: ( -2.762679e+00 -5.167883e+00  2.324565e+01) (  2.776211e+00  4.734162e+00  2.949507e+01)
}

```

ddemidov