
I seem to have hit a really strange edge case, possibly involving double-precision issues, and I want to know what's going on.

Inside an OpenCL Kernel I use:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__private int k = 2; // I need k to be an int because I want to use it as a counter
__private double s = 18;
__private double a = 1;

a = a/(double)k; // just to show that I cast k to double in place
a = k+1;
k = (int)a; // to show that I store k in a double buffer as an intermediate step
if ((k-1)==2)
{
//    k = 3;
    s = pow(s/(double)(k-1),0.5);
}

This gives s = 2.999[...]6. However, if I uncomment the k = 3 line, I get what I consider the correct result, s = 3. Why is that?

As a side note: the same behaviour does not occur when I use

s = sqrt(s/(double)(k-1))

Below is the full, minimal kernel and host code for PyOpenCL.

Kernel (Minimal.cl):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void init_z(__global double * buffer)
{
    __private int x = get_global_id(0);
    __private int y = get_global_id(1);
    //w,h
    __private int w_y = get_global_size(1);
    __private int address = x*w_y+y;
    //h,w
    __private double init = 3.0;
    buffer[address]=init;
}

__kernel void root(__global double * buffer)
{
    __private int x = get_global_id(0);
    __private int y = get_global_id(1);
    //w,h
    __private int w_y = get_global_size(1);
    __private int address = x*w_y+y;
    //h,w
    __private double value = 18;
    __private int k;
    __private double out;
    k = (int) buffer[address];
  //k = 3;  If this line is uncommented, the result will be exact.
    out = pow(value/(double)(k-1), 0.5);
    buffer[address] = out;
}

Host:

import pyopencl as cl
import numpy as np


platform = cl.get_platforms()[0]
devs = platform.get_devices()
device1 = devs[1]
h_buffer = np.empty((10,10)).astype(np.float64)
mf = cl.mem_flags
ctx = cl.Context([device1])
Queue1 = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
Queue2 = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
mf = cl.mem_flags
m_dic = {0:mf.READ_ONLY,1:mf.WRITE_ONLY,2:mf.READ_WRITE}


fi = open('Minimal.cl', 'r')
fstr = "".join(fi.readlines())
prg = cl.Program(ctx, fstr).build()
knl = prg.init_z
knl.set_scalar_arg_dtypes([None,])
knl_root = prg.root
knl_root.set_scalar_arg_dtypes([None,])



def f():
    d_buffer =  cl.Buffer(ctx,m_dic[2], int(10 * 10  * 8))
    knl.set_args(d_buffer)
    knl_root.set_args(d_buffer)
    a = cl.enqueue_nd_range_kernel(Queue2,knl,(10,10),None)
    b = cl.enqueue_nd_range_kernel(Queue2,knl_root,(10,10),None, wait_for = [a,])
    cl.enqueue_copy(Queue1,h_buffer,d_buffer,wait_for=[b,])
    return h_buffer
a = f()
a[0,0] # Getting the result on the host.

Edit: Because of some remaining confusion I am updating this question one more time. I understand that pow and sqrt do not have to return the same value for the same input. My question is why pow produces different output for the SAME input, depending on where that input comes from.

The binaries are on pastebin: k_explicit and k_read

printf("%a\n", out) prints 0x1.8p+1 with the k = 3 line and 0x1.7ffffffffffffp+1 when it is commented out.
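For completeness, a tiny standalone check in plain C (nothing OpenCL-specific; just a sketch) confirms that the second hex value is exactly one ulp below 3.0, i.e. nextafter(3.0, 0.0), which prints in decimal as 2.9999999999999996:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double exact = 3.0;
    double below = nextafter(3.0, 0.0); /* largest double smaller than 3.0 */

    printf("%a  %.17g\n", exact, exact); /* 0x1.8p+1              3 */
    printf("%a  %.17g\n", below, below); /* 0x1.7ffffffffffffp+1  2.9999999999999996 */
    return 0;
}

(Compile with -lm; the only point is that 2.999[...]6 and 3.0 are adjacent doubles.)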

Dschoni
  • R7-240 gives 3 for both situations. It could be that the device's compiler is doing optimizations, something like putting the number 3 directly there, since everything can be known at compile time – huseyin tugrul buyukisik Feb 23 '17 at 17:24
  • I don't see an error there. s = s+-epsilon everything is fine. That's how floating point works. – Kami Kaze Feb 23 '17 at 17:24
  • Actually, k is not known at compile time. Between kernel calls k is written to and read from a double buffer to implement a counter that is persistent across kernel calls. – Dschoni Feb 23 '17 at 17:36
  • @KamiKaze: That doesn't answer the question why for the same input the answer is different. – Dschoni Feb 23 '17 at 17:38
  • @Dschoni I tried with buffers this time (so no compile-time optimizations), again it prints 3 for both cases. Maybe something to do with rounding modes? https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/SELECT_ROUNDING_MODE.html – huseyin tugrul buyukisik Feb 23 '17 at 17:40
  • I'm on an Nvidia Titan X. Rounding modes could be possible. Is there any way to verify? – Dschoni Feb 23 '17 at 17:42
  • `pow` is much more general than `sqrt` and therefore almost certain to have worse accuracy when asked to do `sqrt`'s job. – zwol Feb 23 '17 at 17:47
  • Is [Use of floating point in the Linux kernel](http://stackoverflow.com/questions/13886338/use-of-floating-point-in-the-linux-kernel) relevant? – Weather Vane Feb 23 '17 at 17:48
  • @WeatherVane: I'm not entirely sure. Because context switching seems not to be the issue here. I'll try and write a self-contained kernel to deploy on different architectures to see and compare if hardware is an issue. – Dschoni Feb 23 '17 at 17:52
  • The general advice in that link seems to be: avoid using FPU in the kernel. – Weather Vane Feb 23 '17 at 18:05
  • @WeatherVane: Different concept of "kernel". The issues of using the FPU/vector registers in an OS kernel do not apply for CUDA/OpenCL kernels. – EOF Feb 23 '17 at 18:10
  • I added a completely self-contained example that shows the behaviour. – Dschoni Feb 23 '17 at 18:19
  • Could you look at the Nvidia control panel? There must be some setting near where you enable/disable fp64 cores – huseyin tugrul buyukisik Feb 23 '17 at 18:26
  • it says "Optimize for Compute Performance Off, On Windows 10, Maxwell GPUs and later. Offers significant improvement for some Compute applications. Care should be taken when turning this setting ON, as there can be unpredictable effects with some applications and graphics features." as in here: http://us.download.nvidia.com/Windows/378.49/378.49-nvidia-control-panel-quick-start-guide.pdf – huseyin tugrul buyukisik Feb 23 '17 at 18:30
  • I am on Ubuntu 16.04 LTS – Dschoni Feb 23 '17 at 18:38
  • "This leads me to s = 2.999[...]6" is not as useful as posting the code that printed it. Suggest `printf("%a\n", s);` At least then we would see _exactly_ what 2.999[...]6 is. Likely the same as `printf("%.16f\n", nextafter(3,0));` – chux - Reinstate Monica Feb 23 '17 at 19:29
  • @KamiKaze FP does not work to "s = s+-epsilon everything is fine", but to "s = s*(1 +/- epsilon) everything is fine". – chux - Reinstate Monica Feb 23 '17 at 19:34
  • @Dschoni Does `printf("%a\n", sqrt(9));` and `printf("%a\n", pow(9,0.5));` produce the same result? – chux - Reinstate Monica Feb 23 '17 at 19:36
  • `pow(double, double)` is allowed up to 16 ulp of error, so this is to be expected – kanna Feb 23 '17 at 20:42
  • Because of some strange printing issue of Nvidia (oh, how I hate them and their OpenCL policy...), I get my results in python on the host as a[0,0]. I'll add that to my question. – Dschoni Feb 24 '17 at 10:19
  • @chux: `printf("%a\n", out)` leads to `0x1.8p+1` with the `k=3` line and to `0x1.7ffffffffffffp+1` without it. – Dschoni Feb 24 '17 at 11:44
  • @Dschoni The answers are within 1 [ULP](https://en.wikipedia.org/wiki/Unit_in_the_last_place) of each other and your platform is simply using a lower performing `pow()`. Buy a better compiler/math library. – chux - Reinstate Monica Feb 24 '17 at 12:25
  • @chux AKA use an AMD implementation. ;) – Dschoni Feb 27 '17 at 09:34

2 Answers


Floating-point calculations are not exact, so using one algorithm (sqrt()) and a different one (pow()) with the same inputs cannot be expected to give bitwise-identical results. If both results are within ±epsilon of the mathematically true value, then both implementations are acceptable.

Typically, pow() is implemented in terms of ln() and exp() (with a multiplication in between), whereas sqrt() can use a much faster implementation (which probably involves halving the mantissa as a first step).
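As a rough illustration only (the actual device implementation of pow() is not known), evaluating x^0.5 via a logarithm and an exponential involves several roundings and can land one ulp away from the correctly rounded square root:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 18.0 / 2.0;                 /* the value the kernel feeds to pow() */

    double via_sqrt = sqrt(x);             /* correctly rounded per IEEE 754 */
    double via_pow  = exp2(0.5 * log2(x)); /* naive pow(x, 0.5): log, multiply, exp */

    printf("%a\n", via_sqrt);              /* 0x1.8p+1 */
    printf("%a\n", via_pow);               /* may be 0x1.7ffffffffffffp+1, depending on the libm */
    return 0;
}

Whether the last bit is off depends on the particular log/exp implementation, which is consistent with an explicit sqrt() (or a constant-folded pow() call) giving the exact 3.0 while the run-time pow() does not.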

Toby Speight
  • We forgot to ask him which digit changes. There is only `s = 2.999[...]6`, which could be the 10th digit. The Khronos site says half_powr computes with an ulp of 8192. – huseyin tugrul buyukisik Feb 23 '17 at 18:49
  • Ln is 4 ulp at most. https://www.khronos.org/registry/OpenCL/specs/opencl-2.1-environment.pdf – huseyin tugrul buyukisik Feb 23 '17 at 18:56
  • Again, this doesn't answer the question. In both cases, `k=3`. So with the same input to the same function (`pow`) I get two different results. Whereas with `sqrt` I get the same results. The difference only lies in where I get my `k` from. If I set it explicitly to 3, everything is fine. If I read in k from my array, it shows that behaviour. – Dschoni Feb 24 '17 at 10:23
  • @Dschoni You definitely don't know how evil an OpenCL driver can be. You'd better paste the compiled IL or ISA code. – BlueWanderer Feb 24 '17 at 10:34
  • I'll look into that and will deliver. So far I tried that on different devices and drivers and see the same behaviour everywhere. Although I didn't find the time to try on AMD devices yet ;) – Dschoni Feb 24 '17 at 10:38
  • @huseyintugrulbuyukisik it's 2.9999999999999996 (15 times 9) – Dschoni Feb 24 '17 at 11:09
  • @Dschoni 2.9999999999999996 should be fine, since a double carries about 16 decimal digits and pow is allowed several ulp of error. It's just that the GPU isn't giving the same result in both cases, while staying within the ulp limit for both – huseyin tugrul buyukisik Feb 24 '17 at 11:24
  • @huseyintugrulbuyukisik still, that doesn't answer why it behaves differently for the same input. I added the link to binaries. – Dschoni Feb 24 '17 at 11:29
  • _which probably involves halving the mantissa as a first step_ - Should that be the exponent? – Ed Heal Feb 24 '17 at 11:47
  • @Dschoni - with your `k=3`, you're probably not using the same implementation of `pow` - I would expect that your compiler is optimising the call down to a constant. There's nothing to force the compiler's internal implementation to be bit-identical with that in the run-time library, as long as they both produce results within the acceptable range of error. Maybe you shouldn't even have mentioned `sqrt()` if that wasn't the point of the question. – Toby Speight Feb 24 '17 at 11:50
  • My cheapest AMD card is working fine, so it must be an implementation-defined optimization. – huseyin tugrul buyukisik Feb 24 '17 at 11:52
  • You are right, maybe this is confusing people. However, as mentioned in my question, `sqrt()` doesn't have this problem, so I thought it was worth mentioning. – Dschoni Feb 24 '17 at 11:53
  • @huseyintugrulbuyukisik Can you confirm, that on your AMD card, my code is producing exactly 3.0 in both cases? – Dschoni Feb 24 '17 at 12:03
  • I tested it now both on my Xeon CPU and my AMD 'Hawaii' generation card. Both work exactly as expected and return exactly 3.0 in all cases. So definitely an implementation-defined optimization. Again... NVIDIA... – Dschoni Feb 24 '17 at 12:11
  • @Dschoni I'm trying convert_double_rte convert_double_rtz convert_double_rtn instead of (double) to be sure – huseyin tugrul buyukisik Feb 24 '17 at 12:23
  • @Dschoni exactly 3 for all cases. with explicit convert or simple casting, all same – huseyin tugrul buyukisik Feb 24 '17 at 12:26
  • @BlueWanderer: Just for the record, the binaries are up there on pastebin, if that gives any further insight. I can only say one is a lot longer than the other :) – Dschoni Feb 28 '17 at 14:27

So, after a lot of discussion, this does not seem to be a problem with my code. The behaviour appears to be hardware-dependent and so far is only reproducible on Nvidia hardware. I'm still interested in the "why" and "how", but for this case the question is answered.

Dschoni