
I have some MATLAB code that uses several large MEX functions and I want to speed things up with OpenCL (I am replacing parts of the MEX functions' code with OpenCL code, using the OpenCL API). I've translated a small part of the code into an OpenCL kernel and I am already facing difficulties.

Some elements of the result matrix computed on the GPU differ from the corresponding elements of the result matrix produced by the original MEX function; the difference is less than 0.01. This leads to a small error in the final result, but I fear the error will accumulate as I translate more code.

This is probably related to the different precision of the calculations on the CPU and the GPU. Does anyone know how to ensure the same precision? I am running 64-bit MATLAB R2012b on Ubuntu 12.04. The hardware is an Intel Core 2 Duo E4700 and an NVIDIA GeForce GT 520.

2 Answers


The small differences between results on your CPU and GPU are easily explained as arising from differences in floating-point precision if you have moved your code from using double-precision (64-bit) f-p numbers on the CPU to using single-precision (32-bit) f-p numbers on the GPU.
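
To get a feel for how large that artefact can be, here is a minimal C sketch (the loop and values are made up purely for illustration and have nothing to do with your kernel) that runs the same accumulation once in float and once in double:

```c
#include <stdio.h>

/* Illustration only: the same naive sum carried out in float and in double.
 * Long accumulations are exactly where 32-bit and 64-bit results diverge. */
int main(void)
{
    float  sum_f = 0.0f;
    double sum_d = 0.0;

    for (int i = 1; i <= 1000000; ++i) {
        sum_f += 1.0f / (float)i;   /* single precision, as on the GPU */
        sum_d += 1.0  / (double)i;  /* double precision, as in the MEX code */
    }

    printf("float : %.10f\n", sum_f);
    printf("double: %.10f\n", sum_d);
    printf("diff  : %g\n", (double)sum_f - sum_d);
    return 0;
}
```

Neither total is the 'true' value of the sum; they are simply two differently-rounded approximations to it, which is the point made above.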

I would not call this difference an error; rather, it is an artefact of doing arithmetic on computers with floating-point numbers. The results you were getting from your CPU-only code were already different from any theoretically 'true' result. Much of the art of numerical computing is in keeping the differences between theoretical and actual computations small enough (whatever the heck that means) for the entire duration of a computation. It would take more time and space than I have now to expand on this, but surprises arising from a lack of understanding of what floating-point arithmetic is, and isn't, are a rich source of questions here on SO. Some of the answers to those questions are very illuminating. This one should get you started.

If you have taken care to use the same precision on both CPU and GPU, then the differences you report may be explained by the non-associativity of floating-point arithmetic: it is not guaranteed that (a+b)+c == a+(b+c). The order of operations matters; if you have any SIMD going on, I'd bet that the order of operations is not identical in the two implementations. Even if you haven't, what have you done to ensure that operations are ordered the same on both GPU and CPU?
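
Here is a three-line demonstration of that non-associativity; the values are hypothetical, chosen only to make the effect obvious:

```c
#include <stdio.h>

/* Floating-point addition is not associative: grouping matters. */
int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    printf("(a+b)+c = %g\n", (a + b) + c);   /* 1: the large terms cancel first  */
    printf("a+(b+c) = %g\n", a + (b + c));   /* 0: c is absorbed into b and lost */
    return 0;
}
```

A GPU reduction typically combines partial sums in a tree-like order while a serial CPU loop adds left to right, so even at identical precision the two can legitimately disagree in the last few digits.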

As to what you should do about it, that's rather up to you. You could (though I personally wouldn't recommend it) write your own routines for doing double-precision f-p arithmetic on the GPU. If you choose to do this, expect to wave goodbye to much of the speed-up that the GPU promises.
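
If you do choose that route, the usual building block is "double-single" arithmetic, in which one value is carried as an unevaluated sum of two floats. A sketch of the addition routine follows; the type and function names are my own, and a real implementation has to stop the compiler from re-associating these expressions (so avoid options like -cl-fast-relaxed-math):

```c
/* Double-single ("df64"-style) value: stored as an unevaluated sum hi + lo. */
typedef struct { float hi, lo; } ds_t;

/* Knuth's TwoSum: hi = round(a + b), lo = the exact rounding error. */
static ds_t two_sum(float a, float b)
{
    ds_t r;
    float s  = a + b;
    float bb = s - a;
    r.hi = s;
    r.lo = (a - (s - bb)) + (b - bb);
    return r;
}

/* Add two double-single values, renormalising so that |lo| stays small. */
static ds_t ds_add(ds_t a, ds_t b)
{
    ds_t s = two_sum(a.hi, b.hi);
    s.lo += a.lo + b.lo;
    return two_sum(s.hi, s.lo);
}
```

Before writing any of this, though, it is worth asking the device (via clGetDeviceInfo / CL_DEVICE_EXTENSIONS) whether it reports cl_khr_fp64; if it does, you can simply use double in the kernel, at a throughput cost, and skip the emulation entirely.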

A better course of action is to ensure that your single-precision software provides sufficient accuracy for your purposes. For example, in the world I work in, our original measurements from the environment are generally not accurate to more than about 3 significant figures, so any results that our codes produce have no validity beyond about 3 s-f. So if I can keep the errors in the 5th and lower s-fs, that's good enough.
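
A cheap way to see whether you are already inside such a tolerance is to compare the two result matrices by relative rather than absolute error; a sketch, with a hypothetical helper and made-up data standing in for your real matrices:

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical check: worst relative disagreement between the original
 * MEX (double) result and the GPU (float) result. */
static double max_rel_err(const double *cpu, const float *gpu, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double denom = fabs(cpu[i]) > 1e-30 ? fabs(cpu[i]) : 1.0;
        double rel   = fabs(cpu[i] - (double)gpu[i]) / denom;
        if (rel > worst) worst = rel;
    }
    return worst;
}

int main(void)
{
    double cpu[3] = { 1.234567, -0.5, 100.0 };       /* stand-in for the MEX output */
    float  gpu[3] = { 1.234570f, -0.5f, 100.001f };  /* stand-in for the GPU output */

    double err = max_rel_err(cpu, gpu, 3);
    /* -log10(relative error) ~ number of agreeing significant figures */
    printf("max relative error %g (~%.0f matching s-f)\n", err, -log10(err));
    return 0;
}
```

If the worst relative error sits comfortably below the precision of your input data, the single-precision port is already accurate enough in the sense described above.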

Unfortunately, from your point of view, getting enough accuracy from single-precision computations isn't necessarily guaranteed by globally replacing double with float and recompiling; you may (and generally would) need to implement different algorithms, ones which take more time to guarantee more accuracy and which do not drift so much as computations proceed. Again, you'll lose some of the speed advantage that GPUs promise.
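
As one concrete, well-known example of such an algorithm: compensated (Kahan) summation spends a few extra flops per element so that a long single-precision accumulation does not drift the way a naive loop does:

```c
#include <stddef.h>

/* Kahan (compensated) summation: the rounding error of each addition is
 * carried in 'c' and fed back in on the next step. Sketch only; note that
 * aggressive options such as -ffast-math or -cl-fast-relaxed-math may
 * optimise the compensation away. */
float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c   = 0.0f;              /* running compensation */
    for (size_t i = 0; i < n; ++i) {
        float y = x[i] - c;        /* corrected next term                  */
        float t = sum + y;         /* low-order bits of y may be lost here */
        c = (t - sum) - y;         /* ...and are recovered here            */
        sum = t;
    }
    return sum;
}
```

Run on the toy accumulation shown earlier, the compensated float total lands much closer to the double result than the naive one does.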

High Performance Mark

A common problem is that floating-point values are kept in an 80-bit x87 CPU register instead of being truncated and stored back to memory after each operation. In these cases, the additional precision leads to deviations. So check what options your compiler offers to counter such issues. It can also be interesting to compare release and debug builds.
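
On a Linux setup like the one in the question, GCC is the likely compiler; the flags in the comment below are real GCC options, while the function itself is purely hypothetical:

```c
/* Build-time mitigations for the 80-bit x87 register issue (GCC, x86):
 *   -mfpmath=sse -msse2   keep arithmetic in 64-bit SSE registers, not x87
 *   -ffloat-store         force intermediates to be stored (and so rounded)
 * A source-level equivalent is to force the store yourself: */
double scaled_add(double a, double b, double c)
{
    volatile double t = a * b;   /* the store to 't' discards any extra x87 bits */
    return t + c;
}
```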

Sam