I'm implementing the Jacobi iteration method in OpenCL. When my kernel takes a matrix bigger than 200 in dimension, the floating-point values (cl_float, i.e. float) overflow.

On the host side (CPU), there are big-number and multiprecision libraries that might help with this issue; on the GPU side, however, implementing big-number and multiprecision arithmetic is not easy.

How do you solve this problem? Are there any clues?

My code for reference:

// Dot product of a slice of a matrix row (global memory) with the
// corresponding slice of the previous iterate (local memory).
inline float dotRow(__global const float * mat, __local const float * vec, uint col) {
    float ret = 0.0f;
    for(uint i = 0; i < col; ++i)
        ret += mat[i] * vec[i];

    return ret;
}//dotRow(mat, vec, col)

// NOTE: barriers only synchronize within one work-group, so this kernel
// must be launched with a single work-group (global size == local size),
// and outXOld must be allocated with at least `row` floats.
__kernel void jacobi(
        const uint itemsPerThread,
        uint maxIteration,
        __global int * cont,   // int rather than bool: bool has an
                               // implementation-defined size across the
                               // host/device boundary
        __global float * outX,
        __local float * outXOld,
        __global float * matA, __global float * vecB,
        uint row, uint col)
{
    uint itemOffset = get_global_id(0) * itemsPerThread;

    for(uint iter = 0; iter < maxIteration; ++iter) {
        // preserve the current outX so every work-item reads a consistent
        // snapshot of the previous iterate (Jacobi, not Gauss-Seidel)
        for(uint i = itemOffset; i < itemOffset + itemsPerThread && i < row; ++i)
            outXOld[i] = outX[i];

        // reset the convergence flag exactly once, then make both the flag
        // and outXOld visible to all work-items before anyone votes or reads
        if(get_global_id(0) == 0)
            *cont = 0;
        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

        bool contPrivate = false;
        // calculate one iteration
        for(uint i = itemOffset; i < itemOffset + itemsPerThread; ++i) {
            if(i >= row)
                break;

            // sums over j < i and j > i (skipping the diagonal), reading
            // only the preserved previous iterate outXOld
            float s1 = dotRow(matA+(i*col)+0,     outXOld+0,     i);
            float s2 = dotRow(matA+(i*col)+(i+1), outXOld+(i+1), col-(i+1));
            outX[i] = (vecB[i] - s1 - s2) / matA[i*col+i];

            if(fabs(outX[i] - outXOld[i]) > 1e-10f)
                contPrivate = true;
        }//for
        if(contPrivate)
            *cont = 1;

        // wait until every work-item has updated outX and voted on *cont
        barrier(CLK_GLOBAL_MEM_FENCE);

        // converged: every work-item sees the same flag and exits together
        if(*cont == 0) {
            break;
        }//if
    }//for
}

Thanks to @aland's hint about emulating more precision using floats. Here's a reference on this (in case it's useful for anyone): Emulate "double" using 2 "float"s
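For reference, here is a minimal sketch of that float-float trick in OpenCL C. This is my own illustration, not code from the linked answer; the names (ff, ff_two_sum, ff_two_prod, ff_add, dotRowFF) are all made up for this sketch. Each value is kept as an unevaluated sum of two floats, and error-free transformations capture the rounding error. Note this breaks under -cl-fast-relaxed-math, which lets the compiler reassociate away the error terms.

// A float-float value: .x holds the leading part, .y the rounding error,
// giving roughly 48 significand bits instead of float's 24.
typedef float2 ff;

// Knuth's TwoSum: the returned pair (s, e) satisfies s + e == a + b exactly.
inline ff ff_two_sum(float a, float b) {
    float s = a + b;
    float v = s - a;
    float e = (a - (s - v)) + (b - v);
    return (float2)(s, e);
}

// Error-free product via fma: the returned pair satisfies p + e == a * b exactly.
inline ff ff_two_prod(float a, float b) {
    float p = a * b;
    float e = fma(a, b, -p);
    return (float2)(p, e);
}

// Add two float-float values and renormalize.
inline ff ff_add(ff a, ff b) {
    ff s = ff_two_sum(a.x, b.x);
    s.y += a.y + b.y;
    return ff_two_sum(s.x, s.y);
}

// dotRow accumulating in float-float precision; only the leading part
// is returned at the end.
inline float dotRowFF(__global const float * mat,
                      __local const float * vec, uint col) {
    ff acc = (float2)(0.0f, 0.0f);
    for(uint i = 0; i < col; ++i)
        acc = ff_add(acc, ff_two_prod(mat[i], vec[i]));
    return acc.x + acc.y;
}

Carrying the pair all the way through the Jacobi update would preserve even more precision, at the cost of several times the flops per element.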

  • Using cl_double is not available on every GPU, so it might not be a good approach. – Adam Oct 27 '14 at 05:58
  • Several options: (1) fixed precision, if all the values are well within a specific range; (2) a software implementation of double precision (see DSFUN90 [here](http://crd-legacy.lbl.gov/~dhbailey/mpdist/) for Fortran examples; the porting to OpenCL seems straightforward); (3) while `cl_khr_fp64` is not available on *every* GPU, it's available on most GPUs that have decent performance (see the sketch after these comments). – aland Oct 27 '14 at 07:25
  • @aland double won't work for matrices bigger than 100,000 in dimension either. – Adam Oct 27 '14 at 10:19
  • You can divide the work into smaller batches for the GPU and do the final summation on the host. This still relies on the numbers staying reasonably small within your batches. If you detect an overflow looming, you can do all of the work on the host. Definitely check the GPU for double-precision support as well. GPUs aren't meant for all workloads. – mfa Oct 27 '14 at 13:22
  • @aland Thanks for the emulation hint. More info may be available here: http://stackoverflow.com/questions/6769881/emulate-double-using-2-floats – Adam Oct 27 '14 at 14:47
  • @mfa I wonder whether there is a good algorithm for multiprecision on GPU. – Adam Oct 27 '14 at 14:48
  • There are some good posts by Eric Bainville about multiprecision computing, both for CPU and GPU: http://www.bealto.com/mp-opencl_intro.html SO account: http://stackoverflow.com/users/80448/eric-bainville – mfa Oct 27 '14 at 15:00
  • @mfa Might multiprecision on GPU perform well? – Adam Oct 28 '14 at 08:12
  • I think it's probably worth a try. The number of compute units on a modest GPU should at least match the performance of multiprecision on a CPU. The memory bandwidth will also outperform what is available on most motherboards. – mfa Oct 28 '14 at 13:12
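To follow up on the `cl_khr_fp64` suggestion from the comments: a minimal host-side sketch (my own; device_has_fp64 is a hypothetical helper, and the cl_device_id is assumed to have been obtained elsewhere) that checks for the extension before building a double-precision variant of the kernel:

#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

/* Returns nonzero if the device advertises double-precision support. */
static int device_has_fp64(cl_device_id dev) {
    size_t len = 0;
    /* First query the length of the extensions string, then fetch it. */
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, NULL, &len);
    char *ext = (char *)malloc(len);
    if (!ext) return 0;
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, len, ext, NULL);
    int ok = strstr(ext, "cl_khr_fp64") != NULL;
    free(ext);
    return ok;
}

If it returns nonzero, the kernel source can start with `#pragma OPENCL EXTENSION cl_khr_fp64 : enable` and use double in place of float throughout; otherwise, fall back to the float-float emulation above or to the host.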
