
As of now, my GPU is slower than my CPU when it comes to kernel execution time. I thought that, since I was testing with a small sample, the CPU finished faster because of lower startup overhead. However, when I tested the kernel with data almost 10 times the size of the sample, the CPU still finished faster and the GPU was more than 300 ms behind.

Runtime with 2.39 MB file: CPU 43.511 ms, GPU 65.219 ms

Runtime with 32.9 MB file: CPU 289.541 ms, GPU 605.400 ms

I tried using local memory, although I'm 100% sure I was using it wrong, and ran into two issues: either the kernel finishes anywhere between 1000 and 3000 ms (depending on what size I set for localWorkSize), or I get a status code of -5, which is CL_OUT_OF_RESOURCES.

Here is the kernel that a fellow SO member helped me out with.

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
    int globalId = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < 65; i++)
    {
        float tmp = 0;
        if (globalId + i > 63)
        {
            tmp = Array[i + globalId - 64] * coefficients[64 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}

This was my attempt at using local memory. The first bit is a snippet from the host code, and the portion after it is the kernel.

//Set the size of localMem
status |= clSetKernelArg(
    kernel,
    2,
    1024, //I had num_items * sizeof(float) but it gave me a -5. num_items is the number of elements in my array (around 1.2 million)
    NULL);
printf("Kernel Arg output status: %i \n", status);

//set a localWorkSize
localWorkSize[0] = 64;

//execute the kernel with localWorkSize included
status = clEnqueueNDRangeKernel(
    cmdQueue,
    kernel,
    1,
    NULL,
    globalWorkSize,
    localWorkSize,
    0,
    NULL,
    &someEvent);


Here is what I did to the kernel:
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output, __local float *localMem) {
    int globalId = get_global_id(0);
    int localId = get_local_id(0);

    //copy this work item's input element into local memory
    localMem[localId] = Array[globalId];

    float sum = 0.0f;
    for (int i = 0; i < 65; i++)
    {
        float tmp = 0;
        if (globalId + i > 63)
        {
            tmp = localMem[i + localId - 64] * coefficients[64 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}

Reference link I used when trying to set local variables: How do I use local memory in OpenCL?

Link used to find kernelWorkGroupSize (this is why I have 1024 set in the kernelArg): CL_OUT_OF_RESOURCES for 2 millions floats with 1GB VRAM?
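From those links, my understanding is that the size passed to clSetKernelArg for a __local argument should be the amount of local memory one work-group actually needs (for me that would be localWorkSize[0] * sizeof(float)) rather than a hard-coded 1024 or num_items * sizeof(float). Here is a sketch of what I think the call should look like (not sure this is right; I'm assuming the __local float* is kernel argument index 3, matching the kernel signature above):

//Sketch: give each work-group only as much local memory as its 64 work items use
localWorkSize[0] = 64;
status |= clSetKernelArg(
    kernel,
    3,                                  //index of the __local float *localMem parameter
    localWorkSize[0] * sizeof(float),   //64 floats = 256 bytes per work-group
    NULL);                              //no data pointer for __local arguments, just the size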

I've seen other people with similar problems where the GPU is slower than the CPU, but many of them were using clEnqueueKernel instead of clEnqueueNDRangeKernel.

Here's my previous question if you need more info on this kernel: Best approach to FIFO implementation in a kernel OpenCL

I also found some optimization tricks for GPUs: https://developer.amd.com/wordpress/media/2012/10/Optimizations-ImageConvolution1.pdf

Edit: updated code; the error still exists:

__kernel void lowpass2(__global float *Array, __global float *coefficients, __global float *Output) {
    int globalId = get_global_id(0);
    float sum = 0.0f;
    float tmp = 0.0f;
    for (int i = 64 - globalId; i < 65; i++)
    {
        tmp = Array[i] * coefficients[i];
        sum += tmp;
    }
    Output[globalId] = sum;
}

VedhaR
  • I'm pretty sure you *really* don't want an `if()`-statement in your inner `for`-loop. A smart compiler *may* be able to hoist the `if` out of the loop, but a gpu-driver *probably* doesn't have the time or the smarts to do this efficiently. – EOF May 30 '16 at 15:28
  • What problem/algorithm are you solving/implementing? – mfa May 30 '16 at 15:37
  • @EOF I am going to take a look into switch statements as an alternative to if(). – VedhaR May 30 '16 at 15:41
  • @mfa I am trying to implement a lowpass fir filter. The filter works great, it's just a matter of reducing the kernel execution time now. – VedhaR May 30 '16 at 15:41
  • @VedhaR: Why on earth would you need a switch-statement? That's probably even worse. Let's take a simple look at the conditional, shall we? `if (globalId+i > 63)` is quite obviously equivalent to `if (globalId+i >= 64)`, which is obviously equivalent to `if (i >= 64 - globalId)`, which is easily eliminated by changing the for-loop's initialization statement: `for (int i=64-globalId; i< 65; i++)`. – EOF May 30 '16 at 16:05
  • @EOF so I ran into a problem. I eliminated the if statement and changed the for loop as you suggested. Now the screen freezes and the display driver crashes. I'm sure it's the logic inside the loop that's causing the program to crash. Specifically, tmp=Array[i+globalId-64]*coefficients[64-i]; Not too sure how I have to alter the code to fit the for loop that you provided – VedhaR May 30 '16 at 18:46
  • @VedhaR I don't think I can help you with the information you provide. I believe the transformation is correct for reasonable bounds on `globalId`, so I can't see where the problem would originate. You also haven't posted the new code. Does the new code work correctly on the CPU implementation? If not, what does your debugger say? – EOF May 30 '16 at 18:52
  • @EOF If I run it on the CPU, it still crashes. The debugger tells me 'Unhandled exception at 0x003002E9 in 65elementmult.exe: 0xC0000005: Access violation reading location 0x0BF6BFFC.' This message is repeated for the rest of the messages with different address locations. If I run it on the GPU everything freezes and the display driver fails. I'll add the code that gives me this problem above. What other information would you need? In the loop I just replaced the indexes with i as I was trying to debug why it was crashing – VedhaR May 30 '16 at 19:11
  • @VedhaR: Seriously? Why exactly do you expect `tmp=Array[i]*coefficients[i];` to be equivalent to `tmp=Array[i+globalId-64]*coefficients[64-i];`? – EOF May 30 '16 at 19:19
  • @EOF I know they aren't equal. I originally had tmp=Array[i+globalId-64]*coefficients[64-i]; Afterwards I tried playing with the values of the indexes to see why that particular line breaks the code. That's why there are i variables in both Array and coefficients. – VedhaR May 30 '16 at 19:22
  • I added an answer to your original question. That seemed to be a better place for it. – mfa May 30 '16 at 20:32

2 Answers


Running the following kernel on 24-million-element arrays,

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
    int globalId = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < 65; i++)
    {
        float tmp = 0;
        if (globalId + i > 63)
        {
            tmp = Array[i + globalId - 64] * coefficients[64 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}

it completes in under 200 ms on a 25-compute-unit device pool but takes over 500 ms on an 8-core CPU.

Either you have a high-end CPU and a low-end GPU, or the GPU driver has been gimped, or the GPU's PCI-e interface is stuck at PCI-e 1.1 @ 4x bandwidth, so array copies between host and device are limited.

On the other hand, this optimized version:

__kernel void lowpass(__global __read_only float *Array,__constant  float *coefficients, __global __write_only float *Output) {

        int globalId = get_global_id(0); 
        float sum=0.0f;
        int min_i= max(64,globalId)-64;
        int max_i= min_i+65;
        for (int i=min_i; i< max_i; i++)
        {
            sum +=Array[i]*coefficients[globalId-i];    
        }
        Output[globalId]=sum;
}

runs in under 150 ms on the CPU (8 compute units) and under 80 ms on the GPU (25 compute units). The work per item is only 65 operations; this small amount of work is very easily accelerated using the __constant, __read_only and __write_only parameter specifiers and by reducing the integer work.

Using float4 instead of float for Array and Output should increase speed by about 80% on both your CPU and GPU, since both have SIMD/vector compute units.
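A rough sketch of what a float4 version could look like (only an assumption of how it might be written, not tested; it assumes the host prepends 64 zero samples to Array so the boundary check disappears, and that globalWorkSize is divided by 4 since each work item now produces 4 outputs):

__kernel void lowpass_float4(__global const float *ArrayPadded,  //original Array with 64 zeros prepended
                             __constant float *coefficients,
                             __global float *Output)
{
    int gid = get_global_id(0) * 4;      //first of the 4 outputs this work item computes
    float4 sum = (float4)(0.0f);
    for (int i = 0; i < 65; i++)
    {
        //vload4 handles the unaligned reads the sliding window needs
        float4 x = vload4(0, ArrayPadded + gid + i);
        sum += x * coefficients[64 - i];
    }
    vstore4(sum, 0, Output + gid);       //writes Output[gid .. gid+3]
}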

Bottlenecks of this kernel are:

  • Only 65 multiplications and 65 summations per thread.
  • The data still travels over the PCI-express interface, which is slow.
  • One conditional check (i < max_i) per float operation is also high; the loop needs unrolling (see the sketch after this list).
  • Everything is scalar, although your CPU and GPU are vector based.
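
For the unrolling point, a minimal sketch using only the loop from the optimized kernel above (#pragma unroll is a hint that the compiler is free to ignore):

    //ask the compiler to unroll the 65-iteration tap loop (65 = 5 * 13)
    #pragma unroll 5
    for (int i = min_i; i < max_i; i++)
    {
        sum += Array[i] * coefficients[globalId - i];
    }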

Generally:

  • Running a kernel for the first time triggers OpenCL's just-in-time compiler optimization, which is slow. Run it at least 5-10 times for exact timings (a host-side timing sketch follows this list).
  • __constant space is only 10 - 100 kB, but it's faster than __global and works well on AMD's HD5000 series.
  • Kernel launch overhead is around 100 microseconds, while 65 cache operations take less time than that, so they are shadowed by the kernel overhead (and, even worse, by PCI-e latency).
  • Too few work items lowers the occupancy ratio, which is slow.
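
A host-side sketch of what repeated timing could look like (hypothetical variable names; the command queue must have been created with CL_QUEUE_PROFILING_ENABLE):

//warm up the just-in-time compiler with the first runs, then read per-run timings from event profiling
cl_event evt;
for (int run = 0; run < 10; ++run)
{
    clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL,
                           globalWorkSize, localWorkSize, 0, NULL, &evt);
    clWaitForEvents(1, &evt);
    cl_ulong t_start, t_end;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
    printf("run %d: %.3f ms\n", run, (t_end - t_start) * 1e-6); //ignore the first few runs
    clReleaseEvent(evt);
}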

Also:

  • A 4-core Xeon @ 3 GHz is much faster than the 16 (1/4 of VLIW5) * 2 (compute units) = 32 cores of a GPU @ 600 MHz because of branch prediction, total cache bandwidth, instruction latency and the absence of PCI-e latency.
  • HD5000-series AMD cards are legacy, effectively gimped.
  • The HD5450 has 166 GB/s constant-memory bandwidth,
  • but only 83 GB/s LDS (local memory) bandwidth,
  • and 83 GB/s L1 and L2 cache bandwidth, so just let the driver optimize __global accesses instead of using LDS (for Array, of course), unless you plan on upgrading your computer. Maybe fetching odd elements from LDS and even elements from __global could give 83 + 83 = 166 GB/s combined bandwidth. You can try. Maybe two-by-two is better than alternating in terms of bank conflicts.
  • Using coefficients as __constant (166 GB/s) and Array as __global should give you 166 + 83 = 249 GB/s combined bandwidth.
  • Each coefficient element is used only once per thread, so I'm not suggesting private registers (499 GB/s).

huseyin tugrul buyukisik
  • I am using an Intel Xeon 3580 at 3.33 GHz (pretty sure it has 4 cores) and for the graphics card it's a Radeon 5450. I searched the compute units and apparently the Radeon has only 2 units. It's good to know the code isn't at fault here – VedhaR May 31 '16 at 12:49
  • I optimized the code to be 3x faster, but I'm not sure if it produces proper output. – huseyin tugrul buyukisik May 31 '16 at 13:04
  • Also, the HD5450 is a vector architecture and your kernel is scalar, so both the CPU and GPU are underutilized. You should change it to a vector version; I will try that at the same time. But vectorizing makes it much harder, and new GPU tech is scalar today. – huseyin tugrul buyukisik May 31 '16 at 13:08
  • Its constant memory is faster than its local memory, so coefficients should use constant memory instead of local memory. Maybe using both, by sharing the coefficients across both memories, gives the best result, but it wouldn't be worthwhile when PCI-e is the bottleneck. – huseyin tugrul buyukisik May 31 '16 at 14:45
  • So I tried testing the last kernel, the optimised one, against a wav file that I used as an input. After writing the output values to another wav file, I tried to listen to the filtered song and nothing was playing. However, when I set all the coefficients to 1/64 (instead of MATLAB-computed values for a specific cutoff frequency), the song plays but it's very quiet (not filtered, just the volume has been reduced). – VedhaR Jun 01 '16 at 17:40
  • Also, in the case where we took the if statement out, changing coefficients from global to constant also increases the execution time. I guess there are specific scenarios where each is useful? Or maybe it has to be accounted for in the kernel code (by this I mean you can't just simply change from __global to __constant; some other things need to change in the kernel code itself). – VedhaR Jun 01 '16 at 17:51
  • Is this with cpu or gpu? – huseyin tugrul buyukisik Jun 01 '16 at 20:09
  • I tried it with both and both produce the same output, no noise. However, GPU is faster at computing than the CPU with that kernel. – VedhaR Jun 02 '16 at 14:30
  • I made a mistake in max_i; the -64 is excessive, since there is already a -64 in min_i. CPU is actually 180 ms, but the GPU stays at the same timing because of PCI-e. – huseyin tugrul buyukisik Jun 02 '16 at 14:36
  • I'm also having the same problem as the answer posted below. After the 64th input value, the output is stuck at 0.999825 until the last 64 elements. Is this an issue with the kernel, or with the way I have implemented my host code? If it is the host code, I will make another SO question post – VedhaR Jun 03 '16 at 19:29
  • I used an input array consisting of all 1's. But the weird thing is as soon as I change my input array back to the wav file, the values seem to be changing – VedhaR Jun 03 '16 at 19:31
  • @VedhaR you mean, after kernel it changes input array? – huseyin tugrul buyukisik Jun 03 '16 at 20:24
  • Not the input array but the output, if I use an input array consisting of 1's I see that the first 65 elements are multiplied correctly (after the kernel execution). Then from the 65th element to (N-65th) element the values are repeated. And the last 65 elements are also multiplied correctly. But when the input array is changed to a wav file, everything works fine. Idk why it does that – VedhaR Jun 06 '16 at 12:23
  • Maybe coefficients[max_i-i] instead of coefficients[globalId-i] works? – huseyin tugrul buyukisik Jun 06 '16 at 17:48
  • Same issue, maybe it's just that specific use case. But it's okay though, it is working for the main cause. However, I don't understand how max_i and min_i work. Like if I was to explain it to someone, how would you put it in your own words? It's just that bit of the kernel that confuses me. – VedhaR Jun 07 '16 at 17:03
  • Actually never mind, I understand it. I just confused myself. Thanks again. – VedhaR Jun 07 '16 at 17:05

Before introducing local memory, let's first move the if statement out of the loop:

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) 
{
    int globalId = get_global_id(0);
    float sum = 0.0f;
    int start = 0;
    if (globalId < 64)
        start = 64 - globalId;
    for (int i = start; i < 65; i++)
        sum += Array[i + globalId - 64] * coefficients[64 - i];
    Output[globalId] = sum;
}

Then local memory could be introduced like this:

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) 
{
    int globalId = get_global_id(0);
    int local_id = get_local_id(0);

    __local float local_coefficients[65];
    __local float local_array[2*65];

    // each work item copies one coefficient; the first work item also copies the extra one
    local_coefficients[local_id] = coefficients[local_id];
    if(local_id == 0)
        local_coefficients[64] = coefficients[64];
    // the work group cooperatively copies its slice of Array into local memory
    for (int i=0; i< 2*65; i+=get_local_size(0))
    {
        if(i+local_id < 2*65)
            local_array[i+local_id] = Array[i+globalId];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    float sum=0.0f;
    int start = 0;
    if(globalId < 64)
        start = 64-globalId;
    for (int i=start; i< 65; i++)
        sum += local_array[i+local_id] * local_coefficients[64-i];
    Output[globalId]=sum;
}

P.S. There could be some mistakes there, like global-to-local index recalculations, etc. (I'm about to go to sleep now :) ). Nonetheless, the above implementation should point you in the right direction for how to start using local memory.

doqtor
  • Thanks for your answer! I can say that removing the if statement improved the kernel time by 150 ms. However, adding local memory pretty much made it jump to 900 ms (double what it was). But, using that last implementation you provided, I think I can make rave music now haha, it changed the song in the weirdest way possible. – VedhaR Jun 01 '16 at 17:46
  • But I get the idea: instead of using global memory to reference the coefficients, I can bring those values into local memory and use them that way (should be faster, and the coefficients don't change). However, what is the value of localId in this case? – VedhaR Jun 01 '16 at 17:56
  • In your example 'localWorkSize[0] = 64;' and I use the same. To copy from `__global` into `__local` buffer the 64 work items copy first 64 values (each work item copies one value as `__local` means the buffer is shared / visible to all work items), then first work item copies the last value. – doqtor Jun 01 '16 at 18:09
  • Ok cool, but what if I had more than one __local variable declaration, and each one had a different size, does get_local_id(x) return the value of the variables declared in the order they were initialised? x in this case is the 'xth' element that was initialised. Sorry that was a weird one to try and type out, let me know if it doesn't make sense. – VedhaR Jun 01 '16 at 18:16
  • [This](https://jorudolph.wordpress.com/2012/02/03/opencl-work-item-ids-globalgrouplocal/) blog post should clarify it for you. – doqtor Jun 01 '16 at 18:29
  • Btw I just noticed something weird. I decided to output the final results onto a text file, and I've noticed that the first dozen output values change fine but then it gets stuck at .99985 until the very last dozen lines, where the values change again. I'm pretty sure I had this issue before. – VedhaR Jun 03 '16 at 19:22