
Goal: Implement the diagram shown below in OpenCL. The main thing I need from the OpenCL kernel is to multiply the coefficient array with the temp array and then accumulate all of those values into one at the end. (That is probably the most time-intensive operation; parallelism would really help here.)

I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).

Description of the picture:

One at a time, the values are passed into an array (the temp array) which is the same size as the coefficient array. Every time a single value is passed into this array, the temp array is multiplied element-wise with the coefficient array in parallel, and the values at each index are then accumulated (summed) into one single element. This continues until the input array reaches its final element.

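In plain C, the computation I am attempting looks like this (a scalar reference with no parallelism, just to make the math concrete; numInputs stands for my input length):

//Each output is a sliding dot product: the fifo holds the last 58 inputs,
//and samples from before the start of the stream count as zero.
for (int n = 0; n < numInputs; n++) {
    float sum = 0.0f;
    for (int k = 0; k < 58; k++) {
        sum += coefficients[k] * ((n - k >= 0) ? Array[n - k] : 0.0f);
    }
    Output[n] = sum;
}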

What happens with my code?

For 60 elements from the input, it takes over 8000 ms, and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a much better way to do what I am attempting. Here is my code below.

Here are some things that I know are wrong with the code for sure. When I try to multiply the coefficient values with the temp array, it crashes. This is because of the global id. All I want that line to do is simply multiply the two arrays in parallel.

I tried to figure out why the fifo function was taking so long, so I started commenting lines out. First I commented out everything except the first for loop of the fifo function; that took 50 ms. When I uncommented the next loop, it jumped to 8000 ms. So the delay must have to do with the transfer of data.

Is there a register shift that I could use in OpenCL? Perhaps some logical shifting method for float arrays? (I know there is a '>>' operator, but that shifts bits within an integer, not elements within an array.)

float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];

float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {

//take array of 58 elements (or same size as number of coefficients)
//shift all elements to the right one
//bring next element into index 0 from input
//multiply the coefficient array with the array that's the same size as the coefficients, and accumulate
//store into one output value of the output array
//repeat till input array has reached the end

int globalId = get_global_id(0); 

float output = 0.0f;

//Shift everything down from 1 to 57
//takes about 50ms here
for(int i=1; i<58; i++){
    tempArrayForShift[i] = temp[i-1];
}

//Input the new value passed from main kernel. Rest of values were shifted over so element is written at index 0.
tempArrayForShift[0] = inputValue;
//Takes about 8000ms with this loop included
//Write values back into temp array
for(int i=0; i<58; i++){
    temp[i] = tempArrayForShift[i];
}

//all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
//I am 100% sure this line is crashing the program.
//multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

//Sum the elements of multipliedResult (coefficients * fifo buffer) into one value
for (int i = 0; i <  58; i ++) {
//  output = multipliedResult[i] + output;
}

//Return the summed value
return output;
}


__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) { 

//Initialize the temporary array values to 0
for (int i = 0; i <  58; i ++) {
    temp[i] = 0;
    tempArrayForShift[i] = 0;
    multipliedResult[i] = 0;
}

//fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
for (int i = 0; i <  60; i ++) {
    Output[i] = fifo(Array[i], coefficients, 58);
}

}

I have had this problem with OpenCL for a long time. I am not sure how to combine parallel and sequential instructions.

Another alternative I was thinking about

In the main cpp file, I was thinking of implementing the fifo buffer there and having the kernel do the multiplication and addition. But this would mean I would have to call the kernel 1000+ times in a loop (a rough sketch of what I mean is below). Would this be the better solution, or would it just be completely inefficient?
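Roughly, the host loop I have in mind would look like this (a sketch; shiftFifoOnHost is a hypothetical helper, and fifoBuf / multiplyAccumulateKernel stand in for my existing handles):

//Host-side fifo: shift on the CPU, then launch the kernel once per sample.
//Every iteration pays for a buffer write plus a kernel launch.
for (int n = 0; n < numInputs; n++) {
    shiftFifoOnHost(fifo, inputArray[n]); //hypothetical: shift right, insert at 0
    clEnqueueWriteBuffer(queue, fifoBuf, CL_TRUE, 0,
                         sizeof(float) * 58, fifo, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, multiplyAccumulateKernel, 1, NULL,
                           &globalSize, NULL, 0, NULL, NULL);
}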

VedhaR
  • a.sf0123456789abcde=a.s0123456789abcdef right shifts elements by 1 and puts last element to beginning element in a 16-wide vector (sketch below, after these comments). – huseyin tugrul buyukisik May 24 '16 at 21:10
  • Where did you get a.sf0123456789abcde from? How would I be able to use the same format that you provided to an array I do not know the inputs of (I know them I just can't guess them, they are random). – VedhaR May 25 '16 at 12:35
  • https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/shuffle.html. There is also indexed access to vector elements. – huseyin tugrul buyukisik May 25 '16 at 12:52
  • @huseyintugrulbuyukisik Would this work for my vector though? I have 58 elements in this example but the values could be larger based on the user input. – VedhaR May 25 '16 at 13:28
  • Even if opencl has in array shifting(I don't think so), simple gaming cards cannot do in-memory compute. You should download from array to vector, shift as many times needed, update array then. – huseyin tugrul buyukisik May 25 '16 at 13:51
  • If I understand correctly, your goal is to pass in 1.2M values and 1.2M coefficients? – mfa May 30 '16 at 17:15
  • 1.2M values and only 65 coefficients. There is a temporary buffer that takes 65 elements from the input array (1.2million values) and multiplies it with the elements in the coefficient array (constant 65 elements) – VedhaR May 30 '16 at 17:17
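For reference, the swizzle shift from the first comment looks like this in OpenCL C (a sketch; buf is a hypothetical pointer to at least 16 floats):

//Rotate a 16-wide vector right by one element, entirely in registers:
//result.s0 = a.sf, result.s1 = a.s0, ..., result.sf = a.se
float16 a = vload16(0, buf);
a = a.sf0123456789abcde;
vstore16(a, 0, buf);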

2 Answers


To get good performance out of a GPU, you need to parallelize your work across many threads. Your code uses just a single thread, and a GPU is very slow per thread but can be very fast if many threads run at the same time. Here you can use one thread per output value. You do not actually need to shift values through an array: for every output value a window of 58 values is considered, so you can just grab those values from memory, multiply them with the coefficients, and write back the result.

A simple implementation would be (launch with as many threads as output values):

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) 
{ 
    int globalId = get_global_id(0); 
    float sum = 0.0f;
    for (int i = 0; i < 58; i++)
    {
        float tmp = 0;
        //only read inputs that exist; earlier taps are zero-padded
        if (globalId + i > 56)
        {
            tmp = Array[i + globalId - 57] * coefficients[57 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}

This is not perfect, as the memory access pattern it generates is not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, as the values are reused several times. The operation you are trying to perform is called a (1D) convolution. NVIDIA has a 2D example called oclConvolutionSeparable in their GPU Computing SDK that shows an optimized version. You can adapt their convolutionRows kernel for a 1D convolution.
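To illustrate the tiling idea in 1D (a rough sketch under assumptions, not NVIDIA's code: one output per work item, and the host allocates tile with local_size + 57 floats):

#define TAPS 58

//Sketch: tiled 1D convolution. Each group stages its input window plus a
//(TAPS-1)-element left halo into local memory, so every input is fetched
//from global memory roughly once per group instead of up to 58 times.
__kernel void lowpass_tiled(__global const float *Array,
                            __constant float *coefficients,
                            __global float *Output,
                            __local float *tile)
{
    int gid  = get_global_id(0);
    int lid  = get_local_id(0);
    int lsz  = get_local_size(0);
    int base = get_group_id(0) * lsz - (TAPS - 1); //first input this group needs

    //cooperatively load the tile, zero-padding reads before index 0
    for (int i = lid; i < lsz + TAPS - 1; i += lsz) {
        int src = base + i;
        tile[i] = (src >= 0) ? Array[src] : 0.0f;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    //each work item reads its 58 taps from fast local memory
    float sum = 0.0f;
    for (int k = 0; k < TAPS; k++) {
        sum += coefficients[k] * tile[lid + (TAPS - 1) - k];
    }
    Output[gid] = sum;
}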

Jan Lucas
  • Thanks for your answer. I will look into creating multiple threads when performing the calculations. Although one thing that has always confused me is the globalId value. In this case, since there are 57 work items, globalId would execute in data-parallel fashion right? But if I had 1.2million elements, how would this work? Would that mean there are 1.2million work items and they execute at once as well? – VedhaR May 25 '16 at 17:07
  • The different work items are executed in parallel. However, no current GPU is big enough to execute 1.2 million work items in parallel. The GPU will execute as many work items in parallel as possible and then automatically switch to the next work item once a work item is finished, until all 1.2 million work items are finished. – Jan Lucas May 25 '16 at 17:43
  • Oh I see, so if I was to use 1.2 million elements as the limit instead of 58, the GPU would automatically switch at whatever its work item capacity is until the 1.2 million elements have been completed. – VedhaR May 25 '16 at 17:59
  • I still face the problem of implementing the fifo buffer though. It is kind of essential when it comes to what the program has to do, without the buffer the results will not be the same. I am not sure if I should create the fifo buffer in the main loop of the cpp file and just continue to shift and call the kernel until the input array is empty. Or if I should create it inside the kernel. – VedhaR May 25 '16 at 18:09
  • You do not have to actually implement the fifo buffer. You can easily figure out what would be in the FIFO buffer. Your FIFO buffer always contains a consecutive 58 element window into your Array. Only at the very beginning it is padded with zeros. Look at the for loop in the code. It loops through all elements of the FIFO. tmp contains the i-th element in the FIFO. Every thread just calculates one output value and reconstructed the state of the FIFO for the calculation of that output value. – Jan Lucas May 25 '16 at 18:39
  • Let me add one thing: – Jan Lucas May 25 '16 at 18:44
  • The idea of my kernel is not that you do the multiplication of the fifo elements with the coefficients in parallel. That is done sequentially within the same thread. Instead multiple output values are calculated in parallel. Your FIFO just looks like you first need to calculate one output value, before you can take the new fifo state and calculate the next one, but that is not actually true. If you want to calculate the k-th output of your algorithm, you can just reconstruct the FIFO state for that output from the input array, knowledge of the k-1 fifo state is not required. – Jan Lucas May 25 '16 at 18:55
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/112955/discussion-between-vedhar-and-jan-lucas). – VedhaR May 25 '16 at 18:59

Here's another kernel you can try out. There are a lot of synchronization points (barriers), but this should perform fairly well. The 65-item work group is not very optimal.

the steps:

setup, once per work item:

  1. init local values to 0
  2. copy coefficients to local variable

looping over the output elements to compute:

  3. shift existing elements (work items > 0 only)
  4. copy new element (work item 0 only)
  5. compute dot product
     5a. multiplication - one per work item
     5b. reduction loop to compute sum
  6. copy dot product to output (WI 0 only)
  7. final barrier

the code:

//outputSize (the number of outputs to produce) is passed from the host
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output, __local float *localArray, __local float *localSums, const int outputSize){

    int globalId = get_global_id(0);
    int localId = get_local_id(0);
    int localSize = get_local_size(0);

    //1  init local values to 0
    localArray[localId] = 0.0f;

    //2  copy coefficients to local
    //don't bother with this if __constant is working for you
    //requires another local to be passed in: localCoeff
    //localCoeff[localId] = coefficients[localId];

    //barrier for both steps 1 and 2
    barrier(CLK_LOCAL_MEM_FENCE);

    float tmp = 0.0f;
    for(int i = 0; i < outputSize; i++)
    {
        //3  shift elements (+barrier)
        if(localId > 0){
            tmp = localArray[localId - 1];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        localArray[localId] = tmp;

        //4  copy new element (work item 0 only, + barrier)
        if(localId == 0){
            localArray[0] = Array[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        //5  compute dot product
        //5a multiply + barrier
        localSums[localId] = localArray[localId] * coefficients[localId];
        barrier(CLK_LOCAL_MEM_FENCE);
        //5b reduction loop + barrier
        //the localId+j bound guards the non-power-of-2 group size (65)
        for(int j = 1; j < localSize; j <<= 1) {
            int mask = (j << 1) - 1;
            if ((localId & mask) == 0 && localId + j < localSize) {
                localSums[localId] += localSums[localId + j];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        //6 copy dot product (WI 0 only)
        if(localId == 0){
            Output[i] = localSums[0];
        }

        //7 barrier
        //only needed if there is more code after the loop.
        //the barrier in #3 covers this in the case where the loop continues
        //barrier(CLK_LOCAL_MEM_FENCE);
    }

}

What about more work groups?
This is slightly simplified to allow a single 1x65 work group to compute the entire 1.2M-element Output. To allow multiple work groups, you could divide the total output size by get_num_groups(0) to calculate the amount of work each group should do (workAmount), and adjust the i for-loop:

for (int i = workAmount * get_group_id(0); i < workAmount * (get_group_id(0) + 1); i++)

Step #1 must be changed as well, to initialize localArray to the correct starting state rather than all 0s.

    //1  init local values
    int start = workAmount * get_group_id(0);
    if(get_group_id(0) == 0){
        localArray[localId] = 0.0f;
    }else{
        //preload the FIFO with the inputs just before this group's range:
        //localArray[k] holds the sample k steps back in the stream
        //(assumes workAmount >= localSize so the index never goes negative)
        localArray[localId] = Array[start - 1 - localId];
    }

These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands, though. Play around with this; sometimes what seems optimal at a high level can be detrimental to the kernel when it is running.

Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of them are working is during the reduction loop in step 5b.

Disadvantages
The barriers will slow down the kernel just by the nature of what barriers do: they pause a work item until the others reach that point. Maybe there is a way to implement this with fewer barriers, but I still feel it is a good fit for the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size. Ideally you should try to use a power of 2, or a multiple of 64. This won't be a huge issue though, because the many barriers in the kernel make all the work items wait fairly regularly anyway.

mfa
  • Thanks for your detailed answer mfa. I would accept two answers if I could, really appreciate the response though! :) Although I need to look into barriers a bit more, my understanding of them is not the best. Although, when I pass variables in through my main cpp code, how would I declare these local variables? Do I just assign a random amount of space for it when I set kernel arguments? – VedhaR May 31 '16 at 14:10
  • Here is a good SO answer to that question: http://stackoverflow.com/a/8888861/1152356 You call setKernelArg using the correct size (sizeof(float)*65) and NULL as the value; see the sketch below. – mfa May 31 '16 at 15:29
  • Awesome. Thank you! – VedhaR May 31 '16 at 17:13
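A minimal host-side sketch of that call pattern (assuming the kernel above, with arrayBuf/coeffBuf/outputBuf as hypothetical cl_mem handles and outputSize added as the final argument):

//Pass NULL with a size to reserve __local memory for the two local args;
//the other arguments are set as usual.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &arrayBuf);   //Array
clSetKernelArg(kernel, 1, sizeof(cl_mem), &coeffBuf);   //coefficients
clSetKernelArg(kernel, 2, sizeof(cl_mem), &outputBuf);  //Output
clSetKernelArg(kernel, 3, sizeof(float) * 65, NULL);    //localArray
clSetKernelArg(kernel, 4, sizeof(float) * 65, NULL);    //localSums
clSetKernelArg(kernel, 5, sizeof(int), &outputSize);    //outputSize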