I am using the following function to get the best global and local work sizes for my OpenCL application.

//maxWGSize == CL_KERNEL_WORK_GROUP_SIZE
//wgMultiple == CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
//compUnits == CL_DEVICE_MAX_COMPUTE_UNITS
//rems == max required work items

void MyOpenCL::getBestWGSize(cl_uint maxWGSize, cl_uint wgMultiple, cl_uint compUnits, cl_uint rems, size_t *gsize, size_t *lsize) const
{
    cl_uint cu = 1;
    if(wgMultiple <= rems)
    {
        // Grow the number of work-groups first, then the work-group size.
        while(true)
        {
            if(cu < compUnits)
            {
                cu++;
                if((wgMultiple * cu) > rems)
                {
                    cu--;
                    break;
                }
            }
            else if(wgMultiple < maxWGSize)
            {
                wgMultiple *= 2;
                if((wgMultiple * cu) > rems)
                {
                    wgMultiple /= 2;
                    break;
                }
            }
            else
            {
                cu++;
                if(((wgMultiple * cu) > rems) || (cu > 2 * compUnits))
                {
                    cu--;
                    break;
                }
            }
        }
    }
    else
    {
        // rems is smaller than the preferred multiple; start small and grow.
        wgMultiple = 2;
        while(true)
        {
            if(cu < compUnits)
            {
                cu++;
                if((wgMultiple * cu) > rems)
                {
                    cu--;
                    break;
                }
            }
            else
            {
                wgMultiple *= 2;
                if((wgMultiple * cu) > rems)
                {
                    wgMultiple /= 2;
                    break;
                }
                // Here wgMultiple * cu <= rems, so the difference is non-negative.
                cl_uint temp = rems - (wgMultiple * cu);
                if(temp <= 1)
                    break;
            }
        }
    }

    *gsize = wgMultiple * cu;
    *lsize = wgMultiple;
    if(rems < *gsize)
    {
        *gsize = rems;
        *lsize = rems;
    }
    if(cu != compUnits)
    {
        // Spread the same global size over more compute units if possible.
        while((cu * 2) <= compUnits)
        {
            cu *= 2;
            if(*lsize % 2 == 0)
                *lsize /= 2;
        }
    }
}

The algorithm is:

  1. Decide how many work-groups are required if the local size equals CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
  2. If more work items are still required, double the local size until it reaches CL_KERNEL_WORK_GROUP_SIZE

Any suggestions for improving the algorithm?

Some results that I am getting:

for GPU if max required work items == 99
maxWGSize    256 
wgMultiple   64 
compUnits    6 
rems     99 
*gsize   64 
*lsize   16 


for GPU if max required work items == 35
maxWGSize    256 
wgMultiple   4 
compUnits    6 
rems     35 
*gsize   24 
*lsize   4 

for GPU if max required work items == 57
maxWGSize    256 
wgMultiple   8 
compUnits    6 
rems     57 
*gsize   48 
*lsize   8 

for CPU if max required work items == 99
maxWGSize    1024 
wgMultiple   16 
compUnits    4 
rems     99 
*gsize   64 
*lsize   16 

for CPU if max required work items == 35
maxWGSize    1024 
wgMultiple   8 
compUnits    4 
rems     35 
*gsize   32 
*lsize   8

for CPU if max required work items == 57
maxWGSize    1024 
wgMultiple   8 
compUnits    4 
rems     57 
*gsize   32 
*lsize   8 
Cool_Coder
  • If you don't tell us what is the kernel you are running we cannot tell you how to select any parameter. In addition, if you don't need any specific WG size, use the default. Don't mess with the inner optimization algorithm since the chances are that you will get a worse value. – DarkZeros Apr 09 '14 at 16:38
  • @DarkZeros This is a follow-up question of http://stackoverflow.com/questions/22966890/opencl-kernel-performing-very-poor – Marco13 Apr 09 '14 at 16:39
  • Can you give some examples of inputs to the function and outputs? Is quite hard to guess having so many branches what are you trying to achieve. – DarkZeros Apr 09 '14 at 16:56
  • @DarkZeros the function is kernel independent, so I don't think that information is required. CL_KERNEL_WORK_GROUP_SIZE & CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE help us identify the optimum local WG size for that kernel. By multiplying this with the number of compute units we get the global WG size. – Cool_Coder Apr 09 '14 at 17:19
  • @DarkZeros I added the algorithm to give some understanding of my logic. – Cool_Coder Apr 09 '14 at 17:28
  • @DarkZeros I also added some of the results that I am getting for local and global work group size. – Cool_Coder Apr 09 '14 at 17:43
  • Still rather confusing. The `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE` is *not* kernel independent (as it is queried for a *kernel*!). And it is only a performance hint. The local work size should be a *multiple* of this value, and I cannot imagine that the `CL_KERNEL_WORK_GROUP_SIZE` will ever *not* be a multiple of this value. Additionally, the output (e.g. the last one) seems to say things like "For 57 items, the global size will be 32", but I guess I'm misinterpreting something here. – Marco13 Apr 09 '14 at 18:00
  • As I understand this code you have a global maximum value and you try to approach it? I really don't understand what you are doing, since I can't believe your global amount of WI is just 32 to 64. Who is defining your global amount of WI? Is this a manually inputted value? It is not possible for any kernel to take 5 seconds to execute 32 WI. I.e.: in the first case the GPU is telling you "I can run up to 256 WI per work group for this kernel but I would prefer you to use a multiple of 64". And your answer is "Run 64 WI, with a size of 16". That's rather inefficient and illogical as well. – DarkZeros Apr 09 '14 at 18:09
  • @Marco13 the values passed to the above function for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE & CL_KERNEL_WORK_GROUP_SIZE are specific to a kernel and these values are queried before my function. So my function is kernel independent as the kernel dependent data is passed to the function and not queried inside the function. As DarkZeros pointed out, I have a max requirement of work items specified by the variable rems. So if say my requirement is 57 work items then a global size of 32 is best for me because it is a multiple of 8 & uses all 4 compute units, so local size is 32/4 = 8. – Cool_Coder Apr 10 '14 at 14:48
  • @DarkZeros I have a 2D array on which I am performing the computation. For each call to clEnqueueNDRangeKernel I am computing 1 row. For the shown examples the size of array is 99x99. So for any row, in first kernel call with GPU I am computing 64 items as explained in previous comment. So remaining items for next call is 99 - 64 = 35. Of these 24 items are computed in GPU. Since the remaining items are very less they are computed serially on CPU like normal C++ execution. Now to execute 99 rows in this manner it is taking 120ms. – Cool_Coder Apr 10 '14 at 14:49
  • After all rows are computed then the entire process is repeated for 98 rows. Then for 97 rows and so on till the last row. To make it further clear I am reducing the 2D matrix to lower triangular matrix by Gauss Jordan elimination technique. The times I mentioned for comparison is for reducing the matrix to lower triangular matrix and not just for 64 WI. I hope this clears some confusion. – Cool_Coder Apr 10 '14 at 14:49
  • @Cool_Coder There are different approaches for computing a triangular matrix, but I'm pretty sure that the approach that you described is not the most elegant and certainly not the most efficient one (particularly when so few work items are used - imagine that you have maybe 500 or 1000 cores!). Maybe the source code of http://viennacl.sourceforge.net/ brings some inspiration (and maybe it already offers the desired functionality out of the box...) – Marco13 Apr 11 '14 at 22:47
  • @Marco13 Thanks for your suggestions! I will try to think differently now :) And thanks for showing ViennaCL. Looks like it already does what I am trying to do. But still for the learning experience I will continue my efforts... – Cool_Coder Apr 12 '14 at 05:50

1 Answer

Admittedly, I did not understand (and hardly tried to understand) what you are trying to compute there, because it looks overly complicated: Determining the best work-group size should hardly be related to the number of compute units, and it should not be necessary to compute it in such a complicated way.

As I said in the answer to the original question (and as confirmed by DarkZeros in his comment): as long as you don't use local memory etc., you can usually just pass null as the local work size, and OpenCL will choose it appropriately.

There may be some caveats, though. Depending on the global work size, the underlying OpenCL implementation may not be able to use a "good" local work-group size. For example: when the global work size is a prime number (that is larger than the maximum local work size), an OpenCL implementation may be forced to use a local work size of 1...

This can usually be circumvented by padding the data to be a multiple of a more appropriate local work size. First of all, this means that you have to modify your kernel so that it obeys the limits of the work size. In your kernel from the other question, you would have to add another parameter for the size, and check this accordingly:

__kernel void reduceURatios(
    __global myreal *coef, 
    __global myreal *row, 
    myreal ratio,
    int sizeOfArrays)  // Add this parameter
{
    size_t gid = get_global_id(0);
    if (gid >= sizeOfArrays)
    {
        return; // Don't access invalid elements
    }

    myreal pCoef = coef[gid];
    myreal pRow = row[gid];

    pCoef = pCoef - (pRow * ratio);
    coef[gid] = pCoef;
}

Then you have more freedom for choosing the global work size. The code from the current question involved the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, but this should hardly be relevant for such a trivial kernel on a standard GPU. In contrast to that, the CL_DEVICE_MAX_WORK_GROUP_SIZE would be a good choice for the local work size (as long as there is no other limitation imposed by the kernel itself, e.g. by register pressure - but this is also definitely not the case here).

So you could just use the CL_DEVICE_MAX_WORK_GROUP_SIZE as the basis for the computation of your global work size:

// As queried with CL_DEVICE_MAX_WORK_GROUP_SIZE
int maxWorkGroupSize = ...
int numWorkGroups = (n-1) / maxWorkGroupSize + 1;
int globalSizePadded = numWorkGroups * maxWorkGroupSize;

And then invoke your kernel with this (padded) global work size. The if-statement that you added in the kernel will make sure that the threads will not access invalid memory regions. And when you launch your kernel with this padded global size, and set the local size to null, it should automatically choose the CL_DEVICE_MAX_WORK_GROUP_SIZE as the local size (but of course, you could also specify it manually).

This might make the computation from the original question faster, but it's still unlikely that it will be faster than the CPU version...

Marco13
  • I updated the question so that you can understand it easily. Also I put NULL for local size as you suggested. Now I am getting the time of 4700ms instead of 5300ms. So there is some improvement but still the performance is pretty bad. Maybe because of memory latency as you mentioned... – Cool_Coder Apr 09 '14 at 17:48
  • @Cool_Coder Did you also try the padding, with a local work size of 256 (for your GPU) ? I'm just curious, and can not estimate how much it will bring in this case. – Marco13 Apr 09 '14 at 18:01
  • I think he is messing the key thing here, the difference between the global size, the group size, and the work items. Otherwise I don't understand that for a job that takes 5 seconds of GPU he is using only 16 to 64 global sizes. – DarkZeros Apr 09 '14 at 18:17
  • As mentioned in comments to the question, the required work items is <100 in my example. So it would not make sense to create 256 work items. This is exactly why the function is there. If the required work items is >256 then it will choose 256 or a suitable multiple. Also the 5 sec is not just for 64 work items, but for a series of kernel calls, as this is the only way I can compare the time with my multi-threaded non-OpenCL version. – Cool_Coder Apr 10 '14 at 14:57
  • If your process needs <100 WI, then you are wasting your time with OpenCL. Typical OpenCL usages are for some millions of WI. Your GPU will be idle all the time, it will probably have at least 256 WI per each compute unit. (so 100/2048 = 5% usage) – DarkZeros Apr 10 '14 at 17:07
  • @DarkZeros so lets say there are 1 million WI. The kernel being computed performs 1 read and 1 write from/to global memory. What amount of time do you estimate would it take just for the data transfer? I know its pretty vague to predict but I just need some number so that I can understand whether my application will scale or not. – Cool_Coder Apr 11 '14 at 13:51
  • @Cool_Coder You mentioned that you have a **series** of kernel calls (and that this is "the only way") - are you sure that these kernel calls cannot be *combined* into a single computation? Then, there *could* (!) be the chance to achieve a speedup (although it's hard to make any predictions until now...) – Marco13 Apr 11 '14 at 22:32
  • @Marco13 yes, my approach was inefficient in the first place. I designed the algorithm trying to keep minimal computation in a kernel. But maybe it's better to have some if conditions and compute with the max possible WI. Thanks for correcting my path! – Cool_Coder Apr 12 '14 at 05:52