Understanding work-items and work-groups

Question

Based on my previous question:

I'm still trying to copy an image (no practical reason, just to start with an easy one):

The image contains 200 * 300 == 60000 pixels.

The maximum number of work-items is 4100 according to CL_DEVICE_MAX_WORK_GROUP_SIZE.

kernel1:

std::string kernelCode =
            "void kernel copy(global const int* image, global int* result)"
            "{"
                "result[get_local_id(0) + get_group_id(0) * get_local_size(0)] = image[get_local_id(0) + get_group_id(0) * get_local_size(0)];"
            "}";

queue:

for (int offset = 0; offset < 30; ++offset)
        queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000));
queue.finish();

This segfaults. What's wrong?

With the last parameter cl::NDRange(20000) it doesn't, but gives back only part of the image.

I also don't understand why I can't use this kernel:

kernel2:

std::string kernelCode =
            "void kernel copy(global const int* image, global int* result)"
            "{"
                "result[get_global_id(0)] = image[get_global_id(0)];"
            "}";

Looking at the 31st slide of this presentation:

Why can't I just simply use the global_id?
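For reference, the two index formulas can be compared on the host. The sketch below is a plain C++ model, not OpenCL code: `WorkItem`, `index_kernel1`, `index_kernel2` and `formulas_agree` are hypothetical names made up for illustration. It assumes the OpenCL 1.1+ rule that the global ID in one dimension equals the global offset plus group ID times local size plus local ID.

```cpp
#include <cstddef>

// Host-side model of one work-item's IDs in a 1-D NDRange (illustrative only).
struct WorkItem {
    std::size_t group_id;
    std::size_t local_id;
    std::size_t local_size;
    std::size_t global_offset;
};

// Index computed by kernel1: local id + group id * local size.
std::size_t index_kernel1(const WorkItem& w) {
    return w.local_id + w.group_id * w.local_size;
}

// Index computed by kernel2: models get_global_id(0), which also adds the offset.
std::size_t index_kernel2(const WorkItem& w) {
    return w.global_offset + w.group_id * w.local_size + w.local_id;
}

// True when both formulas give the same index for every work-item of the range.
// Assumes global_size is a multiple of local_size.
bool formulas_agree(std::size_t global_size, std::size_t local_size, std::size_t offset) {
    if (local_size == 0) return false;
    const std::size_t groups = global_size / local_size;
    for (std::size_t g = 0; g < groups; ++g) {
        for (std::size_t l = 0; l < local_size; ++l) {
            const WorkItem w{g, l, local_size, offset};
            if (index_kernel1(w) != index_kernel2(w)) return false;
        }
    }
    return true;
}
```

With a zero offset the two kernels touch exactly the same elements, so `get_global_id(0)` alone is enough. They only differ when a non-zero global offset is passed to the launch.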

EDIT1

Platform: AMD Accelerated Parallel Processing

Device: AMD Athlon(tm) II P320 Dual-Core Processor

EDIT2

The result based on huseyin tugrul buyukisik's answer:

enter image description here

EDIT3

With the last parameter cl::NDRange(20000):

enter image description here

In both cases the kernel is the first one (kernel1).

EDIT4

std::string kernelCode =
                "void kernel copy(global const int* image, global int* result)"
                "{"
                    "result[get_global_id(0)] = image[get_global_id(0)];"
                "}";
//...
cl_int err;
    err = queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(59904), cl::NDRange(128));

    if (err == 0)
        qDebug() << "success";
    else
    {
        qDebug() << err;
        exit(1);
    }

Prints success.

Maybe this is wrong?

int size = _originalImage.width() * _originalImage.height();
int* result = new int[size];
//...
cl::Buffer resultBuffer(context, CL_MEM_READ_WRITE, size);
//...
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, size, result);

The culprit was the buffer size. The corrected calls:

cl::Buffer imageBuffer(context, CL_MEM_USE_HOST_PTR, sizeof(int) * size, _originalImage.bits());
cl::Buffer resultBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * size);
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(int) * size, result);

I had passed size (an element count) where a byte count, sizeof(int) * size, was required.
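A quick way to see why the undersized buffer showed only part of the image: a minimal sketch, assuming 32-bit pixels (`std::int32_t` stands in for the `int` used above, and `pixels_that_fit` is a hypothetical helper name).

```cpp
#include <cstddef>
#include <cstdint>

// Number of whole 32-bit pixels that fit into a buffer of `bytes` bytes.
constexpr std::size_t pixels_that_fit(std::size_t bytes) {
    return bytes / sizeof(std::int32_t);
}
```

A buffer created with a size of 60000 bytes holds only 15000 such pixels, a quarter of the image, so a launch over all 60000 work-items writes well past its end.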

You can use `get_global_id(0)`. The pattern in your upper kernel is a CUDA-ism, since CUDA does not provide the equivalent of `get_global_id(0)`. — Krzysztof Kosiński, Feb 07 '14 at 14:34
The picture has two pictures inside. Which one is of my answer? — huseyin tugrul buyukisik, Feb 07 '14 at 14:56
@huseyintugrulbuyukisik The left one is the one I'm trying to copy, the right one is the result, based on your first solution. — otisonoza, Feb 07 '14 at 14:57
@otisonoza can you try 59904 as global range and 0 as reference point and 128 as local range please? Try like: cl_int err; err=cq.enqueueNDRangeKernel(kernelFunction,referenceRange,globalRange,localRange); if it is opencl 1.2 c++ bindings you are using. — huseyin tugrul buyukisik, Feb 07 '14 at 15:08
cl::Buffer resultBuffer(context, CL_MEM_READ_WRITE, size); size must be size of an integer (probably 4) multiplied by number of elements(60000 or 59000ish) so it must be about 240000 — huseyin tugrul buyukisik, Feb 07 '14 at 15:26

huseyin tugrul buyukisik · Accepted Answer · 2014-02-07T15:35:25.317

Edit 2:

Please try the kernel without the const qualifier (your CPU's OpenCL implementation may not handle it well):

std::string kernelCode =
            "__kernel void copy(__global int* image, __global int* result)"
            "{"
                "result[get_global_id(0)] = image[get_global_id(0)];"
            "}";

You may also need to change the buffer flags.

Edit:

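You left out the '__' prefix in three places: before 'kernel' and before both 'global' qualifiers. The unprefixed spellings are also valid OpenCL C, but please try the prefixed ones to rule this out: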

std::string kernelCode =
            "__kernel void copy(__global const int* image, __global int* result)"
            "{"
                "result[get_global_id(0)] = image[get_global_id(0)];"
            "}";

There are 60000 elements in total, but each launch covers the global IDs from offset to offset + 60000 - 1, which runs past the end of the buffers and reads/writes memory that does not belong to them.
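The overrun can be checked with a few lines of host code. This is only a sketch of the arithmetic; `last_global_id` and `launch_in_bounds` are hypothetical helper names, not part of the OpenCL API.

```cpp
#include <cstddef>

// Largest global ID touched by a 1-D launch: offset + global_size - 1.
constexpr std::size_t last_global_id(std::size_t offset, std::size_t global_size) {
    return offset + global_size - 1;
}

// True if every global ID of the launch is a valid index into a buffer
// of `element_count` elements. An empty launch is treated as invalid.
constexpr bool launch_in_bounds(std::size_t offset,
                                std::size_t global_size,
                                std::size_t element_count) {
    return global_size > 0 && last_global_id(offset, global_size) < element_count;
}
```

For the loop in the question, the last iteration has offset 58000 and global size 60000, so the highest global ID is 117999, far beyond the last valid index of 59999.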

The usual form of the NDRange call in the OpenCL 1.2 C++ bindings is:

cl_int err;
err=cq.enqueueNDRangeKernel(kernelFunction,referenceRange,globalRange,localRange);

Then check err for the actual error code. 0 (CL_SUCCESS) means success.
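A raw number like -54 is hard to read, so a small lookup can help while debugging. The sketch below is a hypothetical helper (`describe_cl_error` is not part of OpenCL); it uses plain integers so it compiles without the OpenCL headers, and it lists only a handful of codes that tend to show up around buffer creation and kernel launches.

```cpp
#include <string>

// Maps a few OpenCL status codes to their names from cl.h.
// Deliberately incomplete: anything not listed falls through to the default.
std::string describe_cl_error(int err) {
    switch (err) {
        case 0:   return "CL_SUCCESS";
        case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -30: return "CL_INVALID_VALUE";
        case -38: return "CL_INVALID_MEM_OBJECT";
        case -52: return "CL_INVALID_KERNEL_ARGS";
        case -54: return "CL_INVALID_WORK_GROUP_SIZE";
        case -55: return "CL_INVALID_WORK_ITEM_SIZE";
        case -56: return "CL_INVALID_GLOBAL_OFFSET";
        case -63: return "CL_INVALID_GLOBAL_WORK_SIZE";
        default:  return "unlisted OpenCL error " + std::to_string(err);
    }
}
```

In real code you would pass the `cl_int` returned by `enqueueNDRangeKernel` straight into such a function and log the result.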

If you want to split the work into smaller parts, limit the range of each launch to 60000/N.

If you split it into 30 parts, then:

for (int offset = 0; offset < 30; ++offset)
        queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(offset * 2000), cl::NDRange(60000/30));
queue.finish();
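The chunking can be written so that it also copes with totals that do not divide evenly. This is a sketch in plain C++; `Chunk` and `split_range` are hypothetical names, and handing the remainder to the first chunks is just one possible choice.

```cpp
#include <cstddef>
#include <vector>

// One launch: global offset and number of work-items.
struct Chunk {
    std::size_t offset;
    std::size_t size;
};

// Splits [0, total) into at most `parts` contiguous, non-overlapping chunks.
// When total is not divisible by parts, the first (total % parts) chunks
// get one extra element each. Empty chunks are skipped.
std::vector<Chunk> split_range(std::size_t total, std::size_t parts) {
    std::vector<Chunk> chunks;
    if (parts == 0) return chunks;
    const std::size_t base = total / parts;
    const std::size_t extra = total % parts;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t size = base + (i < extra ? 1 : 0);
        if (size == 0) continue;
        chunks.push_back({offset, size});
        offset += size;
    }
    return chunks;
}
```

Each chunk's `offset` and `size` would then go into the offset and global-range arguments of one `enqueueNDRangeKernel` call. For 60000 elements and 30 parts this gives offsets 0, 2000, ..., 58000 with 2000 work-items each.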

Also double-check the size of each buffer, e.g. sizeof(cl_int) * arrElementNumber,

because buffer sizes are given in bytes and the host's int may not be the same size as the device's integer type. You need 60000 elements? Then you need to pass 240000 bytes as the size when creating the buffer.

For portability, check the size of an integer before creating buffers if you plan to run this code on another machine.
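One way to avoid hard-coding the 4 is to let the compiler compute the byte count from the element type. A minimal sketch: `buffer_bytes` is a hypothetical helper, and `std::int32_t` is used as a stand-in for `cl_int`, which the OpenCL headers define as a 32-bit signed integer.

```cpp
#include <cstddef>
#include <cstdint>

// Byte size of a buffer holding `element_count` values of type T.
template <typename T>
constexpr std::size_t buffer_bytes(std::size_t element_count) {
    return element_count * sizeof(T);
}
```

For 60000 elements this yields 240000 bytes for 32-bit integers, 480000 for 64-bit integers and 60000 for unsigned chars. The same value should be used both when creating the buffer and when reading it back.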

You may know this already, but I'll mention it anyway:

CL_DEVICE_MAX_WORK_GROUP_SIZE

is the maximum number of work-items in a single work-group, that is, the threads that can share local memory within a compute unit. It is not a limit on the total number of work-items, so you don't need to split your work because of it. OpenCL divides the global range into work-groups automatically: every work-item gets a global ID that is unique across the whole range and a local ID that is unique within its work-group. If CL_DEVICE_MAX_WORK_GROUP_SIZE is 4100, then up to 4100 work-items can share the same local variables in one work-group. You can process all 60000 elements in a single launch: multiple work-groups are created for this, and each group has a group ID.

  // this should work without a problem
  queue.enqueueNDRangeKernel(imgProcess, cl::NDRange(0), cl::NDRange(60000));
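If you do pass an explicit local size, OpenCL 1.x requires the global size to be a multiple of it. That is where a value like 59904 with a local size of 128 comes from. The sketch below shows the rounding in plain C++; the helper names are hypothetical and `multiple` must be non-zero.

```cpp
#include <cstddef>

// Largest multiple of `multiple` that is <= n (multiple must be non-zero).
constexpr std::size_t round_down_to_multiple(std::size_t n, std::size_t multiple) {
    return (n / multiple) * multiple;
}

// Smallest multiple of `multiple` that is >= n (multiple must be non-zero).
constexpr std::size_t round_up_to_multiple(std::size_t n, std::size_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Number of work-groups for a global size that is a multiple of the local size.
constexpr std::size_t group_count(std::size_t global_size, std::size_t local_size) {
    return global_size / local_size;
}
```

Rounding 60000 down to 59904 leaves the last 96 pixels uncopied. Rounding up to 60032 covers the whole image, but then the kernel needs a bounds check (or the buffers need padding) so that the extra 32 work-items do not touch memory past the end.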

If you have an AMD GPU or CPU and you are using MSVC, you can install CodeXL from AMD's site and choose System Info from the drop-down menu to see the relevant numbers.

Which device is yours? I couldn't find any device with a max work-group size of 4100! My CPU has 1024, my GPU has 256. Is it a Xeon Phi?

For example, the total number of work-items can be as large as 256*256 times the work-group size here.

CodeXL has other nice features, such as performance profiling and code tracing, if you need maximum performance or help with bug fixing.

Your first solution gives back just a few rows from the top. Your second one gives segfault. — otisonoza, Feb 07 '14 at 14:43
@otisonoza are you saying that you get always a segfault whenever you use 60000 as global size? What do you mean by "rows" ? — huseyin tugrul buyukisik, Feb 07 '14 at 14:46
*"You dont need to divide your work just for this."* - This may be the most important point here. The number of work-items is virtually infinitely large (and if you don't use local memory, you can just use a local work size of 0, to let OpenCL choose an appropriate one automatically) — Marco13, Feb 07 '14 at 14:48
@huseyintugrulbuyukisik With kernel2 and 2nd solution, yes it gives segfault. However, with only 20000, it runs fine, but again, I only get the first part of the image. — otisonoza, Feb 07 '14 at 15:07
@huseyintugrulbuyukisik Said three "_", but there are two in your code. :) 3 : invalid parameter, 2 : segfault. — otisonoza, Feb 07 '14 at 15:23
@otisonoza you should give 4 * 60000 as size for buffer creation because it is integer for each element and should be 4 bytes each so needs 240000 for ints 480000 for doubls or longs 60000 for char/unsigned chars. Also booleans are 1 byte represented for many situations. — huseyin tugrul buyukisik, Feb 07 '14 at 15:29
