
I am just getting started with OpenCL and still learning.

Kernel Code:

__kernel void gpu_kernel(__global float* data)
{
    printf("workitem %d invoked\n", (int)get_global_id(0));
    int x = 0;
    if (get_global_id(0) == 1) {
        // Deliberate infinite loop: work-item 1 never finishes.
        while (x < 1) {
            x = 0;
        }
    }
    printf("workitem %d completed\n", (int)get_global_id(0));
}

Host C code for invoking the kernel (4 work-groups of 1 work-item each):

size_t global_item_size = 4; // number of workitems total
size_t local_item_size = 1; // number of workitems per group
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
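
(For context, here is a minimal sketch of the host-side setup assumed around this snippet. The function name, the choice of the first platform and a CPU device, and the omission of most error checking are my assumptions, not from the original post.)

#include <stdio.h>
#include <CL/cl.h>

/* Hypothetical host setup: first platform, CPU device, builds the
   kernel shown above and launches it. Error checking mostly omitted. */
int run_example(const char *kernel_source)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int ret;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &ret);
    cl_command_queue command_queue = clCreateCommandQueue(context, device, 0, &ret);

    cl_program program = clCreateProgramWithSource(context, 1, &kernel_source, NULL, &ret);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "gpu_kernel", &ret);

    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, 4 * sizeof(float), NULL, &ret);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    size_t global_item_size = 4; // number of work-items in total
    size_t local_item_size = 1;  // work-items per work-group
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                 &global_item_size, &local_item_size, 0, NULL, NULL);
    clFinish(command_queue); // blocks here forever if a work-item never finishes

    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);
    return (int)ret;
}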

Output:

workitem 3 invoked
workitem 3 completed
workitem 0 invoked
workitem 0 completed
workitem 1 invoked
workitem 2 invoked
workitem 2 completed

## Here the program hangs in the terminal, waiting for work-item #1 to finish, which will never happen

This suggests that all work-items run in parallel (each in a different work-group).

Another host C snippet for invoking the kernel (1 work-group with 4 work-items):

size_t global_item_size = 4; // number of workitems total
size_t local_item_size = 4; // number of workitems per group
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

Output:

workitem 0 invoked
workitem 0 completed
workitem 1 invoked
## Here the program hangs in the terminal, waiting for work-item #1 to finish, which will never happen

This suggests that the work-items run in sequence (that is why the first work-item completed, then execution got stuck on the second, and the rest were never executed).

My Question:

I need to invoke 1 work-group with 4 work-items that run in parallel, so that I can use a barrier in my code (which, I guess, is only possible within a single work-group). How can I do this?

Any help/suggestion/pointer will be appreciated.

Vishwadeep Singh

1 Answer


Your second host code snippet correctly launches a single work-group that contains 4 work-items. You have no guarantees that these work-items will run in parallel, since the hardware might not have the resources to do so. However, they will run concurrently, which is exactly what you need in order to be able to use work-group synchronisation constructs such as barriers. See this Stack Overflow question for a concise description of the difference between parallelism and concurrency. Essentially, the work-items in a work-group will make forward progress independently of each other, even if they aren't actually executing in parallel.

OpenCL 1.2 Specification (Section 3.2: Execution Model)

The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.

Based on your previous question on a similar topic, I assume you are using AMD's OpenCL implementation targeting the CPU. The way most OpenCL CPU implementations work is by serialising all work-items from a work-group into a single thread. This thread then executes each work-item in turn (ignoring vectorisation for the sake of argument), switching between them when they either finish or hit a barrier. This is how they achieve concurrent execution, and gives you all the guarantees you need in order to safely use barriers within your kernel. Parallel execution is achieved by having multiple work-groups (as in your first example), which will result in multiple threads executing on multiple cores (if available).
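
For instance, a launch like the following sketch (same kernel, hypothetical sizes) creates two work-groups of four work-items each, which a CPU implementation can map to two threads running on two cores, hardware permitting:

size_t global_item_size = 8; // 8 work-items in total...
size_t local_item_size = 4;  // ...split into 2 work-groups of 4
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, &local_item_size, 0, NULL, NULL);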

If you replaced your infinite loop with a barrier, you would clearly see that this does actually work.
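
As a sketch (the choice of CLK_GLOBAL_MEM_FENCE is my assumption, since the kernel only touches global memory), the modified kernel could look like this; with the single work-group launch you would then expect all four "invoked" lines to appear before any "completed" line:

__kernel void gpu_kernel(__global float* data)
{
    printf("workitem %d invoked\n", (int)get_global_id(0));
    // Every work-item in the work-group must reach this barrier
    // before any of them is allowed to continue past it.
    barrier(CLK_GLOBAL_MEM_FENCE);
    printf("workitem %d completed\n", (int)get_global_id(0));
}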

jprice
  • Thanks for your reply. You are correct, and yes, I am using AMD's OpenCL implementation targeting the CPU. My main goal is to use barriers. But a barrier will not work across multiple work-groups (that is my first example); I can only use a barrier in the case of a single work-group with multiple work-items. Replacing the infinite loop with a barrier clearly shows the use of a barrier, as you stated. But they are running "concurrently" (I can see the print statements as 0,1,2,3 in a fixed sequence, which is why I assume execution is sequential/concurrent). How can I make them parallel? – Vishwadeep Singh Apr 30 '14 at 08:16
  • You could also suggest some system properties I should check to confirm the reason (see the device-query sketch after these comments). – Vishwadeep Singh Apr 30 '14 at 08:17
  • @Vishwadeep You cannot force the work-items of a single work-group to execute in parallel. On some devices they might, but the specification only guarantees that they will execute concurrently. As I mentioned, on CPUs these work-items will be serialised into a single thread, which allows them to implement barriers efficiently (as opposed to communicating barrier status between different cores via memory, which would be slow). – jprice Apr 30 '14 at 08:24
  • That answers all my queries, @jprice, thanks. Another small question: if I am using a GeForce GT 630 with device type CL_DEVICE_TYPE_GPU, will the same logic apply to it as well? – Vishwadeep Singh Apr 30 '14 at 09:14
  • @Vishwadeep The concurrency guarantees that you have are the same regardless of the device. On that particular GPU, I would expect that groups of 32 work-items from the same work-group (known as 'warps' on NVIDIA hardware) will execute in *parallel*, not just concurrently. – jprice Apr 30 '14 at 09:16
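
Regarding the system properties mentioned in the comments, a small clGetDeviceInfo sketch such as the following can help. The print_device_limits helper and its name are hypothetical, and device is assumed to be the cl_device_id the context was created with; it reports how many compute units the device has (roughly, CPU cores) and how many work-items a single work-group may contain:

#include <stdio.h>
#include <CL/cl.h>

/* Hypothetical helper: query a couple of device limits relevant here. */
void print_device_limits(cl_device_id device)
{
    cl_uint compute_units;
    size_t max_group_size;

    // Number of parallel compute units (maps roughly to cores on a CPU device)
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    // Maximum number of work-items allowed in a single work-group
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_group_size), &max_group_size, NULL);

    printf("compute units: %u, max work-group size: %zu\n",
           compute_units, max_group_size);
}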