2

I just started to use OpenCL via the PyOpenCL interface from Python. I tried to create a very simple "recurrent" program where the outcome of each loop in every kernel depends on the output of another kernel from the last loop-cycle, but I am running into synchronization problems:

__kernel void part1(__global float* a, __global float* c)
{
    unsigned int i = get_global_id(0);

    c[i] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    if (i < 9)
    {
        for(int t = 0; t < 2; t++){
            c[i] = c[i+1] + a[i];
            barrier(CLK_GLOBAL_MEM_FENCE);
       }
    }
}

The host application is

import pyopencl as cl
from numpy import *

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

#read in the OpenCL source file as a string
f = open('recurrent.cl', 'r')
fstr = "".join(f.readlines())

#create the program
program = cl.Program(ctx, fstr).build()

mf = cl.mem_flags

#initialize client side (CPU) arrays
a = array(range(10), dtype=float32)

#create OpenCL buffers
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

#execute program
program.part1(queue, a.shape, None, a_buf, dest_buf)
c = empty_like(a)
cl.enqueue_read_buffer(queue, dest_buf, c).wait()

print "a", a
print "c", c

The outcome is

a [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
c [  0.   1.   5.   3.   4.  18.  13.   7.   8.   0.]

As you can see, some of the outcome values are correct. E.g. the third position = 5 = 3 + 2 but e.g. the second position is 2 = 0 + 2. So the summation was over the outcome of other threads at different time-points despite the barrier being in place. I thought the barrier would make sure all threads have reached it and have written their outcome to the global memory?

It's probably something very simple and I would appreciate any hints and comments!

PS: I am running this on a Sandy Bridge CPU using the Intel SDK.

Ciro Santilli OurBigBook.com
  • 347,512
  • 102
  • 1,199
  • 985
blue2script
  • 153
  • 1
  • 8

2 Answers2

2

I think I have the answer now. The OpenCL code was actually completely fine. However, the barriers only kick in if all threads are in one workgroup. This has not been the case, which is easy to check by reading out the local_id using get_local_id(0) (as suggested by Huseyin). In my case the host created a workgroup for every thread - instead of putting all threads in one workgroup. Performance-wise that makes sense, compare

Questions about global and local work size

In our case, however, we need to make sure the data is synchronized between the threads so all of them should be in one workgroup. To this end we need to change the execution of program 1,

program.part1(queue, a.shape, None, a_buf, dest_buf)

The second argument refers to the global_size of the job (so the number of threads created), whereas the third seems to refer to the local_size, i.e. the number of threads per workgroup. Thus, this line should read

program.part1(queue, a.shape, a.shape, a_buf, dest_buf)

This creates a workgroup with all threads (but keep an eye on the maximum size of workers allowed in one workgroup!). Now, the code still doesn't work. The last problem is concerned with the barriers in the OpenCL code: the last thread with id = 10 does not see the barriers in the loop and so all threads are waiting for the last one to hit the barrier (though I wonder why that doesn't throw an exception?). So we simply need to reduce the total number of threads (to get rid of the last one),

program.part1(queue, (a.shape[0]-1,), (a.shape[0]-1,), a_buf, dest_buf)

That works! Learned some lessons in the process...

Thanks again to Huseyin! blue2script

Community
  • 1
  • 1
blue2script
  • 153
  • 1
  • 8
0

Edit: user blue2script was right, it was an issue of "barrier not being hit by all local threads". On top of that, barrier can't synchronize between compute units / workgroups.

My answer doesn't add anything nor solve any problem here. So don't see the if in below kernel functions. It's wrong.


Incomplete

 __kernel void part1(__global float* a, __global float* c)
 {
      unsigned int i = get_global_id(0);

      c[i] = 0;
      barrier(CLK_GLOBAL_MEM_FENCE);

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              c[i] = c[i+1] + a[i];//c[i+1] is neighbour thread's variable
                                   //and there is no guarantee that
                                   //which one(ith or (i+1)st) computes first
                                   //so you need to get a copy of c[] first
              barrier(CLK_GLOBAL_MEM_FENCE);//thats why this line is not helping
          }
      }
 }

Using global

 __kernel void part1(__global float* a, __global float* c,__global float* d)
 {
      unsigned int i = get_global_id(0);

      c[i] = 0;
      d[i]=c[i]; 
      barrier(CLK_GLOBAL_MEM_FENCE);

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              d[i] = c[i+1] + a[i];//it is guaranteed that no neighbour thread can
                                   //change this threads d[i] element before/after
                                   //execution
              barrier(CLK_GLOBAL_MEM_FENCE);
              c[i]=d[i];
              barrier(CLK_GLOBAL_MEM_FENCE);
          }
      }
      barrier(CLK_GLOBAL_MEM_FENCE);

 }

Using locals(for workgroup size is 256 and total work size is a multiple of that):

 __kernel void part1(__global float* a, __global float* c)
 {
      unsigned int i = get_global_id(0);
      unsigned int Li=get_local_id(0);
      __local d[256];
      c[i] = 0;
      barrier(CLK_GLOBAL_MEM_FENCE);
      d[Li]=c[i]; 
      barrier(CLK_LOCAL_MEM_FENCE);

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              d[Li] = c[i+1] + a[i];//it is guaranteed that no neighbour thread can
                                   //change this threads d[i] element before/after
                                   //execution

             barrier(CLK_LOCAL_MEM_FENCE);
             c[i]=d[Li]; //guaranteed they dont interfere each other
             barrier(CLK_LOCAL_MEM_FENCE);
          }
      }

 }

Workgroup:

enter image description here

Using private

 __kernel void part1(__global float* a, __global float* c)
 {
      unsigned int i = get_global_id(0);
      unsigned int Li=get_local_id(0);
      __private f1;
      c[i] = 0;

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              f1 = c[i+1] + a[i];

             barrier(CLK_GLOBAL_MEM_FENCE);
             c[i]=f1; //guaranteed they dont interfere each other
             barrier(CLK_GLOBAL_MEM_FENCE);
          }
      }

 }
huseyin tugrul buyukisik
  • 11,469
  • 4
  • 45
  • 97
  • Dear Huseyin, thanks for your answer! I do have two issues: first, compilating your code raises the exception "global variables cannot be allocated inside kernel code". And second, I don't see why the allocation of another global variable "d" would help us here - c is also global. Also, note that every kernel should access the output of the other kernels continuously in every cycle of the loop, meaning that at "time" t, kernel i should access the output of kernel i+1, c[i+1], from time t-1 (implying the kernel i+1 finished its calculation and has updated its value in the global buffer). – blue2script Jun 15 '13 at 20:52
  • Regarding your comment "there is no guarantee that which one(ith or (i+1)st) computes first": that is precisely the problem. I thought the barrier would take care of that, but apparently it doesn't. Anway, thanks again! – blue2script Jun 15 '13 at 20:54
  • When your ith thread tries to get c[i], that may be used by (i-1) st thread, which is before or after, you cannot know that. Yes, you are right about declaring d[]. How can we declare globals in a kernel? Lets search that. – huseyin tugrul buyukisik Jun 15 '13 at 20:56
  • Can you make it(d[]) argument of kernel please? Edited in the answer. – huseyin tugrul buyukisik Jun 15 '13 at 20:58
  • PS: The resulting buffer c should be c = [1,3,5,7,9,11,13,15,17,0] and the program is essentially a complicated way of writing c[:-1] = a[:-1] + a[1:]. – blue2script Jun 15 '13 at 21:00
  • You can make the same thing using __local variables easily but you need to know your workgroup size. 256? 512? 1024? Total work size must a multiple of this workgroup size. – huseyin tugrul buyukisik Jun 15 '13 at 21:00
  • Sorry for my noob question: what do you mean with it(d[])? Thanks for hanging on! – blue2script Jun 15 '13 at 21:02
  • If you need to use global, then you need to have another argument in the kernel paranthesis. __global d[]. If you need local optimization, then you may look at the end of answer. – huseyin tugrul buyukisik Jun 15 '13 at 21:05
  • Dear Huseyin, I compiled your code and I get - an array of zeros... for whatever reason. But anyway, I think we are moving away a bit from the real issue of communication between kernels. I tried to use a global buffer the access the intermediate results of the other kernels and ensuring synchronization by using barrier. Now, I wonder if the failure of the barrier has to do with the workgroups. Apparently, barrier only works within a workgroup - may it be that I create many workgroups but should - ideally - only create one with many workers? – blue2script Jun 15 '13 at 21:12
  • Did you try the latest local example i put? Changed c[i]=d[i] to c[i]=d[Li] at the end. – huseyin tugrul buyukisik Jun 15 '13 at 21:14
  • Which local work size are you using for you kernel? 256? 1024? – huseyin tugrul buyukisik Jun 15 '13 at 21:16
  • I tried your example and now I essentially get the same result as before (meaning my code): some values are correct, some others are not. The synchronization still doesn't work. I guess the global_id is the id of the workgroup whereas the local_id is the id of the worker in the workgroup? Because the local_id Li is always zero. I am not sure about the local work size - how do I find out? – blue2script Jun 15 '13 at 21:22
  • Local id is local id of worker(0,1,...,255), global id is global(0,1,...,N) What is your global+local size settings? – huseyin tugrul buyukisik Jun 15 '13 at 21:23
  • Ah, I start to understand - so every workgroup can have only a maximum number of worker (depending on the platform)? – blue2script Jun 15 '13 at 21:24
  • Exactly. Just changed the local example. Can you look at it? – huseyin tugrul buyukisik Jun 15 '13 at 21:26
  • Compiled your updated code - still the same result I fear (note the type of d should be defined). – blue2script Jun 15 '13 at 21:28
  • So the global id is actually calculated as global_id(d) = global_offset(d) + local_id(d) + group_id(d) * local_size(d). Now, my code throws get_local_size(0) = 1 - looks like very small workgroups? – blue2script Jun 15 '13 at 21:30
  • Maybe the answer is here: "As you have stated, barriers may only synchronize threads in the same workgroup. There is no way to synchronize different workgroups in a kernel." (http://stackoverflow.com/questions/6890302/barriers-in-opencl) But what is the alternative? – blue2script Jun 15 '13 at 21:34
  • 1
    I think I have the answer now: first, we have to expand the number of workers in one workgroup, second we have to make sure all threads hit the barriers (which the last one does not!). I'll post the answer as soon as my 8-hour moratorium of stackoverflow is over - i.e. tomorrow morning. Thanks again! – blue2script Jun 15 '13 at 22:18
  • The link you gave is resourceful. Sorry for the gap, you can use a private float variable to use as temporary variable to do c[i]=c[i+1]+a[i] thing with using f1=c[i+1]+a[i] then in synch barriers c[i]=f1 – huseyin tugrul buyukisik Jun 15 '13 at 22:33