I just started to use OpenCL via the PyOpenCL interface from Python. I tried to create a very simple "recurrent" program where the outcome of each loop in every kernel depends on the output of another kernel from the last loop-cycle, but I am running into synchronization problems:
__kernel void part1(__global float* a, __global float* c)
{
unsigned int i = get_global_id(0);
c[i] = 0;
barrier(CLK_GLOBAL_MEM_FENCE);
if (i < 9)
{
for(int t = 0; t < 2; t++){
c[i] = c[i+1] + a[i];
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
}
The host application is
import pyopencl as cl
from numpy import *
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
#read in the OpenCL source file as a string
f = open('recurrent.cl', 'r')
fstr = "".join(f.readlines())
#create the program
program = cl.Program(ctx, fstr).build()
mf = cl.mem_flags
#initialize client side (CPU) arrays
a = array(range(10), dtype=float32)
#create OpenCL buffers
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)
#execute program
program.part1(queue, a.shape, None, a_buf, dest_buf)
c = empty_like(a)
cl.enqueue_read_buffer(queue, dest_buf, c).wait()
print "a", a
print "c", c
The outcome is
a [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
c [ 0. 1. 5. 3. 4. 18. 13. 7. 8. 0.]
As you can see, some of the outcome values are correct. E.g. the third position = 5 = 3 + 2 but e.g. the second position is 2 = 0 + 2. So the summation was over the outcome of other threads at different time-points despite the barrier being in place. I thought the barrier would make sure all threads have reached it and have written their outcome to the global memory?
It's probably something very simple and I would appreciate any hints and comments!
PS: I am running this on a Sandy Bridge CPU using the Intel SDK.