
I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100.

__kernel void myKernel(
        const int length, 
        const int height, 
        and a bunch of other parameters) {

    //declare some local arrays to be shared by all 100 work items in this group
    __local float LP [length];
    __local float LT [height];
    __local int bitErrors = 0;
    __local bool failed = false;

    //here come my actual computations which utilize the space in LP and LT
}

This however refuses to compile, since the parameters length and height are not known at compile time. But it is not clear to me at all how to do this correctly. Should I use pointers with memalloc? How do I handle this so that the memory is allocated only once for the entire workgroup and not once per work item?

All that I need is 2 arrays of floats, 1 int and 1 boolean that are shared among the entire workgroup (so all 100 work items). But I fail to find any method that does this correctly...

user1111929

3 Answers


It's relatively simple: you can pass the local arrays as arguments to your kernel:

kernel void myKernel(const int length, const int height, local float* LP, 
                     local float* LT, a bunch of other parameters) 

You then set the kernel argument with a value of NULL and a size equal to the size you want to allocate for the argument (in bytes). Therefore it should be:

clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
clSetKernelArg(kernel, 3, height * sizeof(cl_float), NULL);

Local memory is always shared by the workgroup (as opposed to private memory), so I think the bool and int should be fine, but if not you can always pass those as arguments too.

Not really related to your problem (and not necessarily relevant, since I do not know what hardware you plan to run this on), but GPUs at least don't particularly like work-group sizes that are not a multiple of a particular power of two (I think it was 32 for NVIDIA and 64 for AMD). A work-group of 100 items will therefore probably be executed as if it had 128 items, of which the last 28 are basically wasted. So if you are running OpenCL on a GPU, it might help performance to use work-groups of size 128 directly (and change the global work size accordingly).
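A minimal sketch of the padding arithmetic suggested above, written as plain C host code. The helper names `round_up` and `padded_sizes` are hypothetical, and the multiple of 128 is only an example; the right value depends on the device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of `multiple` that is >= n. */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Fill the 2D local and global work sizes, padding the second dimension
 * so that each row of the problem still maps to exactly one work-group. */
void padded_sizes(size_t rows, size_t items_per_row, size_t multiple,
                  size_t local[2], size_t global[2])
{
    local[0] = 1;
    local[1] = round_up(items_per_row, multiple);
    global[0] = rows;
    global[1] = local[1]; /* one work-group per row */
}
```

For the sizes in the question (1000000 rows of 100 items, padded to a multiple of 128) this gives a local size of 1 x 128 and a global size of 1000000 x 128. The kernel would then have to skip the real work for work-items whose get_local_id(1) is 100 or more, while still letting them reach every barrier.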

As a side note: I never understood why everyone uses the underscore variants of kernel, local and global; they seem much uglier to me.

0xF
Grizzly
  • Hmmm... so how should I resolve this workgroup size issue? The 100 is a nominal value, it can vary depending on the exact instance of the problem, but I need these local memory variables for each 1x100 subblock of the global input. I assume there is no way of making a variable shared among its proper 1x100 subblock if that subblock is not a workgroup? (And as for the side note, I never tried without the __, thanks for the hint!) – user1111929 Jan 17 '12 at 07:30
  • I should probably be passing a two-dimensional array then to the memory, but how do I do this? I can pass something along the lines of `clSetKernelArg(kernel, 2, length * local_work_size[0] * sizeof(cl_float), NULL);` but this is a one-dimensional array. It would probably be better to make it 2-dimensional, or not? – user1111929 Jan 17 '12 at 07:41
  • Also, the bool and int were not fine, I get an `error: variable "bitErrors" may not be initialized`. If I replace it with your local statements above, I get `error: a parameter cannot be allocated in a named address space`. Of course I can make it an array of length 1, but that is not really the most beautiful solution... :-) – user1111929 Jan 17 '12 at 07:49
  • @user1111929: You are correct, you can only share variables inside a workgroup. I just mentioned it in case you could change your work-group size. And passing the local arrays as one-dimensional is really the way to go; OpenCL doesn't really support nested pointers. Regarding the `bool` and `int`: I meant passing them as pointers if it didn't work like that. – Grizzly Jan 17 '12 at 14:51
  • I believe the second argument to the second `clSetKernelArg()` call should be `3`. As posted, this would set the same argument twice. – Reto Koradi Nov 17 '15 at 23:27
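The one-dimensional layout recommended in the comments can still be addressed as if it were two-dimensional. Below is a small sketch in plain C of the row-major index arithmetic; the helper names `flat_index` and `local_buffer_bytes` are hypothetical, and `sizeof(float)` stands in for `sizeof(cl_float)` on the assumption that both are 32-bit floats.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: position of element (row, col) in a row-major
 * buffer whose rows each hold `row_length` elements. */
size_t flat_index(size_t row, size_t col, size_t row_length)
{
    assert(col < row_length);
    return row * row_length + col;
}

/* Hypothetical helper: byte count to pass to clSetKernelArg for a
 * local float buffer holding `rows` rows of `row_length` elements. */
size_t local_buffer_bytes(size_t rows, size_t row_length)
{
    return rows * row_length * sizeof(float);
}
```

Inside the kernel the same expression applies, for example `LP[row * length + col]`.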

You could also declare your arrays like this:

__local float LP[LENGTH];

And pass LENGTH as a define when you compile the kernel:

int lp_size = 128; // this is an example; could be dynamically calculated
char compileArgs[64];
// snprintf bounds the write to the size of the buffer
snprintf(compileArgs, sizeof(compileArgs), "-DLENGTH=%d", lp_size);
clBuildProgram(program, 0, NULL, compileArgs, NULL, NULL);
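Since the question needs two arrays, the same idea extends to two defines. The following is a sketch only: the helper name `build_options` and the macro name `HEIGHT` are assumptions that do not appear in the answer above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes "-DLENGTH=<length> -DHEIGHT=<height>"
 * into buf. Returns 0 on success, -1 if the output was truncated or
 * formatting failed. */
int build_options(char *buf, size_t buf_size, int length, int height)
{
    assert(buf != NULL && buf_size > 0);
    int written = snprintf(buf, buf_size, "-DLENGTH=%d -DHEIGHT=%d",
                           length, height);
    return (written < 0 || (size_t)written >= buf_size) ? -1 : 0;
}
```

The resulting string would be passed as the options argument of clBuildProgram, and the kernel would declare `__local float LT[HEIGHT];` next to `LP`. Checking the return value avoids building with a silently truncated option string.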
David B
  • This is useful to know, but doesn't address the question about how to dynamically allocate during run-time (not at compile-time). – MasterHD Sep 03 '18 at 15:28
  • @MasterHD OpenCL code can be (and in most simple examples indeed is) compiled at runtime, so this answer does address the question. – Dinei May 02 '20 at 03:18

You do not have to allocate all your local memory outside the kernel, especially when it is a simple variable instead of an array.

Your local scalars fail to compile because OpenCL does not support initializing local memory in its declaration. This is specified in the documentation (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html). It is also not possible in CUDA (see "Is there a way of setting default value for shared memory array?").
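One common way around this restriction is to declare the local scalars without an initializer, let a single work-item assign the starting values, and then synchronize the work-group. The sketch below keeps the kernel as a string in C host code, the way it would be handed to clCreateProgramWithSource; the variable `kernel_source` and the helper `kernel_source_length` are hypothetical names, and the kernel body is an illustration rather than the asker's actual code.

```c
#include <assert.h>
#include <string.h>

/* OpenCL C source kept as a string for clCreateProgramWithSource. */
const char *kernel_source =
    "kernel void myKernel(const int length, const int height,\n"
    "                     local float *LP, local float *LT)\n"
    "{\n"
    "    /* Local scalars are declared without an initializer. */\n"
    "    local int bitErrors;\n"
    "    local bool failed;\n"
    "\n"
    "    /* One work-item per group writes the starting values. */\n"
    "    if (get_local_id(0) == 0 && get_local_id(1) == 0) {\n"
    "        bitErrors = 0;\n"
    "        failed = false;\n"
    "    }\n"
    "    /* Make the writes visible to the whole work-group. */\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "\n"
    "    /* ... computations using LP, LT, bitErrors and failed ... */\n"
    "}\n";

/* Length of the source in bytes, usable as the entry of the lengths
 * array that clCreateProgramWithSource optionally accepts. */
size_t kernel_source_length(void)
{
    assert(kernel_source != NULL);
    return strlen(kernel_source);
}
```

The barrier matters: without it, other work-items in the group could read `bitErrors` or `failed` before the first work-item has written them.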


PS: The answer from Grizzly is good enough, and it would be better if I could post this as a comment, but I am restricted by the reputation policy. Sorry.

Community
youwei