44

I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.

For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something the uses local memory. How would you do it? Here's the original kernel source:

__kernel square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}

If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.

Ciro Santilli OurBigBook.com
  • 347,512
  • 102
  • 1,199
  • 985
splicer
  • 5,344
  • 4
  • 42
  • 47

3 Answers3

34

Check out the samples in the NVIDIA or AMD SDKs, they should point you in the right direction. Matrix transpose would use local memory for example.

Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter.

__kernel square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // if the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read
        output[gtid] =  temp[ltid] * temp[ltid];
    }
}
Tom
  • 20,852
  • 4
  • 42
  • 54
  • 4
    I've read through the NVIDIA introductory material, and I still find the examples too complex. I'm looking for an über-simple 1-dimensional example of using local memory to get my feet wet. – splicer Apr 02 '10 at 12:54
  • 7
    Thanks for adding code in your last edit! I can't seem to get your kernel working though.... How would I use clSetKernelArg() for temp? Do I need to use clCreateBuffer() for temp? Also, there are a few typos in your kernel: "temp * temp" should be "temp[ltid] * temp[ltid]", and a closing brace should be inserted before the last line. – splicer Apr 03 '10 at 22:56
  • Running on the CPU under Snow Leopard, I tried clSetKernelArg(kernel, 2, sizeof(cl_float), NULL); but it crashes. Any ideas? – splicer Apr 03 '10 at 23:11
  • 3
    I corrected the typos - serves me right for typing on ipod. Your clSetKernelArg is not allocating enough memory though, you need space for one cl_float per thread (you have only allocated one float). Try: `clSetKernelArg(kernel, 2, sizeof(cl_float) * local_work_size[0], NULL);` where `local_work_size[0]` is the work group size in dimension 0. – Tom Apr 04 '10 at 14:36
  • Thanks! Looks like you're missing a semicolon on line 11. On the CPU, get_local_size(0) returns 1 for me, so shouldn't my use clSetKernelArg work? Is this a bug in Apple's implementation? – splicer Apr 06 '10 at 13:37
  • 12
    Note that you can declare variables as local with the qualifier `__local`. For example, you could do `__local float values[GROUP_SIZE];` then have each thread write `values[get_local_id(0)] = ...`. Local memory doesn't need to be reached via a pointer passed into the kernel. –  May 26 '10 at 08:25
  • @EdwardLuong: actually, your suggestion does not seem to work, at least not on my Macbook AMD Radeon HD 6750M. Wish it did though, this could make my Objective-C OpenCL Class a lot simpler. FYI, when implementing this approach, I get error code = -11; clBuildProgram (a CL_BUILD_PROGRAM_FAILURE). Thanks anyway. – Bruce Dean Jun 04 '12 at 20:30
  • 1
    Remember local storage is only visible to the same work group. So you work group size needs to be > 1 to share it across multiple threads. – Tim Child Aug 01 '12 at 21:59
  • I thought one has to use a local memory barrier after writing to local? – mike Apr 19 '15 at 15:49
  • 1
    @mike: in the example above each thread is using a unique part of the local memory as its own personal scratchpad, so a barrier is not required. However, if threads were going to communicate through the local memory, i.e. read data written by another thread, then yes a barrier would be required. Given that this is a response to a beginner, I should have mentioned this since it's an obvious pitfall in the future. – Tom May 02 '15 at 16:21
  • Will copying the data into local buffer as above improve the speed of the kernel? I am thinking the hardware might have fetched data into the L1 cache. – fyquah95 Oct 03 '15 at 21:28
31

There is another possibility to do this, if the size of the local memory is constant. Without using a pointer in the kernels parameter list, the local buffer can be declared within the kernel just by declaring it __local:

__local float localBuffer[1024];

This removes code due to less clSetKernelArg calls.

Rick-Rainer Ludwig
  • 2,371
  • 1
  • 26
  • 42
  • 1
    This is true but it would be way more useful if you didn't have to know the size of the array at run-time. This is desirable when encapsulating OpenCL functionality within and Object Class. E.g., see EdwardLuong's comment above; it would be great if his suggestion could work (does not seem to work for my hardware). Thanks. – Bruce Dean Jun 04 '12 at 20:36
5

In OpenCL local memory is meant to share data across all work items in a workgroup. And it usually requires to do a barrier call before the local memory data can be used (for example, one work item wants to read a local memory data that is written by the other work items). Barrier is costly in hardware. Keep in mind, local memory should be used for repeated data read/write. Bank conflict should be avoided as much as possible.

If you are not careful with local memory, you may end up with worse performance some time than using global memory.

Hunter Wang
  • 179
  • 1
  • 5