
I understand the difference between global and local memory in general, but I am having trouble using local memory.

1) What has to be considered when transforming global-memory variables into local-memory variables?

2) How do I use local barriers?

Maybe someone can help me with a little example.

I tried to do a Jacobi computation using local memory, but I only get 0 as the result. Maybe someone can give me some advice.

Working Solution:

#define IDX(_M,_i,_j) (_M)[(_i) * N + (_j)]
#define U(_i, _j)     IDX(uL, _i, _j)

__kernel void jacobi(__global VALUE* u, __global VALUE* f, __global VALUE* tmp, VALUE factor) {

    int i = get_global_id(0);
    int j = get_global_id(1);

    int iL = get_local_id(0);
    int jL = get_local_id(1);

    __local VALUE uL[(N+2)*(N+2)];
    __local VALUE fL[(N+2)*(N+2)];

    IDX(uL, iL, jL) = IDX(u, i, j);
    IDX(fL, iL, jL) = IDX(f, i, j);

    barrier(CLK_LOCAL_MEM_FENCE);

    IDX(tmp, i, j) = (VALUE)0.25 * ( U(iL-1, jL) + U(iL, jL-1) + U(iL, jL+1) + U(iL+1, jL) - factor * IDX(fL, iL, jL));

}

Thanks.

SteveOhio
  • You need to allocate local arrays with a fixed size, such as __local VALUE uL[128]. Then you can get a pointer to it if you need one. The size must be known at compile time (at least for OpenCL 1.2). – huseyin tugrul buyukisik Nov 14 '16 at 11:08
  • thank you again. What size should __local VALUE uL[?] and __local VALUE fL[?] have when my matrices u, f and tmp have a size of, e.g., 10x10? – SteveOhio Nov 14 '16 at 11:23
  • They access their first-closest neighbours, so if a work group works on a 16x16 area, the local memory needs a 1-wide safety line around the outside, i.e. 18x18. If the 10x10 you mentioned is the per-work-group area, it should be 12x12 so that closest-neighbour accesses don't overflow it. The threads also need to address the interior of the local array: x=0, y=0 in global should access x=1, y=1 in local, and 9,9 in global should access 10,10 in local, so its closest neighbour 11,11 is still within the array bounds. So you need more cells than threads per work group, which requires a second load. – huseyin tugrul buyukisik Nov 14 '16 at 11:29
  • Instead of copying element-wise, vload instructions can be used to get bigger chunks from main memory. – huseyin tugrul buyukisik Nov 14 '16 at 11:45
  • ok, it works - thanks a lot. One last question: where could the problem be when the result of the local-memory version is way bigger than the result of the global-memory version? Both versions use the same 10x10 matrices and 100 Jacobi iterations. I already corrected the mistake in my matrix indices from i, j to iL and jL. – SteveOhio Nov 14 '16 at 11:52
  • If the numbers are like 1231434252, or negative like that, it could be accessing out of bounds (and getting garbage values, maybe even other variables?). Is this error new, or does it date back to the first version? Do you mean the single-load version? – huseyin tugrul buyukisik Nov 14 '16 at 11:59
  • the global-memory version is from my professor, so I don't think there are any mistakes in it. The values from the global-memory version are like 0.15 or -0.67, pretty low numbers. The local-memory version yields numbers like 34.95 or -5.82. Both versions get the same initial matrices. – SteveOhio Nov 14 '16 at 12:03
  • So those are normalized values. Are you doing that 100 times inside the kernel, or are you running the kernel 100 times? What about your professor's version? – huseyin tugrul buyukisik Nov 14 '16 at 12:06
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/128060/discussion-between-steveohio-and-huseyin-tugrul-buyukisik). – SteveOhio Nov 14 '16 at 12:08
  • To me, local memory is only useful when coordinating with other work items in a work group to reduce global memory bandwidth: for example, when re-using values read from global memory for other work items, or to enable coalesced reads from or writes to global memory, as in matrix multiplication. – Dithermaster Dec 03 '16 at 19:42

1 Answer

  • 1) Query the CL_DEVICE_LOCAL_MEM_SIZE value; it is 16 kB minimum and increases on different hardware. If your local variables fit in it and are re-used many times, you should put them in local memory before usage. Even if you don't, automatic use of the L2 cache when accessing a GPU's global memory can still keep the cores well utilized.

    If the global-to-local copy is taking an important slice of time, you can do an async work-group copy while the cores are calculating.

    Another important point is that more free local-memory space means more concurrent threads per core. If a GPU has 64 cores per compute unit, only 64 threads can run when all local memory is used up. With more space left, 128, 192, ..., 2560 threads can run at the same time, if there are no other limitations.

    A profiler can show the bottlenecks, so you can decide whether it is worth a try or not.

    For example, a naive matrix-matrix multiplication using nested loops relies on the L1/L2 caches, but submatrices can fit in local memory. Maybe 48x48 submatrices of floats can fit in a mid-range graphics card's compute unit and be used N times for the whole calculation before being replaced by the next submatrix.

    Querying CL_DEVICE_LOCAL_MEM_TYPE can return LOCAL or GLOBAL; GLOBAL also means it is not recommended to use local memory on that device.

    Lastly, the size of any memory allocation (except __private) must be known at compile time (by the device compiler, not the host), because the compiler must know how many wavefronts can be issued to achieve maximum performance (and/or maybe apply other optimizations). That is why no recursive function is allowed in OpenCL 1.2. But you can copy a function and rename it N times to get pseudo-recursiveness.

  • 2) Barriers are a meeting point for all threads of a work group. Similar to cyclic barriers, they all stop there and wait for the rest before continuing. If it is a local barrier, all work-group threads finish any pending local-memory operations before departing from that point. If you want to write some numbers 1, 2, 3, 4, ... to a local array, you can't be sure whether all threads have finished writing them until a local barrier is passed; after it, it is certain that the array already holds its final values.

    All work-group threads must hit the same barrier. If one of them cannot reach it, the kernel hangs or you get an error.


__local int localArray[64]; // not per thread: shared by all threads
                            // of the work group, per compute unit

if(localThreadId!=0)
    localArray[localThreadId]=localThreadId; // 63 values written in parallel
// not sure yet whether the 2nd thread is done writing,
// same for the last thread

if(localThreadId==0) // 1st thread of each work group loads from VRAM
    localArray[localThreadId]=globalArray[globalThreadId];

barrier(CLK_LOCAL_MEM_FENCE); // probably all threads wait for the 1st thread
                              // (maybe even the 1st SIMD or
                              // could be even the whole 1st wavefront!)
// here all threads have written their own id to the local array. Safe to read,
// except the first element, which is a value from global memory.
// Let's add that value to all the other values:
if(localThreadId!=0)
   localArray[localThreadId]+=localArray[0];

Working example (local work-group size = 64):

inputs: 0,1,2,3,4,0,0,0,0,0,0,..

    __kernel void vecAdd(__global float* x)
    {
        int id = get_global_id(0);
        int idL = get_local_id(0);
        __local float loc[64];
        loc[idL] = x[id];
        barrier(CLK_LOCAL_MEM_FENCE);
        float distance_square_sum = 0;
        for(int i = 0; i < 64; i++)
        {
            float diff = loc[idL] - loc[i];
            float diff_squared = diff * diff;
            distance_square_sum += diff_squared;
        }
        x[id] = distance_square_sum;
    }

output: 30, 74, 246, 546, 974, 30, 30, 30...

huseyin tugrul buyukisik
  • thank you for the good explanation. Can you prepare a short example of a kernel function that does a simple computation, like a matrix multiplication, using local memory? Big thanks. – SteveOhio Nov 13 '16 at 16:22
  • @SteveOhio added an example – huseyin tugrul buyukisik Nov 13 '16 at 16:50
  • Also, barriers can be used to reset branching if the compiler is not doing that already. – huseyin tugrul buyukisik Nov 13 '16 at 16:56
  • I added my program above .. maybe you can give me some advice. Thanks. – SteveOhio Nov 14 '16 at 10:43
  • I also added the "any memory space allocation (except __private) size must be known at compile time" part an hour ago and commented on your question about it. – huseyin tugrul buyukisik Nov 14 '16 at 11:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/128056/discussion-between-steveohio-and-huseyin-tugrul-buyukisik). – SteveOhio Nov 14 '16 at 11:58
  • can we control this outside the source code as well? i.e. using a front-end OpenCL compiler option? – Amir Apr 05 '17 at 20:34
  • @Amir do you mean the size of local memory? Maybe some vendors have extensions to adjust the ratio between local memory and L1 cache? I don't know. Or do you mean choosing a dynamic array size outside the compiler? There are compile-time constant definitions for that, as far as I know. – huseyin tugrul buyukisik Apr 05 '17 at 20:38
  • @huseyintugrulbuyukisik, I mean choosing whether to use local or global memory without changing the source code – Amir Apr 05 '17 at 20:52
  • @Amir I don't know that. Generally every device needs different optimizations than the others. – huseyin tugrul buyukisik Apr 05 '17 at 21:03
  • sorry to mislead you maybe. Looking at your other answer at http://stackoverflow.com/questions/21872810/whats-the-advantage-of-the-local-memory-in-opencl, I was wondering how you normally force a GPU to use its local memory rather than global memory. Is it done in the source code, or can it be controlled outside? – Amir Apr 05 '17 at 21:05
  • @Amir you mean in the kernel string? Yes, you either define local memory as a kernel parameter or declare it in the kernel body and use it. – huseyin tugrul buyukisik Apr 05 '17 at 21:08
  • @huseyintugrulbuyukisik, I have a more general question unanswered here as well that I thought you might know about: http://stackoverflow.com/questions/43238587/proper-way-of-compiling-opencl-applications-and-using-available-compiler-options – Amir Apr 05 '17 at 21:09