35

I'm trying to develop a small program with CUDA, but since it was SLOW I made some tests and googled a bit. I found out that while single variables are by default stored within the local thread memory, arrays usually aren't. I suppose that's why it takes so much time to execute. Now I wonder: since local thread memory should be at least 16 KB, and since my arrays are just like 52 chars long, is there any way (syntax please :) ) to store them in local memory?

Shouldn't it be something like:

__global__ my_kernel(int a)
{
  __local__ unsigned char p[50];
}
Matteo Monti
  • If your data are constant, and if all threads in a block or warp access the same index at the same instruction, then you should consider putting the data in a `__constant__` array. See the programming guide for full details. – harrism Apr 25 '12 at 00:14

5 Answers

91

Arrays, local memory and registers

There is a misconception here regarding the definition of "local memory". "Local memory" in CUDA is actually global memory (and should really be called "thread-local global memory") with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread's data blocked together). If you want things to be really fast, you want to use either shared memory or, better yet, registers (especially on the latest devices, where you get up to 255 registers per thread). Explaining the entire CUDA memory hierarchy is beyond the scope of this post; let us instead focus on making small array computations fast.

Small arrays, just like single variables, can be stored entirely in registers. On current NVIDIA hardware, however, putting arrays into registers is difficult. Why? Because registers need very careful treatment. If you don't do it exactly right, your data will end up in local memory (which, again, is really global memory, the slowest memory you have). Section 5.3.2 of the CUDA Programming Guide tells you when local memory is used:

Local Memory

Local memory accesses only occur for some automatic variables as mentioned in Variable Type Qualifiers. Automatic variables that the compiler is likely to place in local memory are:

  1. Arrays for which it cannot determine that they are indexed with constant quantities,
  2. Large structures or arrays that would consume too much register space,
  3. Any variable if the kernel uses more registers than available (this is also known as register spilling).

How does register allocation work?

Note that register allocation is an extremely complicated process, which is why you cannot (and should not) interfere with it. Instead, the compiler first converts CUDA code into PTX code (a sort of bytecode) which assumes a machine with infinitely many registers. You can write inline PTX, but it won't do much to register allocation. PTX is device-independent code and is only the first stage. In a second stage, PTX is compiled into device assembly code, called SASS. SASS code has the actual register allocations. The SASS compiler and its optimizer are also the ultimate authority on whether a variable will be in registers or local memory. All you can do is try to understand what the SASS compiler does in certain cases and use that to your advantage. The code correlation view in Nsight can help you with that (see below). However, since the compiler and optimizer keep changing, there are no guarantees as to what will or will not be in registers.

Insufficient registers

Appendix G, section 1 tells you how many registers a thread can have. Look for "Maximum number of 32-bit registers per thread". In order to interpret that table, you must know your compute capability (see below). Don't forget that registers are used for all kinds of things and don't just correspond to single variables. Registers on all devices up to CC 3.5 are 32 bit each. If the compiler is smart enough (and the CUDA compiler keeps changing), it can, for example, pack multiple bytes into the same register. The Nsight code correlation view (see "Analyzing memory accesses" below) also reveals that.
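
For illustration, here is a hand-written sketch of that kind of packing; the helper functions below are purely illustrative, not something the compiler exposes:

// Sketch: packing four chars into one 32-bit value by hand, similar in
// spirit to what the compiler may do when it packs small types into a
// single 32-bit register.
__device__ __forceinline__
unsigned int packBytes(unsigned char b0, unsigned char b1,
                       unsigned char b2, unsigned char b3)
{
    return (unsigned int)b0
         | ((unsigned int)b1 << 8)
         | ((unsigned int)b2 << 16)
         | ((unsigned int)b3 << 24);
}

__device__ __forceinline__
unsigned char unpackByte(unsigned int packed, int i)
{
    // The shift amount may be a run-time value; the packed data still
    // lives in a register, since no array indexing is involved.
    return (unsigned char)((packed >> (8 * i)) & 0xFF);
}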

Constant vs. Dynamic Indexing

While the space constraint is an obvious hurdle to in-register arrays, the thing that is easily overlooked is the fact that, on current hardware (Compute Capability 3.x and below), the compiler places any array in local memory that is accessed with dynamic indexing. A dynamic index is an index which the compiler cannot figure out at compile time. Arrays accessed with dynamic indices can't be placed in registers because registers must be determined by the compiler, and thus the actual register being used must not depend on a value determined at run time. For example, given an array arr, arr[k] is constant indexing if and only if k is a constant or only depends on constants. If k, in any way, depends on some non-constant value, the compiler cannot compute the value of k and you get dynamic indexing. In loops where k starts and ends at (small) constant numbers, the compiler (most probably) can unroll your loop and still achieve constant indexing.
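
To make the distinction concrete, here is a minimal sketch (the kernel and names are only illustrative):

__global__
void
indexingExamples(const int * aInput, int * aOutput, int k)
{
    int a[4];

    // Constant indexing: every index is known at compile time,
    // so each element can live in its own register.
    a[0] = aInput[0]; a[1] = aInput[1]; a[2] = aInput[2]; a[3] = aInput[3];

    // Still constant indexing: the loop bounds are compile-time
    // constants, so the compiler can fully unroll the loop.
    #pragma unroll
    for (int i = 0; i < 4; ++i)
    {
        aOutput[i] = a[i] * 2;
    }

    // Dynamic indexing: k is only known at run time, so on the
    // hardware discussed here the whole array is placed in local
    // memory to serve this access.
    aOutput[4] = a[k];
}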

Example

For example, sorting a small array can be done in registers but you must use sorting networks or similarly "hard-wired" approaches, and can't just use a standard algorithm because most algorithms use dynamic indexing.

With quite a high probability, in the following code example the compiler keeps the entire aBytes array in registers, because it is not too large and the loops can be fully unrolled (the loops iterate over a constant range). The compiler (very probably) knows which register is being accessed at every step and can thus keep the data fully in registers. Keep in mind that there are no guarantees. The best you can do is to verify it on a case-by-case basis using the CUDA developer tools, as described below.

__global__
void
testSortingNetwork4(const char * aInput, char * aResult)
{
    const int NBytes = 4;

    char aBytes[NBytes];

    // copy input to local array
    for (int i = 0; i < NBytes; ++i)
    {
        aBytes[i] = aInput[i];
    }

    // sort using sorting network
    CompareAndSwap(aBytes, 0, 2); CompareAndSwap(aBytes, 1, 3); 
    CompareAndSwap(aBytes, 0, 1); CompareAndSwap(aBytes, 2, 3); 
    CompareAndSwap(aBytes, 1, 2); 


    // copy back to result array
    for (int i = 0; i < NBytes; ++i)
    {
        aResult[i] = aBytes[i];
    }
}
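
CompareAndSwap itself is not shown above; a minimal version (declared before the kernel) could look like the following. Because it is force-inlined, the constant indices at the call sites propagate into the array accesses:

// A possible CompareAndSwap: swaps the two elements if they are out of
// order. After inlining, i and j are compile-time constants at every
// call site above, so the aBytes accesses stay constant-indexed.
__device__ __forceinline__
void
CompareAndSwap(char * aBytes, int i, int j)
{
    char lhs = aBytes[i];
    char rhs = aBytes[j];
    if (lhs > rhs)
    {
        aBytes[i] = rhs;
        aBytes[j] = lhs;
    }
}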

Analyzing memory accesses

Once you are done, you generally want to verify whether the data is actually stored in registers or whether it went to local memory. The first thing you can do is to tell your compiler to give you memory statistics using the --ptxas-options=-v flag. A more detailed way of analyzing memory accesses is using Nsight.

Nsight has many cool features. Nsight for Visual Studio has a built-in profiler and a CUDA <-> SASS code correlation view. The feature is explained here. Note that Nsight versions for different IDEs are probably developed independently, and thus their features might vary between the different implementations.

If you follow the instructions in the above link (make sure to add the corresponding flags when compiling!), you can find the "CUDA Memory Transactions" button at the very bottom of the lower menu. In that view, you want to find that there is no memory transaction coming from the lines that only work on the corresponding array (e.g. the CompareAndSwap lines in my code example). If it does not report any memory access for those lines, you (very probably) were able to keep the entire computation in registers and might just have gained a speed-up of thousands, if not tens of thousands, of percent (you might also want to check the actual speed gain you get out of this!).

Figuring out Compute Capability

In order to figure out how many registers you have, you need to know your device's compute capability. The standard way of getting such device information is running the deviceQuery sample.

(Update - as mentioned by paleonix in the comments) deviceQuery is part of the official cuda-samples repo. You can find it here.

If you have Nsight for Visual Studio, just go to Nsight -> Windows -> System Info.
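
You can also query the compute capability (and register limits) programmatically through the CUDA runtime API; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // prop.major / prop.minor give the compute capability,
    // prop.regsPerBlock the number of 32-bit registers available per block.
    printf("Device %d: %s, compute capability %d.%d, %d registers per block\n",
           device, prop.name, prop.major, prop.minor, prop.regsPerBlock);
    return 0;
}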

Don't optimize early

I am sharing this today because I came across this very problem very recently. However, as mentioned in this thread, forcing data to be in registers is definitely not the first step you want to take. First, make sure that you actually understand what is going on, then approach the problem step by step. Looking at the assembly code is certainly a good step, but it should generally not be your first. If you are new to CUDA, the CUDA Best Practices Guide will help you figure out some of those steps.

Domi
  • Since at least compute capability 2.0, local memory is on-chip. The SM's local memory is split between shared memory and "register overspill" memory, or local memory. It's the same physical memory as L1 cache, and can be configured with the option -Xptxas -dlcm=cg. This reduces shared mem to 16 KiB, leaving 48 KiB as L1/local memory. See _Cuda Application Design and Development_, Rob Farber, Elsevier, Chapter 5. – jspencer Dec 09 '14 at 00:08
  • @jspencer I think the author of the book is a bit loose with the definition of local memory. I quote the official CUDA Programming Guide, section 5.3.2: "The local memory space resides in device memory, so local memory accesses have same high latency and low bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as described in Device Memory Accesses. Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs." – Domi Dec 09 '14 at 02:46
  • The C Programming Guide (CPG) seems loose, too--clearly local is logically separate from global memory, but the location is vague. Talk of coalescing suggests Global as the location. Maybe if the contents are generated by the SM, it's initially cached in L1 and will achieve shared mem access times if not evicted but can be subject to global access times (if evicted). The CPG says in 5.3.2, below the above quote: "On devices of compute capability 2.x, local memory accesses are always cached in L1 and L2 in the same way as global memory accesses." L1 caching of local memory can't be disabled. – jspencer Dec 10 '14 at 19:05
  • @jspencer I am not sure, I get your point. If you think that I wrote something that was incorrect, please be more precise. – Domi Dec 10 '14 at 23:29
  • My point was I don't think the _documentation_ is precise. But I think the caching explanation for why local memory could be considered on-chip is a reasonable theory. Farber is an insider, and his book is copyright nVidia, so they have a confusing front on this issue. I was just trying to rectify the two published references, but I can't say you're wrong or that this theory is right. – jspencer Dec 16 '14 at 00:35
  • @jspencer Over time, the compiler and architecture will be smarter and smarter about where to allocate local memory. I agree: Maintaining that all "local memory" is and always will be off-chip is certainly not a good idea. Even now, the caches can mitigate a lot of the round-trip overhead, but depending on your settings and current driver implementations, that could apply to any kind of global memory operation, not just "local memory". Nevertheless, the term "local" is applied strictly to the logical, not the physical memory model. – Domi Apr 21 '15 at 09:01
  • It's strange, but `-maxrregcount=72` (against 94 registers used w/o this option) gives exactly 1.5x speedup for a CUDA raytracer. – Tomilov Anatoliy May 30 '18 at 17:49
  • As pointed out [here](https://devtalk.nvidia.com/default/topic/537188/implications-of-the-default-setting-0-for-max-used-register-maxrregcount-/) `maxrregcount` is "mostly" obsolete. A hard register limit can force occupancy to go up, but can also have a whole lot of unwanted side effects, which is why fine-tuning through `launch bounds` is the new way of going about doing things. Can you get the same speed up with those? – Domi May 31 '18 at 16:23
  • In the latest CUDA version (11.6), the CUDA samples were moved completely to the [Github repo](https://github.com/NVIDIA/cuda-samples). In the CUDA toolkit you will only find a README with a link to the repo. – paleonix Apr 18 '22 at 09:40
  • Although the CUDA manual says enough about local memory, the first link is missing and is also not available on [the Wayback Machine](https://web.archive.org/web/20220915000000*/https://graphics.cg.uni-saarland.de/fileadmin/cguds/courses/ss14/pp_cuda/slides/02_-_CUDA_Memory_Hierarchy.pdf). Do you have a copy of it, or could you give a summary of the link? @Domi – zg c Jul 30 '23 at 07:10
11

All you need is this:

__global__ void my_kernel(int a)
{
    unsigned char p[50];
    // ...
}

The compiler will automatically spill this to thread-local memory if it needs to. But be aware that local memory is stored in off-chip DRAM and is as slow as global memory, so if you are hoping that this will yield a performance improvement, you may be in for a disappointment.....

talonmies
  • Oh.. good.. so.. is there any chance to store those 50 chars in register memory? – Matteo Monti Apr 24 '12 at 11:53
  • This is automatically stored in registers if you use a constant array size. Only if you exceed the number of registers is data stored in local thread memory. – djmj Apr 24 '12 at 12:46
  • Very little, I would have thought - depending on the architecture you only have 63 or 127 total registers at your disposal. Packing the data into 25 `uchar4` might be a possibility, though (registers are 32 bit), if you can unroll your code to work on a vector type. – talonmies Apr 24 '12 at 12:46
  • @djmj: No it isn't. Unless the compiler can completely unroll array access and identify every element access explicitly at compile time, it will be spilled to local memory. – talonmies Apr 24 '12 at 12:47
  • I just assumed that he wants to store it in registers and not local memory, as his comment suggested. So he should use a constant value to speed up his algorithm. Depending on his architecture, setup, and the rest of the algorithm, it can be possible, as you stated, to store it in registers. – djmj Apr 24 '12 at 12:52
  • @djmj:Even if the array has a small, constant size, the compiler won't *automatically* store the array in registers unless the complete runtime codepath of each entry in the array can be fully deduced by the compiler. This behaviour is similar to the strict rules applied to loop unrolling. Otherwise the array will be spilled into local memory. – talonmies Apr 24 '12 at 13:03
  • The CUDA Guide 4.1, section 5.3.2.2, does not say anything about it, or you have some knowledge about it which is not that straightforward, since I see no reason or difference in storing something in local memory if enough free registers are left. Do you have some source for me to read about it? – djmj Apr 24 '12 at 13:09
  • @djmj: talonmies is referring to the statement that "arrays for which it cannot determine that they are indexed with constant quantities [will be placed in local memory]". – Tom Apr 24 '12 at 14:11
  • Ok, I was referring to constant as a #DEFINE. Are we sure that with #DEFINE it is in registers if the register count is enough? – djmj Apr 24 '12 at 18:08
  • How would you #define run-time data? – harrism Apr 25 '12 at 00:11
  • Who was talking about run-time data? I am just answering the opening question where the array has a constant size of 50. – djmj Apr 25 '12 at 00:42
  • @djmj: It isn't whether the size is constant which is at question here (that is a non-sequitur, CUDA only supports statically declared arrays anyway). It is whether the *indexing* of the array in the code is at all times done with constant values which determines whether the compiler will spill to local memory or not. This is why your #DEFINE comment makes no sense and why harrism commented as he did. It is not how you declare/define the array that determines whether it will be in registers, it is *how it is used in the code* – talonmies Apr 25 '12 at 05:34
1

~ For someone who runs across this in the future ~

In a nutshell, to give each thread its own private array, a little bit of shared memory can be carved out per thread. Special attention must be paid to avoid bank conflicts, or performance will drop.

Here is an example from an NVIDIA blog post by Maxim Milakov in 2015:

// Should be multiple of 32
#define THREADBLOCK_SIZE 64 
// Could be any number, but the whole array should fit into shared memory 
#define ARRAY_SIZE 32 

__device__ __forceinline__ int no_bank_conflict_index(int thread_id, int logical_index)
{
    return logical_index * THREADBLOCK_SIZE + thread_id;
}

__global__ void kernel5(float * buf, int * index_buf)
{
    // Declare shared memory array A which will hold virtual 
    // private arrays of size ARRAY_SIZE elements for all 
    // THREADBLOCK_SIZE threads of a threadblock
    __shared__ float A[ARRAY_SIZE * THREADBLOCK_SIZE]; 
    ...
    int index = index_buf[threadIdx.x + blockIdx.x * blockDim.x];

    // Here we assume thread block is 1D so threadIdx.x 
    // enumerates all threads in the thread block
    float val = A[no_bank_conflict_index(threadIdx.x, index)];
    ...
}
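
The kernel is then launched with a block size of THREADBLOCK_SIZE, since the index math assumes blockDim.x == THREADBLOCK_SIZE. A hypothetical launch (buffer names and block count are placeholders):

// Hypothetical launch: d_buf and d_index_buf are device buffers allocated
// and filled elsewhere; the block size must match THREADBLOCK_SIZE.
int numBlocks = 128;
kernel5<<<numBlocks, THREADBLOCK_SIZE>>>(d_buf, d_index_buf);
cudaDeviceSynchronize();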
SunsetQuest
-1

You are mixing up local and register memory space.

Single variables and constant-sized arrays are automatically saved in register space on the chip, with almost no cost for reads and writes.

If you exceed the number of registers available per multiprocessor, they will get stored in local memory.

Local memory resides in global memory space and has the same slow bandwidth for read and write operations.

#define P_SIZE 50

__global__ void kernel()
{
    unsigned char p[P_SIZE];
}
djmj
-1

The keyword you are looking for is __shared__. Large arrays will not fit in the shared memory space, but the compiler should use shared memory for a small fixed-size array like in this case. You can use the __shared__ keyword to ensure this happens. You will see a compile-time error if you exceed the maximum amount of shared memory for a block.

Bruce Hart
  • shared memory should be used if the data is shared between threads, which is not the case here – djmj Apr 24 '12 at 17:58
  • @djmj is providing many misleading comments. `__shared__` memory is useful for more than just sharing data. It can be used as a per-thread scratchpad -- beneficial if the same data are accessed multiple times per thread. It can also be used for efficient reordering of data so that global memory accesses can be done sequentially (see the transpose CUDA SDK sample for an example of this). – harrism Apr 25 '12 at 00:12
  • The compiler will not automatically use shared memory, you *must* use the __shared__ specifier. – harrism Apr 25 '12 at 00:13