25

Since Compute Capability 2.0 (Fermi) was released, I've been wondering whether there are any use cases left for shared memory. That is, when is it better to use shared memory than to just let L1 perform its magic in the background?

Is shared memory simply there to let algorithms designed for CC < 2.0 run efficiently without modifications?

To collaborate via shared memory, threads in a block write to shared memory and synchronize with __syncthreads(). Why not simply write to global memory (through L1) and synchronize with __threadfence_block()? The latter option should be easier to implement, since it doesn't have to keep track of values in two different locations, and it should be faster because there is no explicit copying from global to shared memory. Since the data gets cached in L1, threads don't have to wait for the data to actually make it all the way out to global memory.

With shared memory, one is guaranteed that a value placed there remains there for the duration of the block. This is as opposed to values in L1, which get evicted if they are not used often enough. Are there any cases where it's better to cache such rarely used data in shared memory than to let L1 manage it based on the usage pattern the algorithm actually has?
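To make the comparison concrete, here is a rough, untested sketch of the two variants I have in mind (TILE, the kernels, and the per-block scratch buffer are made up purely for illustration):

    #define TILE 256

    // Variant A: collaborate through shared memory and __syncthreads().
    __global__ void reverse_shared(const float *in, float *out)
    {
        __shared__ float buf[TILE];
        int i = blockIdx.x * TILE + threadIdx.x;

        buf[threadIdx.x] = in[i];              // stage the tile in shared memory
        __syncthreads();                       // all writes now visible to the block
        out[i] = buf[TILE - 1 - threadIdx.x];  // read a value another thread wrote
    }

    // Variant B: collaborate through global memory (cached in L1); 'scratch' is a
    // per-block staging area in global memory instead of a __shared__ array.
    __global__ void reverse_global(const float *in, float *out, float *scratch)
    {
        float *buf = scratch + blockIdx.x * TILE;
        int i = blockIdx.x * TILE + threadIdx.x;

        buf[threadIdx.x] = in[i];
        __threadfence_block();                 // orders the writes within the block...
        __syncthreads();                       // ...but a barrier is still needed to
                                               // know the other threads have written
        out[i] = buf[TILE - 1 - threadIdx.x];
    }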

edited by danronmoon
asked by Roger Dahl

3 Answers

14

There are two big reasons why automatic caching is less efficient than manual scratchpad memory (this applies to CPUs as well):

  1. Parallel accesses to random addresses are more efficient. Example: histogramming. Let's say you want to increment N bins, each more than 256 bytes apart. Then, due to coalescing rules, that will result in N serialized reads/writes, since global and cache memory is organized in large ~256-byte blocks. Shared memory doesn't have that problem (see the histogram sketch after this list).

     Also, to access global memory, you have to do virtual-to-physical address translation. Having a TLB that can do lots of translations in parallel would be quite expensive. I haven't seen any SIMD architecture that actually does vector loads/stores in parallel, and I believe this is the reason why.

  2. It avoids writing back dead values to memory, which wastes bandwidth and power. Example: in an image processing pipeline, you don't want your intermediate images to get flushed to memory (see the pipeline sketch further below).

     Also, according to an NVIDIA employee, current L1 caches are write-through (they immediately write to the L2 cache), which will slow down your program.
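To make the first point concrete, here is a minimal sketch of a privatized shared-memory histogram (256 bins and a grid-stride loop are assumptions for the example, not something taken from elsewhere in this answer):

    #define NUM_BINS 256

    // Each block builds a private histogram in shared memory, where scattered
    // per-thread increments don't have to obey global-memory coalescing rules,
    // and merges it into the global histogram once at the end.
    __global__ void histogram_smem(const unsigned char *data, int n,
                                   unsigned int *global_hist)
    {
        __shared__ unsigned int local_hist[NUM_BINS];

        // Cooperatively zero the shared bins.
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            local_hist[b] = 0;
        __syncthreads();

        // The random increments hit shared memory, not large global-memory blocks.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&local_hist[data[i]], 1u);
        __syncthreads();

        // One coalesced merge into global memory per block.
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            atomicAdd(&global_hist[b], local_hist[b]);
    }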

So basically, the caches get in the way if you really want performance.
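And to illustrate the second point, a toy two-stage pipeline in which the intermediate tile lives only in shared memory and is never written back through L2 to DRAM (the actual per-stage math is just a placeholder):

    #define TILE 128

    // Stage 1 produces an intermediate value per element; stage 2 consumes it.
    // The intermediate tile never leaves the SM, so no bandwidth or power is
    // spent flushing dead values out to memory.
    __global__ void two_stage_pipeline(const float *in, float *out)
    {
        __shared__ float intermediate[TILE];
        int i = blockIdx.x * TILE + threadIdx.x;

        // Stage 1: some per-element transform (placeholder math).
        intermediate[threadIdx.x] = in[i] * 0.5f + 1.0f;
        __syncthreads();

        // Stage 2: a small neighborhood operation on the intermediate tile.
        int left  = max((int)threadIdx.x - 1, 0);
        int right = min((int)threadIdx.x + 1, TILE - 1);
        out[i] = (intermediate[left] + intermediate[threadIdx.x]
                  + intermediate[right]) / 3.0f;
    }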

edited by paleonix
answered by Yale Zhang
    Compute Capability 2.* and 3.* devices invalidate the L1 cache line on write. Compute Capability 3.0-3.5 devices do not cache global reads in L1. On Compute Capability 3.* devices, the shared memory bandwidth with 8 bytes per bank is actually 256 bytes/clk, whereas L1 is limited to the 128 bytes of a cache line. As stated by Yale, shared memory has bank conflicts (all accesses must be to different banks or to the same address in a bank), whereas L1 has address divergence (all addresses must be in the same 128-byte cache line), so shared memory is much more efficient at random access. – Greg Smith Jul 15 '13 at 22:56
    Let me offer a conjecture on why SIMD memory access is practically non-existent on general purpose processors (e.g. Intel AVX2 has a gather, but it's really serial). I'm quite convinced it's because of the big cost of doing virtual to physical address translations, which shared memory access doesn't need because it's its own address space. Imagine the cost of having to do 32 TLB lookups in parallel! Maybe there's an optimization if all 32 addresses fall in the same page? – Yale Zhang Dec 02 '14 at 19:16
10

As far as I know, the L1 cache in a GPU behaves much like the cache in a CPU. So your comment that "this is as opposed to values in L1, which get evicted if they are not used often enough" doesn't make much sense to me.

Data in the L1 cache isn't evicted when it isn't used often enough. Usually it is evicted when a request is made for a memory region that wasn't previously in the cache and whose address resolves to a location that is already in use. I don't know the exact caching algorithm employed by NVIDIA, but assuming a regular n-way set-associative cache, each memory entry can only be cached in a small subset of the entire cache, based on its address.

I suppose this may also answer your question. With shared memory, you get full control over what gets stored where, while with the cache, everything is done automatically. Even though the compiler and the GPU can still be very clever about optimizing memory accesses, you can sometimes still find a better way, since you're the one who knows what input will be given and what each thread will do (to a certain extent, of course).
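As a rough illustration (the cache geometry here is invented, not NVIDIA's actual L1 design), the set an address maps to in an n-way set-associative cache depends only on a few address bits, so a handful of frequently used lines can still evict each other if they happen to land in the same set:

    #include <cstdio>

    // Toy set-associative geometry -- the numbers are invented for illustration.
    const int LINE_BYTES = 128;   // bytes per cache line
    const int NUM_SETS   = 32;    // number of sets
    const int WAYS       = 4;     // lines that fit in each set

    // The set a byte address falls into depends only on these address bits.
    int set_index(size_t addr) { return (int)((addr / LINE_BYTES) % NUM_SETS); }

    int main()
    {
        // Addresses LINE_BYTES * NUM_SETS bytes apart land in the same set, so
        // more than WAYS such "hot" lines will evict each other no matter how
        // often each of them is used.
        size_t a = 0x10000;
        size_t b = a + (size_t)LINE_BYTES * NUM_SETS;
        printf("set(a) = %d, set(b) = %d, ways per set = %d\n",
               set_index(a), set_index(b), WAYS);
        return 0;
    }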

answered by Naps62
    Thank you, that does answer my question. I had pictured the cache as being able to keep track of which elements were being used the most, and prefer to cache those. I've read up on n-way associative caches now and it looks to me like the main problem is that they may throw out a value that is often used simply because another cache line fits in that slot. – Roger Dahl Jul 01 '12 at 14:49
    I think that means that a good strategy for writing CUDA programs may often be to first write the algorithm to use only global memory and see if L1 works well enough that memory latency is hidden. And then consider hand optimizing with shared memory if the algorithm turns out to be memory bound. – Roger Dahl Jul 01 '12 at 14:53
2

Caching data through several memory layers always needs to follow a cache-coherency protocol. There are several such protocols, and the decision on which one is the most suitable is always a trade-off.

You can have a look at some examples of such protocols, e.g. MESI and MOESI.

I don't want to go into too much detail, because it is a huge domain and I am not an expert. What I want to point out is that in a shared-memory system (here the term shared does not refer to the so-called shared memory of GPUs) where many compute units (CUs) need data concurrently, there is a memory protocol that attempts to keep the data close to the units so that they can fetch it as fast as possible.

In the example of a GPU, when many threads in the same SM (streaming multiprocessor) access the same data, there should be coherency in the sense that if thread 1 reads a chunk of bytes from global memory and in the next cycle thread 2 is going to access that data, then an efficient implementation would be one where thread 2 is aware that the data is already in the L1 cache and can access it quickly. This is what the cache-coherency protocol attempts to achieve: to let all compute units stay up to date about what data exist in the L1 and L2 caches and so on.

However, keeping threads up to date, or in other words keeping the caches in coherent states, comes at a cost, which is essentially lost cycles.

In CUDA, by declaring memory as shared rather than relying on the L1 cache, you free it from that coherency protocol. So access to that memory (which is physically the same piece of hardware) is direct and does not implicitly invoke the machinery of the coherency protocol.

I don't know how much faster this should be, and I didn't perform any such benchmark, but the idea is that since you no longer pay for this protocol, the access should be faster!

Of course, the shared memory on NVIDIA GPUs is split into banks, and anyone who wants to use it for a performance improvement should have a look at this first. The reason is bank conflicts, which occur when two threads access the same bank; this causes the accesses to be serialized, but that's another topic (see the sketch below).
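To give an idea of what this looks like in practice, here is a sketch of the classic shared-memory tile transpose, where the usual one-element row padding avoids the bank conflicts that column-wise accesses would otherwise cause (the 32x32 tile and a matching 32x32 thread block are assumptions):

    #define TILE 32

    // Launch with a TILE x TILE thread block and a grid covering a square
    // matrix whose width is a multiple of TILE.
    __global__ void transpose_tile(const float *in, float *out, int width)
    {
        // Without the +1 padding, reading a column would make all 32 threads of
        // a warp hit the same bank (the addresses are 32 floats apart), which
        // serializes the access; the padding spreads the column across banks.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // row-wise write: no conflict
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y]; // column-wise read, made
                                                             // conflict-free by the padding
    }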

paleonix