
Is there a way to see how much shared memory is reserved by a kernel, i.e. how much shared memory isn't participating in the L1 cache?

I have a workload of fully independent threads (embarrassingly parallel), and I have no need for shared memory. However, the work is very cache size constrained, and I want to ensure there's no "pinned" shared memory kept in reserve and not participating in the L1 cache.

Is there some way to inspect the output of a (profiling?) run to ensure no shared memory was allocated? Or equivalently that the entirety of the physical L1 cache was used for the logical L1 cache?

Relatedly, I'm very confused by the existence of the cudaFuncSetCacheConfig function. Given that the amount of shared memory is known at kernel launch time, what does it mean to "PreferL1" or "PreferShared"? Why not allocate exactly what's needed as shared memory and keep the rest for L1?

I'm on Ampere, CC 8.6.

talonmies
emchristiansen

1 Answer


Is there some way to inspect the output of a (profiling?) run to ensure no shared memory was allocated?

Using the Nsight Compute CLI (`ncu`), the "Launch Statistics" section will tell you how much statically allocated shared memory and how much dynamically allocated shared memory was requested by that kernel launch. Example:

Section: Launch Statistics
----------------------------------------- --------------- ------------------------------
Block Size                                                                         1,024
Grid Size                                                                          1,024
Registers Per Thread                      register/thread                             16
Shared Memory Configuration Size                     byte                              0
Dynamic Shared Memory Per Block                byte/block                              0 <-
Static Shared Memory Per Block                 byte/block                              0 <-
Threads                                            thread                      1,048,576
Waves Per SM                                                                        6.40
----------------------------------------- --------------- ------------------------------
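
A report like the one above can be collected with a command of this form (the binary name below is a placeholder, not something from the question):

```
# Collect only the Launch Statistics section for each kernel launch.
# "./my_app" stands in for your own application binary.
ncu --section LaunchStats ./my_app
```

If `Static Shared Memory Per Block` and `Dynamic Shared Memory Per Block` are both zero, the kernel requested no shared memory.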

It's preferred that you ask one question per question on SO. Regarding your second question, the preferred method of setting shared memory configuration on a cc8.x device is given in the programming guide and it does not use cudaFuncSetCacheConfig. The principal reason that an API of this kind is required is given in that programming guide section:

Devices of compute capability 8.0 allow a single thread block to address up to 163 KB of shared memory, while devices of compute capability 8.6 allow up to 99 KB of shared memory. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, and must use dynamic shared memory rather than statically sized shared memory arrays. These kernels require an explicit opt-in by using cudaFuncSetAttribute() to set the cudaFuncAttributeMaxDynamicSharedMemorySize;

Also see here.
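As a concrete illustration of that opt-in, here is a minimal sketch; the kernel name `myKernel` and the 64 KB figure are placeholders, not taken from the question:

```cuda
// On cc 8.x, a kernel using more than 48 KB of dynamically allocated
// shared memory must opt in explicitly. Sketch only; names are placeholders.
extern __shared__ float smem[];

__global__ void myKernel()
{
    // ... work that uses smem ...
}

int main()
{
    const size_t smemBytes = 64 * 1024;  // > 48 KB, so the opt-in is required
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smemBytes);
    myKernel<<<1, 256, smemBytes>>>();
    cudaDeviceSynchronize();
    return 0;
}
```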

The function you reference, cudaFuncSetCacheConfig, predates GPU architectures that allowed more than 48KB of shared memory to be accessed from a single block in device code, and primarily has those earlier architectural generations in view. Also see here.
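On cc 8.x, the preferred-carveout attribute plays the role that cudaFuncSetCacheConfig played on older parts, and cudaFuncGetAttributes lets you confirm programmatically that a kernel requests no static shared memory. A minimal sketch, with `myKernel` again a placeholder:

```cuda
#include <cstdio>

// Fully independent threads, no __shared__ usage (placeholder kernel).
__global__ void myKernel() { }

int main()
{
    // Request the maximum-L1 carveout. This is a hint to the driver,
    // not a guarantee of a particular physical split.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxL1);

    // Query the kernel's statically allocated shared memory.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("static shared memory per block: %zu bytes\n",
           attr.sharedSizeBytes);
    return 0;
}
```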

Robert Crovella