Is there some way to inspect the output of a (profiling?) run to ensure no shared memory was allocated?
Using the Nsight Compute CLI (ncu), the "Launch Statistics" section will tell you how much statically allocated and how much dynamically allocated shared memory was requested by a given kernel launch. Example:
Section: Launch Statistics
----------------------------------------- --------------- ------------------------------
Block Size                                                                          1,024
Grid Size                                                                           1,024
Registers Per Thread                      register/thread                              16
Shared Memory Configuration Size                     byte                               0
Dynamic Shared Memory Per Block                byte/block                               0   <-
Static Shared Memory Per Block                 byte/block                               0   <-
Threads                                            thread                       1,048,576
Waves Per SM                                                                         6.40
----------------------------------------- --------------- ------------------------------
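For context, here is a minimal sketch of a kernel that would produce a report like the one above. The file and kernel names are hypothetical, and the ncu invocation in the comment assumes the LaunchStats section identifier, which selects the Launch Statistics section:

// vec_init.cu - illustrative kernel that allocates no shared memory,
// so both arrowed shared-memory rows in the report above read 0.
//
// Build and profile (assuming nvcc and ncu are on the PATH):
//   nvcc -o vec_init vec_init.cu
//   ncu --section LaunchStats ./vec_init
#include <cuda_runtime.h>

__global__ void vec_init(float *out)
{
    // Register and global-memory traffic only: no __shared__ variables,
    // and no dynamic shared memory requested at launch below.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 1.0f;
}

int main()
{
    const int threads = 1024, blocks = 1024;  // matches the sizes in the report
    float *out;
    cudaMalloc(&out, (size_t)threads * blocks * sizeof(float));
    vec_init<<<blocks, threads>>>(out);  // third launch parameter (dynamic
                                         // shared bytes) omitted, so it is 0
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}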
It's preferred that you ask one question per question on SO. Regarding your second question: the preferred method of setting the shared memory configuration on a cc 8.x device is given in the programming guide, and it does not use cudaFuncSetCacheConfig. The principal reason that an API of this kind is required is given in that programming guide section:
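As a sketch of that preferred method (the kernel name here is hypothetical): on cc 8.x the carveout preference is expressed per-kernel with cudaFuncSetAttribute and the cudaFuncAttributePreferredSharedMemoryCarveout attribute:

#include <cuda_runtime.h>

__global__ void my_kernel() { /* ... */ }

int main()
{
    // Ask for the L1/shared split to favor shared memory. The value is a
    // percentage (0-100) or one of the named carveouts; it is a hint the
    // driver may round to a supported configuration, not a guarantee.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
    my_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}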
Devices of compute capability 8.0 allow a single thread block to address up to 163 KB of shared memory, while devices of compute capability 8.6 allow up to 99 KB of shared memory. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, and must use dynamic shared memory rather than statically sized shared memory arrays. These kernels require an explicit opt-in by using cudaFuncSetAttribute() to set the cudaFuncAttributeMaxDynamicSharedMemorySize;
Also see here.
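A sketch of that opt-in, with illustrative names and sizes: requesting, say, 96 KB of dynamic shared memory per block on a cc 8.0 device requires an extern __shared__ array plus the attribute call before launch:

#include <cuda_runtime.h>

__global__ void big_smem_kernel(float *out)
{
    extern __shared__ float smem[];  // dynamic: size fixed at launch, not compile time
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main()
{
    const int smem_bytes = 96 * 1024;  // > 48 KB, so the opt-in below is mandatory

    // Raise this kernel's dynamic shared memory ceiling above the 48 KB default.
    cudaFuncSetAttribute(big_smem_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smem_bytes);

    float *out;
    cudaMalloc(&out, 1024 * sizeof(float));
    big_smem_kernel<<<1, 1024, smem_bytes>>>(out);  // third param = dynamic bytes
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}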
The function you reference, cudaFuncSetCacheConfig, predates GPU architectures that have more than 48 KB of shared memory accessible from a single block in device code, and primarily has those earlier architectural generations in view. Also see here.