My GPU, a GeForce GTX 1050 Ti, has compute capability 6.1. According to the CUDA documentation, it has 96 KB of shared memory per streaming multiprocessor.
How can I query this limit from within a program?
I tried calling
cudaDeviceGetAttribute(&value, cudaDevAttrMaxSharedMemoryPerBlock, device);
and it returned 49152 bytes (48 KB), which is half of what I expected. Why is that, and how can I get the actual maximum?