I am trying to statically allocate a __shared__ variable while targeting the sm_86 device architecture. If the variable is larger than 48 KB, nvcc fails to compile with the error:

Entry function uses too much shared data
I am running Visual Studio 2019 with CUDA 11.2 and have set Configuration properties -> CUDA C/C++ -> Device -> Code Generation to compute_86,sm_86.
According to the table here, shouldn't devices of compute capability 8.6 allow a maximum of 100 KB of shared memory per block?
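To compare the table against what the runtime reports for the installed GPU, a small device query along these lines might help. It is only a sketch: it assumes the sm_86 card is device 0, and it prints the `sharedMemPerBlock`, `sharedMemPerBlockOptin`, and `sharedMemPerMultiprocessor` fields of `cudaDeviceProp` without interpreting them.

```cuda
#include <cstdio>
#include "cuda_runtime.h"

int main() {
    cudaDeviceProp prop;
    // Assumption: device 0 is the sm_86 GPU in question.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Compute capability:         %d.%d\n", prop.major, prop.minor);
    // Per-block shared memory available without any opt-in.
    std::printf("sharedMemPerBlock:          %zu bytes\n", prop.sharedMemPerBlock);
    // Per-block maximum that a kernel can use by special opt-in.
    std::printf("sharedMemPerBlockOptin:     %zu bytes\n", prop.sharedMemPerBlockOptin);
    // Total shared memory available per multiprocessor.
    std::printf("sharedMemPerMultiprocessor: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```

If the first two values differ, the larger figure is presumably the one the table refers to, and the smaller one is the limit that applies when nothing is opted in.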
Here is a minimal code sample that fails for me:
    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

    #define SIZE 12288 /* x 4 byte = 48 KB <---- Works */
    /* #define SIZE 12289 */ /* <---- Fails */

    __global__ void kernel(int* d) {
        __shared__ int s[SIZE];
        s[threadIdx.x] = 1;
        d[threadIdx.x] = s[threadIdx.x];
    }

    int main() {
        int* d_a;
        cudaMalloc(&d_a, SIZE * sizeof(int));
        kernel<<<1, 1024>>>(d_a);
        cudaDeviceSynchronize();
        cudaFree(d_a);
    }