I have a kernel that uses about 2 GB of local memory. A cudaMalloc that tries to allocate 2.5 GB fails if I run that kernel_func first. I found out that the 2 GB is still occupied after kernel_func has finished, which leaves only 1.5 GB for my cudaMalloc. Does anyone have a solution or an explanation?
I know that using global memory for kernel_func can solve the problem, but for some reason I need to use local memory for that huge static per-thread array.
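For reference, the global-memory variant I mean would look roughly like this. This is just a sketch: kernel_func_global and scratch are made-up names, and the idea is simply that every thread indexes its own 50000-short slice of one big cudaMalloc'd buffer, which can then be freed explicitly.

// Sketch of the global-memory variant (kernel_func_global and scratch are placeholder names).
__global__ void kernel_func_global(short *scratch) {
    // Each thread works on its own 50000-element slice of one big global buffer.
    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    short *my_array = scratch + tid * 50000;
    // ... same work as before, using my_array ...
}

int main() {
    short *scratch = nullptr;
    size_t n_threads = 64 * 128;                                   // 8192 threads in this launch
    cudaMalloc((void **)&scratch, n_threads * 50000 * sizeof(short)); // ~0.8 GB for this launch
    kernel_func_global<<<64, 128>>>(scratch);
    cudaDeviceSynchronize();
    cudaFree(scratch);  // released here, so a later 2.5 GB cudaMalloc can succeed
}

The local-memory version that actually shows the problem is below.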
__global__ void kernel_func() {
    // The huge static per-thread array goes here (50000 shorts per thread)
    short my_array[50000];
}

int main() {
    kernel_func<<<64, 128>>>();
    cudaDeviceSynchronize();

    // The local memory backing my_array is still reserved at this point,
    // so this cudaMalloc fails with an out-of-memory error.
    void *ptr;
    cudaMalloc(&ptr, 2500ULL * 1024 * 1024);   // ~2.5 GB
}
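To confirm the behaviour, printing the free device memory before the launch and again after a cudaDeviceSynchronize shows roughly 2 GB less free memory even though the kernel has finished. A minimal sketch, reusing kernel_func from the repro above:

#include <cstdio>

// Small helper for this sketch only: prints free/total device memory.
static void print_free(const char *label) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("%s: %zu MB free of %zu MB total\n", label,
           free_bytes >> 20, total_bytes >> 20);
}

int main() {
    print_free("before launch");
    kernel_func<<<64, 128>>>();
    cudaDeviceSynchronize();                 // kernel has definitely finished here
    print_free("after kernel finished");     // still reports ~2 GB less free memory
    return 0;
}

The drop in free memory reported by the second line matches the ~2 GB I see still being occupied, which is what blocks the 2.5 GB cudaMalloc.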