Can anyone help me to understand why the following code causes a segmentation fault? Likewise, can anyone help me understand why swapping out the two lines labelled "bad" for the two lines labelled "good" does not result in a segmentation fault?
Note that the seg fault seems to occur at the cudaMalloc line; if I comment that out I also do not see a segmentation fault. These allocations seem to be stepping on each other, but I don't understand how.
The intent of the code is to set up three structures: h_P on the host, which will be populated by a CPU routine; d_P on the device, which will be populated by a GPU routine; and h_P_copy on the host, which will be populated by copying the GPU data structure back in.
That way I can verify correct behavior and benchmark one vs the other.
All of those are, indeed, four-dimensional arrays.
(If it matters, the card in question is a GTX 580, using nvcc 4.2 under SUSE Linux)
#include <stdlib.h>        // malloc, needed for the "good" variant
#include <cuda_runtime.h>  // cudaMalloc, cudaMemset

#define NUM_STATES 32
#define NUM_MEMORY 16

int main( int argc, char** argv) {
    // allocate and create P matrix; P_size is a size in bytes
    int P_size = sizeof(float) * NUM_STATES * NUM_STATES * NUM_MEMORY * NUM_MEMORY;
    // float *h_P = (float*) malloc (P_size); **good**
    // float *h_P_copy = (float*) malloc (P_size); **good**
    float h_P[P_size]; // **bad**
    float h_P_copy[P_size]; // **bad**
    float *d_P;
    cudaMalloc( (void**) &d_P, P_size);
    cudaMemset( d_P, 0.0, P_size);
}