Following this question with reference to the shared memory example in the official guide, I'm trying to build the heat equation matrix, which is just like in this poorly drawn image I made
Here's what I've done so far, minimal example
#define N 32
#define BLOCK_SIZE 16
#define NUM_BLOCKS ((N + BLOCK_SIZE - 1)/ BLOCK_SIZE)
__global__ void heat_matrix(int* A)
{
const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
__shared__ int temp_sm_A[N*N];
int* temp_A = &temp_sm_A[0]; memset(temp_A, 0, N*N*sizeof(int));
if (tid < N) //(*)
{
#pragma unroll
for (unsigned int m = 0; m < NUM_BLOCKS; ++m)
{
#pragma unroll
for (unsigned int e = 0; e < BLOCK_SIZE ; ++e)
{
if ( (tid == 0 && e == 0) || (tid == (N-1) && e == (BLOCK_SIZE-1) ) )
{
temp_A[tid + (e + BLOCK_SIZE * m) * N] = -2;
temp_A[tid + (e + BLOCK_SIZE * m) * N + ( tid==0 ? 1 : -1 )] = 1;
}
if ( tid == e )
{
temp_A[tid + (e + BLOCK_SIZE * m) * N - 1] = 1;
//printf("temp_A[%d] = 1;\n", (tid + (e + BLOCK_SIZE * m) * N -1));
temp_A[tid + (e + BLOCK_SIZE * m) * N] = -2;
//printf("temp_A[%d] = -2;\n", (tid + (e + BLOCK_SIZE * m) * N));
temp_A[tid + (e + BLOCK_SIZE * m) * N + 1] = 1;
//printf("temp_A[%d] = 1;\n", (tid + (e + BLOCK_SIZE * m) * N +1));
}
}
}
__syncthreads(); //(**)
memcpy(A, temp_A, N*N*sizeof(int));
}
}
int main(){
int* h_A = (int*)malloc(N*N*sizeof(int)); memset(h_A, 0, N*N*sizeof(int));
int* d_A;
checkCudaErrors(cudaMalloc((void**)&d_A, N*N*sizeof(int)));
checkCudaErrors(cudaMemcpy(d_A, h_A, N*N*sizeof(int), cudaMemcpyHostToDevice));
dim3 dim_grid((N/2 + BLOCK_SIZE -1)/ BLOCK_SIZE);
dim3 dim_block(BLOCK_SIZE);
heat_matrix <<< dim_grid, dim_block >>> (d_A);
checkCudaErrors(cudaMemcpy(h_A, d_A, N*N*sizeof(int), cudaMemcpyDeviceToHost));
...
}
The code is arranged to suit a large N (larger than 32). I took advantage of the block division. When executing nvcc
yields
CUDA error at matrix.cu:102 code=77(cudaErrorIllegalAddress) "cudaMemcpy(h_A, d_A, N*N*sizeof(int), cudaMemcpyDeviceToHost)"
And cuda-memcheck
provides only one error (actually there is another, but it comes from cudasuccess=checkCudaErrors(cudaDeviceReset()); ...
)
========= CUDA-MEMCHECK
========= Invalid __shared__ write of size 4
========= at 0x00000cd0 in heat_matrix(int*)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0xfffffffc is out of bounds
...
I can't see where I did wrong in the code. How can the thread 0
in the first block provoke an illegal access? There's even the specific if
case to deal with it, and there isn't reported the line of the code in which the error occurred.
Moreover, is there a more efficient way for my code than to deal with all those if
s? Sure there is, but I couldn't find a better parallel expression to split the cases into the second for
.
On a side note, to me the (*)
seems unnecessary; instead (**)
is necessary if I want to follow with other GPU function calls. Am I right?