I was trying to implement the FDTD update equations on the GPU. My first kernel used only global memory, and its memory coalescing wasn't great, so I wrote a second kernel that stages the values in shared memory. I am working on a 1024x1024 grid.

The code is below:
    __global__ void update_Hx(float *Hx, float *Ez, float *coef1, float *coef2){
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        int offset = x + y * blockDim.x * gridDim.x;

        __shared__ float Ez_shared[BLOCKSIZE_HX][BLOCKSIZE_HY + 1];

        /* Each thread stages its own Ez value; the last row of the block
           also loads the halo row below it (x_index_dim is the row stride,
           defined elsewhere). */
        if(threadIdx.y == (blockDim.y - 1)){
            Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
            Ez_shared[threadIdx.x][threadIdx.y + 1] = Ez[offset + x_index_dim];
        }
        else{
            Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
        }
    }
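For reference, an equivalent way to write the load phase that I have been considering (just a sketch, not verified to change the profiler numbers): hoist the common load out of the branch so every thread issues the same `Ez[offset]` request, and add a `__syncthreads()` before the staged values are consumed.

```cuda
__global__ void update_Hx_v2(float *Hx, float *Ez, float *coef1, float *coef2){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    __shared__ float Ez_shared[BLOCKSIZE_HX][BLOCKSIZE_HY + 1];

    /* Uniform load: every thread in the warp issues this request. */
    Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];

    /* Only the last row of the block fetches the halo row below it. */
    if (threadIdx.y == blockDim.y - 1)
        Ez_shared[threadIdx.x][threadIdx.y + 1] = Ez[offset + x_index_dim];

    /* Make the staged values visible to the whole block before use. */
    __syncthreads();

    /* ... the Hx update using Ez_shared would follow here ... */
}
```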
The constants BLOCKSIZE_HX and BLOCKSIZE_HY are both 16.
When I run the Visual Profiler, it still reports that the memory accesses are not coalesced.
EDIT:
I am using a GT 520 graphics card with CUDA compute capability 2.1.
My global L2 transactions per access is 7.5, i.e. there are 245760 L2 transactions for 32768 executions of the line

    Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];

Global memory load efficiency is 50%, where

    Global memory load efficiency = 100 * gld_requested_throughput / gld_throughput
I am not able to figure out why there are so many memory accesses, even though my threads are reading 16 consecutive values. Can somebody point out what I am doing wrong?
EDIT: Thanks for all the help.