
I would like to write an electromagnetic 2D Finite-Difference Time-Domain (FDTD) code in CUDA. The C code for the update of the magnetic field is the following:

// --- Update for Hy and Hx
for (int i = n1; i <= n2; i++)
    for (int j = n11; j <= n21; j++) {
        Hy[i*ydim+j] = A[i*ydim+j]*Hy[i*ydim+j] + B[i*ydim+j]*(Ezx[(i+1)*ydim+j]-Ezx[i*ydim+j]+Ezy[(i+1)*ydim+j]-Ezy[i*ydim+j]);
        Hx[i*ydim+j] = G[i*ydim+j]*Hx[i*ydim+j] - H[i*ydim+j]*(Ezx[i*ydim+j+1]-Ezx[i*ydim+j]+Ezy[i*ydim+j+1]-Ezy[i*ydim+j]);
    }

My first parallelization attempt was the following kernel:

__global__ void H_update_kernel(double* Hx_h, double* Hy_h, double* Ezx_h, double* Ezy_h, double* A_h, double* B_h, double* G_h, double* H_h, int n1, int n2, int n11, int n21)
{
    // BLOCK_SIZE_X, BLOCK_SIZE_Y and ydim are assumed to be compile-time constants
    int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
    int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

    if ((idx >= n1 && idx <= n2) && (idy >= n11 && idy <= n21)) {
        Hy_h[idx*ydim+idy] = A_h[idx*ydim+idy]*Hy_h[idx*ydim+idy] + B_h[idx*ydim+idy]*(Ezx_h[(idx+1)*ydim+idy]-Ezx_h[idx*ydim+idy]+Ezy_h[(idx+1)*ydim+idy]-Ezy_h[idx*ydim+idy]);
        Hx_h[idx*ydim+idy] = G_h[idx*ydim+idy]*Hx_h[idx*ydim+idy] - H_h[idx*ydim+idy]*(Ezx_h[idx*ydim+idy+1]-Ezx_h[idx*ydim+idy]+Ezy_h[idx*ydim+idy+1]-Ezy_h[idx*ydim+idy]);
    }
}
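
A minimal host-side launch sketch for this kernel could look as follows (the device pointer names d_Hx, d_Hy, etc. are illustrative, and BLOCK_SIZE_X, BLOCK_SIZE_Y, xdim and ydim are assumed to be compile-time constants):

// Sketch of a possible launch configuration (not taken from the original code).
// The grid is sized to cover the full xdim x ydim domain; the kernel's bounds
// check then restricts the update to [n1, n2] x [n11, n21].
dim3 dimBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
dim3 dimGrid((xdim + BLOCK_SIZE_X - 1) / BLOCK_SIZE_X,
             (ydim + BLOCK_SIZE_Y - 1) / BLOCK_SIZE_Y);

// d_Hx, d_Hy, d_Ezx, d_Ezy, d_A, d_B, d_G, d_H are device arrays previously
// allocated with cudaMalloc and initialized with cudaMemcpy.
H_update_kernel<<<dimGrid, dimBlock>>>(d_Hx, d_Hy, d_Ezx, d_Ezy,
                                       d_A, d_B, d_G, d_H,
                                       n1, n2, n11, n21);
cudaDeviceSynchronize();   // wait for the kernel and surface launch errors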

However, after also inspecting the kernel with the Visual Profiler, I am unsatisfied with this solution for two reasons: 1) the memory accesses are poorly coalesced (threads with consecutive values of threadIdx.x access elements that are ydim apart in memory); 2) shared memory is not used.

I then decided to try the following solution:

__global__ void H_update_kernel(double* Hx_h, double* Hy_h, double* Ezx_h, double* Ezy_h, double* A_h, double* B_h, double* G_h, double* H_h, int n1, int n2, int n11, int n21)
{
    // Local thread coordinates and global node indices
    int i   = threadIdx.x;
    int j   = threadIdx.y;
    int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
    int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

    // Linear thread index within the block (threadIdx.x varies fastest)
    int index1 = j*BLOCK_SIZE_X + i;

    // First pass: map the linear index onto the (BLOCK_SIZE_X+1) x (BLOCK_SIZE_Y+1) tile
    int i1 = index1 % (BLOCK_SIZE_X+1);
    int j1 = index1 / (BLOCK_SIZE_X+1);

    // Second pass: the same mapping shifted by the block size, to cover the remaining tile cells
    int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y + index1) % (BLOCK_SIZE_X+1);
    int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y + index1) / (BLOCK_SIZE_X+1);

    // Shared-memory tiles, including one extra row/column for the [i+1] and [j+1] accesses
    // (Ezy_h_shared is declared but not loaded or used below; Ezy is still read from global memory)
    __shared__ double Ezx_h_shared[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];
    __shared__ double Ezy_h_shared[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];

    if (((blockIdx.x*BLOCK_SIZE_X + i1) < xdim) && ((blockIdx.y*BLOCK_SIZE_Y + j1) < ydim))
        Ezx_h_shared[i1][j1] = Ezx_h[(blockIdx.x*BLOCK_SIZE_X + i1)*ydim + (blockIdx.y*BLOCK_SIZE_Y + j1)];

    if (((i2 < (BLOCK_SIZE_X+1)) && (j2 < (BLOCK_SIZE_Y+1))) && (((blockIdx.x*BLOCK_SIZE_X + i2) < xdim) && ((blockIdx.y*BLOCK_SIZE_Y + j2) < ydim)))
        Ezx_h_shared[i2][j2] = Ezx_h[(blockIdx.x*BLOCK_SIZE_X + i2)*ydim + (blockIdx.y*BLOCK_SIZE_Y + j2)];

    __syncthreads();

    if ((idx >= n1 && idx <= n2) && (idy >= n11 && idy <= n21)) {
        Hy_h[idx*ydim+idy] = A_h[idx*ydim+idy]*Hy_h[idx*ydim+idy] + B_h[idx*ydim+idy]*(Ezx_h_shared[i+1][j]-Ezx_h_shared[i][j]+Ezy_h[(idx+1)*ydim+idy]-Ezy_h[idx*ydim+idy]);
        Hx_h[idx*ydim+idy] = G_h[idx*ydim+idy]*Hx_h[idx*ydim+idy] - H_h[idx*ydim+idy]*(Ezx_h_shared[i][j+1]-Ezx_h_shared[i][j]+Ezy_h[idx*ydim+idy+1]-Ezy_h[idx*ydim+idy]);
    }
}

The index trick is needed so that a block of BS_x * BS_y threads can load (BS_x+1)*(BS_y+1) global memory locations into shared memory. I believe that this choice is better than the previous one thanks to the use of shared memory, although not all the accesses are really coalesced; see

Analyzing memory access coalescing of my CUDA kernel
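
As a quick sanity check of the two-pass mapping (only a sketch, assuming square 32x32 blocks), the following small host-side C program verifies that the (BS+1)x(BS+1) tile is completely covered:

// Host-side sanity check (sketch only): with square BS x BS blocks, the
// two-pass index mapping used in the kernel touches every cell of the
// (BS+1) x (BS+1) shared-memory tile.
#include <stdio.h>
#include <string.h>

#define BS 32   // assumed square block size (32x32)

int main(void)
{
    int covered[BS + 1][BS + 1];
    memset(covered, 0, sizeof(covered));

    for (int index1 = 0; index1 < BS * BS; index1++) {
        // first pass
        covered[index1 % (BS + 1)][index1 / (BS + 1)]++;

        // second pass, offset by the number of threads in the block
        int i2 = (BS * BS + index1) % (BS + 1);
        int j2 = (BS * BS + index1) / (BS + 1);
        if (i2 < BS + 1 && j2 < BS + 1)
            covered[i2][j2]++;
    }

    int holes = 0;
    for (int i = 0; i <= BS; i++)
        for (int j = 0; j <= BS; j++)
            if (covered[i][j] == 0)
                holes++;

    printf("uncovered tile cells: %d\n", holes);   // expected output: 0
    return 0;
}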

My question is whether any of you can point me to a better solution in terms of coalesced memory access. Thank you.

Vitality
  • Which architecture are you targeting? Have you checked that you are compiling for the architecture you are targeting? What IPC (Instructions Per Clock) is the profiler reporting? – Roger Dahl Dec 11 '12 at 17:15
  • The architecture is an NVIDIA GeForce GT 540M (Optimus) and I'm compiling for Compute Capability 2.0. I have not yet checked the IPC, but I will do it. I will also compare the timings of the two solutions and report back to you. – Vitality Dec 11 '12 at 21:11
  • I have timed the two solutions. Here is the result as a function of the size of the computational domain `512x512 No shared=9.8ms Shared=4.2ms 1024x1024 No shared=20.9ms Shared=14.5ms 2048x2048 No shared=82.8ms Shared=57.4ms` – Vitality Dec 12 '12 at 20:33
  • @RogerDahl I have measured the IPC. Here are the results. 1) first solution `max=1.209, avg=0.629, min=0.308`; 2) second solution `max=1.655, avg=1.018, min=0.331`. I would conclude that the second solution is better than the first one. Do you have any other comment on those outcomes? – Vitality Dec 13 '12 at 13:41

2 Answers


Thank you for providing the profiling information.

Your second algorithm is better because you are getting a higher IPC. Still, on CC 2.0 the maximum IPC is 2.0, so your average of 1.018 in the second solution means that only half of the available compute power is being used. Normally, that would indicate that your algorithm is memory bound, but I'm not sure in your case, because almost all the code in your kernel is inside if conditionals. A significant amount of warp divergence will affect performance, but I haven't checked whether instructions whose results are not used count towards the IPC.

You may want to look into reading through the texture cache. Textures are optimized for 2D spatial locality and better support semi-random 2D access. It may help your [i][j] type accesses.
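
For double-precision data on compute capability 2.0 this would mean going through the usual int2 trick, since doubles cannot be fetched from textures directly; a minimal sketch with the legacy texture reference API (the names tex_Ezx, fetch_Ezx and d_Ezx are illustrative):

// Sketch only: reading the double-precision Ezx array through the texture
// cache with the (legacy) texture reference API. The array is bound as int2
// and reassembled with __hiloint2double().
texture<int2, 1, cudaReadModeElementType> tex_Ezx;

__device__ double fetch_Ezx(int offset)
{
    int2 v = tex1Dfetch(tex_Ezx, offset);
    return __hiloint2double(v.y, v.x);
}

// In the kernel, a global read such as Ezx_h[(idx+1)*ydim+idy] would then
// become fetch_Ezx((idx+1)*ydim+idy).

// On the host, before the kernel launch (d_Ezx is the device pointer):
//   cudaBindTexture(NULL, tex_Ezx, d_Ezx, xdim * ydim * sizeof(double));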

In the current solution, make sure that it's the Y position ([j]) that changes the least between two threads with adjacent thread IDs (to keep memory accesses as sequential as possible).

It could be that the compiler has already optimized this for you, but you recalculate idx*ydim+idy many times. Try calculating it once and reusing the result. That would have more potential for improvement if your algorithm were compute bound, though.
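
As a small sketch of that last point, applied to the first kernel from the question (ydim is again assumed to be a compile-time constant):

// Sketch: compute the linear offset idx*ydim+idy once per thread and reuse it.
__global__ void H_update_kernel_precomputed(double* Hx_h, double* Hy_h,
                                            const double* Ezx_h, const double* Ezy_h,
                                            const double* A_h, const double* B_h,
                                            const double* G_h, const double* H_h,
                                            int n1, int n2, int n11, int n21)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    if (idx >= n1 && idx <= n2 && idy >= n11 && idy <= n21) {
        const int p = idx * ydim + idy;      // computed once, reused below
        // (idx+1)*ydim+idy == p + ydim and idx*ydim+idy+1 == p + 1
        Hy_h[p] = A_h[p] * Hy_h[p] + B_h[p] * (Ezx_h[p + ydim] - Ezx_h[p] + Ezy_h[p + ydim] - Ezy_h[p]);
        Hx_h[p] = G_h[p] * Hx_h[p] - H_h[p] * (Ezx_h[p + 1]    - Ezx_h[p] + Ezy_h[p + 1]    - Ezy_h[p]);
    }
}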

Roger Dahl
  • Thank you for your suggestions. It seems to me that I have already analyzed the branching efficiency with the Visual Profiler, and it returned around 99%. I will check it and give you a more precise answer. Concerning the use of texture memory, I already had in mind to go in this direction, so I will try to implement it and let you know after Christmas. I will also check whether the memory is read in the right order and avoid recalculating `idx*ydim+idy`. Again, thank you. – Vitality Dec 14 '12 at 22:33
  • It seems that defining a variable for `idx*ydim+idy` and then using it for the accesses does not have any relevant influence on the IPC. With the CUDA debugger's CUDAWarpWatch, I have checked that `[j]`, `[j1]` and `[j2]` change the least between two "consecutive" threads. In other words, if I watch, for example, `i` and `j`, CUDAWarpWatch lists the variations of `i` before those of `j`. – Vitality Dec 17 '12 at 22:15

I believe that in this case your first parallel solution is better, because each thread reads each global memory array element only once. Therefore, storing these arrays in shared memory doesn't bring you the expected improvement.

Staging the data in shared memory could speed up your program thanks to better coalesced access to global memory, but IMHO this is offset by the caching of global memory accesses if you are using Compute Capability 2.x, and the use of shared memory can also be degraded by bank conflicts.

stuhlo
  • Please, see the answers to Roger Dahl. It seems that you should conclude the opposite. Should you revise your statements? – Vitality Dec 13 '12 at 13:43
  • @JackOLantern Definitely. Considering the profiling information, it would be improper to stand by this statement. However, I will ask: what is the size of the blocks in your first solution? – stuhlo Dec 14 '12 at 12:44
  • @JackOLantern And also what are the approximate values of `n1, n2, n11, n21`? – stuhlo Dec 14 '12 at 12:56
  • I'm comparing both solutions with the same block size (32x32). n1, n2, n11 and n21 are indices defining the size of the region bounded by the wavefront. They give a benefit at the beginning of the algorithm by saving computation, but after a while n1=0, n2=xdim-1, n11=0, and n21=ydim, where xdim and ydim are the dimensions of the computational region. – Vitality Dec 14 '12 at 21:25