
I am trying to implement the FDTD update equations on the GPU. My initial kernel used global memory directly, and its memory coalescing wasn't great, so I wrote another kernel that loads the values through shared memory. I am working on a 1024x1024 grid.

The code is below

__global__ void update_Hx(float *Hx, float *Ez, float *coef1, float* coef2){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;
    __shared__ float  Ez_shared[BLOCKSIZE_HX][BLOCKSIZE_HY + 1];
    /*int top = offset + x_index_dim;*/
    if(threadIdx.y == (blockDim.y - 1)){
        Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
        Ez_shared[threadIdx.x][threadIdx.y + 1] = Ez[offset + x_index_dim];
    }
    else{
        Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
    }
}

The constants are `BLOCKSIZE_HX = 16` and `BLOCKSIZE_HY = 16`.
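
The kernel is launched with 16x16 thread blocks covering the 1024x1024 grid, i.e. roughly like the following sketch (the device pointer names are placeholders, since the host code is not shown here):

dim3 block(BLOCKSIZE_HX, BLOCKSIZE_HY);               // 16 x 16 threads per block
dim3 grid(1024 / BLOCKSIZE_HX, 1024 / BLOCKSIZE_HY);  // 64 x 64 blocks over the 1024x1024 grid
update_Hx<<<grid, block>>>(d_Hx, d_Ez, d_coef1, d_coef2);   // d_* are assumed device pointers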

When I run the visual profiler, it still says that the memory is not coalesced.

EDIT: I am using a GT 520 graphics card with CUDA compute capability 2.1. My Global L2 transactions / Access = 7.5, i.e. there are 245 760 L2 transactions for 32768 executions of the line `Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];`.
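
Spelled out, that ratio is

245 760 L2 transactions / 32 768 executed loads = 7.5 transactions per access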

Global memory load efficiency is 50%.

Global memory load efficiency = 100 * gld_requested_throughput / gld_throughput

I am not able to figure out why there are so many memory transactions, even though my threads are accessing 16 consecutive values. Can somebody point out what I am doing wrong?

EDIT: Thanks for all the help.

catchmrbharath
  • Please provide a complete sample. `x_index_dim` is not defined in the source code you provided. With what grid and block sizes do you launch the kernel? – RoBiK Feb 11 '13 at 11:57
  • You should change your shared memory definition and usage, and also rewrite the `if` statement: `__shared__ float Ez_shared[BLOCKSIZE_HY + 1][BLOCKSIZE_HX];` (see the short sketch after these comments). – RoBiK Feb 11 '13 at 12:06
  • Just a comment which perhaps does not fully answer your question. At the page [FDTD-OrangeOwlSolutions](http://www.orangeowlsolutions.com/?s=fdtd), a solution is described avoiding the use of the `if` statement, implementing a way to fill a (BS_X+1)*(BS_Y+1) shared memory matrix by BS_X*BS_Y threads (see also the related discussion on [StackOverflow](http://stackoverflow.com/questions/13771538/moving-a-bs-x1bs-y1-global-memory-matrix-by-bs-xbs-y-threads)). – Vitality Feb 11 '13 at 12:17
  • As @RoBiK said, one issue is that threads are grouped together in x dimension first (when the machine is organizing threads into warps). Therefore, you want data accesses referencing memory elements to be in the order of `data[z][y][x]` not `data[x][y][z]` – Robert Crovella Feb 11 '13 at 14:06
  • @RoBiK I had changed the order before and had run the profiler. I still have the same issue. – catchmrbharath Feb 11 '13 at 15:51
  • @JackOLantern I had gone through the Stack Overflow discussion before. There seems to be no answer there, only a discussion of how to figure out whether the memory accesses are coalesced or not. The FDTD-OrangeOwlSolutions link is not working. – catchmrbharath Feb 11 '13 at 15:53
  • @catchmrbharath what counter or combination of counters is "Global L2 transactions / Access" ? About what kind of access are we talking about? – RoBiK Feb 11 '13 at 16:11
  • @catchmrbharath If the OrangeOwlSolution link does not work (although it seems to me it does) then you could take a look at [CUDA_Course](http://www.bu.edu/pasi/files/2011/07/Lecture31.pdf) which, however, does not deal with an electromagnetic problem, as maybe you do. In your code, each two columns access a 128-byte cache line, except for the last two, due to the `if` statement. So, the access pattern is not fully coalesced for sure. In total I think you need 14(columns)/2 + 2 (last two) = 9 gmem transactions per block. What does "Global L2 trans/Access" measure? On which GPU are you running? – Vitality Feb 11 '13 at 16:25
  • @RoBiK I am not sure either what the Global L2 transactions / Access measure is; it is something the Visual Profiler reports. There are `245 760` L2 transactions for `32768` executions of the line `Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];` – catchmrbharath Feb 12 '13 at 07:08
  • @JackOLantern Thanks for the link. It looks really helpful. I am running GT 520 M with compute capability 2.1. – catchmrbharath Feb 12 '13 at 07:09
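
To make the shared memory suggestion from the comments concrete, it amounts to roughly the following inside the kernel (an illustrative reading of the comment, reusing the question's variable names, not code from the discussion):

// Swapped layout: threadIdx.x indexes the contiguous dimension,
// and the y dimension is padded by one row for the halo.
__shared__ float Ez_shared[BLOCKSIZE_HY + 1][BLOCKSIZE_HX];

Ez_shared[threadIdx.y][threadIdx.x] = Ez[offset];   // every thread loads its own cell
if (threadIdx.y == blockDim.y - 1)                  // the last row also loads the neighbouring row outside the block
    Ez_shared[threadIdx.y + 1][threadIdx.x] = Ez[offset + x_index_dim];
__syncthreads();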

1 Answer


Your memory access pattern is the problem here. You are getting only 50% efficiency (for both L1 and L2) because you are accessing consecutive regions of 16 floats, that is 64 bytes, but the L1 transaction size is 128 bytes. This means that for every 64 bytes requested, 128 bytes must be loaded into L1 (and consequently also into L2).
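
In numbers, with 4-byte floats:

load efficiency = (16 floats * 4 bytes requested) / (128 bytes per L1 transaction) = 64 / 128 = 50%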

You also have a problem with shared memory bank conflicts, but that is currently not what is hurting your global memory load efficiency.

You could solve the load efficiency problem in several ways. The easiest is to change the block size in the x dimension to 32. If that is not an option, you could change the global memory data layout so that every two consecutive blockIdx.y values ([0, 1], [2, 3], etc.) map to a contiguous memory block. If even that is not an option and you only load the global data once anyway, you could use non-cached global memory loads to bypass L1; that would help because L2 uses 32-byte transactions, so your 64 bytes would be loaded in two L2 transactions without overhead.
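
A minimal sketch of the first option (the constant values, launch shape, and pointer names below are illustrative assumptions, not taken from the question):

// Option 1: make the block 32 threads wide so each warp reads 32
// consecutive floats of Ez, i.e. exactly one 128-byte L1 line.
#define BLOCKSIZE_HX 32   // was 16
#define BLOCKSIZE_HY 8    // keeps 256 threads per block; any reasonable y size works

dim3 block(BLOCKSIZE_HX, BLOCKSIZE_HY);               // 32 x 8 threads
dim3 grid(1024 / BLOCKSIZE_HX, 1024 / BLOCKSIZE_HY);  // 32 x 128 blocks over the 1024x1024 grid
update_Hx<<<grid, block>>>(d_Hx, d_Ez, d_coef1, d_coef2);

// Inside the kernel, the shared tile keeps the swapped layout from the
// comments, padded by one halo row:
//     __shared__ float Ez_shared[BLOCKSIZE_HY + 1][BLOCKSIZE_HX];

For the third option, non-cached global loads are normally enabled at compile time with `nvcc -Xptxas -dlcm=cg`, which makes global loads bypass L1 and use 32-byte L2 transactions.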

RoBiK
  • Changing the x dimension helped. But where do I have bank conflicts? Do you mean changing `Ez_shared[BLOCKSIZE_HX][BLOCKSIZE_HY]` to `Ez_shared[BLOCKSIZE_HY][BLOCKSIZE_HX]`? I have already done that. I am surprised that the shared memory version performs worse than the one that reads all values directly from global memory. – catchmrbharath Feb 12 '13 at 11:38
  • With direct access from global memory the kernel takes 3.259 ms per execution, while the shared memory version takes 4.139 ms. The block size used for direct access from global memory is `256x1`. – catchmrbharath Feb 12 '13 at 12:00
  • @catchmrbharath It all depends on what you do with the loaded values. If you only load and use the values once, then shared memory is unnecessary. If you want to use the values multiple times, or from threads different from the ones that loaded them, you should use shared memory. – RoBiK Feb 12 '13 at 13:09