
I am having some difficulty understanding the batch loading referred to in the comments of the code below. To compute the convolution at a pixel, the mask (of width 5) must be centered on that pixel. The image is divided into tiles; after the convolution mask has been applied, these tiles become the final output tiles of size TILE_WIDTH*TILE_WIDTH. For the pixels on the border of an output tile, the mask must borrow some pixels from the neighboring tile; when those borrowed positions fall outside the borders of the image, the borrowed values are set to zero instead. These two cases are handled by

if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
    N_ds[destY][destX] = I[src];
else
    N_ds[destY][destX] = 0;
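
For example, with the posted Mask_radius = 5/2 = 2, the output pixel at global position (0, 0) needs mask neighbors from (-2, -2) up to (2, 2); the positions with negative coordinates lie outside the image, so the corresponding N_ds entries are filled with 0.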

For that reason the shared memory array has dimension TILE_WIDTH + Mask_width - 1 on each side. The following parts of the code are unclear to me.

  1. The destY and destX indices. What does dividing the flattened output index by the input tile width mean?

  2. The srcY and srcX indices.
    Why do destY and destX take part in the computation of srcY and srcX?

    srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
    srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
    
  3. Why does the second loading use the offset TILE_WIDTH * TILE_WIDTH?

  4. Generally, what is the intuitive explanation of having two loadings?

  5. Could all of these questions be answered with an intuitive example based on the image below?

  6. Thank you!

EDIT: Image added. In green are the output tiles, and in red is the mask centered on index 114. It is obvious that the mask borrows elements from different tiles. Finally, this image refers to one channel.

Example: Based on the image below I have tried to work through an example. The output tile has blockIdx.x = 1 and blockIdx.y = 1, and based on that destY = 0 and destX = 0. Also,
srcY = 1*6+0-3 = 3, srcX = 3 and src = (3*18+3)*3+0 = 171. Based on these calculations and the image example we do not have a match: the value that should be stored in the first shared memory position is the one with global index 57. What is wrong with the above calculations? Can anyone help, please?
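
A quick check of the arithmetic, assuming the pictured grid is 18 pixels wide and single-channel as stated above: the expression src = (srcY * width + srcX) * channels + k was evaluated here with channels = 3, whereas with channels = 1 it gives (3*18+3)*1+0 = 57, the expected global index.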

[Image: pixel grid with the output tiles outlined in green and the mask centered on index 114 outlined in red]

#define Mask_width  5
#define Mask_radius (Mask_width/2)
#define TILE_WIDTH 16
#define w (TILE_WIDTH + Mask_width - 1)
#define clamp(x) (min(max((x), 0.0), 1.0))

__global__ void convolution(float *I, const float* __restrict__ M, float *P,
                            int channels, int width, int height) {
   __shared__ float N_ds[w][w];
   int k;
   for (k = 0; k < channels; k++) {
      // First batch loading
      int dest = threadIdx.y * TILE_WIDTH + threadIdx.x,
         destY = dest / w, destX = dest % w,
         srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius,
         srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius,
         src = (srcY * width + srcX) * channels + k;
      if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
         N_ds[destY][destX] = I[src];
      else
         N_ds[destY][destX] = 0;

      // Second batch loading
      dest = threadIdx.y * TILE_WIDTH + threadIdx.x + TILE_WIDTH * TILE_WIDTH;
      destY = dest / w, destX = dest % w;
      srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
      srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
      src = (srcY * width + srcX) * channels + k;
      if (destY < w) {
         if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
            N_ds[destY][destX] = I[src];
         else
            N_ds[destY][destX] = 0;
      }
      __syncthreads();

      float accum = 0;
      int y, x;
      for (y = 0; y < Mask_width; y++)
         for (x = 0; x < Mask_width; x++)
            accum += N_ds[threadIdx.y + y][threadIdx.x + x] * M[y * Mask_width + x];
      y = blockIdx.y * TILE_WIDTH + threadIdx.y;
      x = blockIdx.x * TILE_WIDTH + threadIdx.x;
      if (y < height && x < width)
         P[(y * width + x) * channels + k] = clamp(accum);
      __syncthreads();
   }
}
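
The indexing in the kernel implies one TILE_WIDTH x TILE_WIDTH thread block per output tile. A minimal host-side launch sketch consistent with that assumption (the post does not show the host code; d_I, d_M and d_P are hypothetical device pointers that have already been allocated and filled):

dim3 block(TILE_WIDTH, TILE_WIDTH);
dim3 grid((width  + TILE_WIDTH - 1) / TILE_WIDTH,   // one block per output tile in x
          (height + TILE_WIDTH - 1) / TILE_WIDTH);  // one block per output tile in y
convolution<<<grid, block>>>(d_I, d_M, d_P, channels, width, height);
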
  • Ok, I think it is becoming clearer. I read that in order to go from a 1D index to 2D you do the following: Y = index1D / width, X = index1D % width. I didn't know that. – Thoth Jan 27 '14 at 18:41
  • But it doesn't divide by `TILE_WIDTH`; on the contrary, it divides by the input tile width, `w`. Why is that? – Thoth Jan 27 '14 at 19:02
  • ...because the index is in elements, not tiles. – ArchaeaSoftware Jan 28 '14 at 12:01
  • I wish I could understand you. What do you mean by "in elements"? What transformation do we get by dividing an indexing scheme built with width `TILE_WIDTH` by another tile width? Could you please be more intuitive? I have been trying to get this code clear in my mind for the last two days, but some important knowledge or depth of understanding is missing. Thanks. – Thoth Jan 28 '14 at 12:45
  • Adaptation to 3D convolution using shared memory: http://stackoverflow.com/questions/22577857/3d-convolution-with-cuda-using-shared-memory – Schnigges Mar 22 '14 at 13:01

1 Answer


Your question is similar in concept to my first question on StackOverflow: Moving a (BS_X+1)(BS_Y+1) global memory matrix by BS_X*BS_Y threads.

You are facing the following problem: each thread block of size TILE_WIDTHxTILE_WIDTH should fill a shared memory area of size (TILE_WIDTH + Mask_width - 1)x(TILE_WIDTH + Mask_width - 1).

4) Generally, what is the intuitive explanation of having two loadings?

Since the shared memory area of (TILE_WIDTH + Mask_width - 1)x(TILE_WIDTH + Mask_width - 1) elements is larger than the block size TILE_WIDTHxTILE_WIDTH, and assuming it is smaller than 2xTILE_WIDTHxTILE_WIDTH, each thread has to move at most two elements from global memory to shared memory. This is the reason why you have a two-stage loading.
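
With the posted parameters this condition holds: the shared area has (16 + 5 - 1) x (16 + 5 - 1) = 20 x 20 = 400 elements, while the block has 16 x 16 = 256 threads, and 256 < 400 <= 2 x 256 = 512, so two loading stages are enough.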

1) The destY and destX indices. What does dividing the flattened output index by the input tile width mean?

This concerns the first loading stage, which loads TILE_WIDTHxTILE_WIDTH elements from global memory and fills the uppermost part of the shared memory area.

So, the operation

dest = threadIdx.y * TILE_WIDTH + threadIdx.x;

flattens the 2D coordinates of the generic thread while

destX = dest % w;
destY = dest / w; 

performs the inverse operation: it calculates the 2D coordinates of the generic thread with respect to the shared memory area.
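
As a concrete example with TILE_WIDTH = 16 and w = 20, take the thread with threadIdx.y = 3 and threadIdx.x = 5 in the first loading stage:

dest  = 3 * 16 + 5 = 53
destY = 53 / 20    = 2
destX = 53 % 20    = 13

so this thread fills N_ds[2][13], even though its 2D coordinates within the block are (3, 5).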

2) The srcY and srcX indices. Why do destY and destX take part in the computation of srcY and srcX?

srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;

srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;

(blockIdx.x * TILE_WIDTH, blockIdx.y * TILE_WIDTH) would be the coordinates of the global memory location if the block size and the shared memory size were the same. Since you are also "borrowing" memory values from neighboring tiles, you have to shift the above coordinates by (destX - Mask_radius, destY - Mask_radius).
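
As a concrete example (Mask_radius = 5/2 = 2): for the block with blockIdx.x = blockIdx.y = 1, the thread with destY = destX = 0 reads srcY = 1 * 16 + 0 - 2 = 14 and srcX = 14, i.e. the top-left corner of the shared tile starts two rows and two columns before the block's first output pixel at (16, 16). For the block with blockIdx.x = blockIdx.y = 0, the same thread would get srcY = srcX = -2, which fails the bounds check, so a zero is stored instead.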

3) Why does the second loading use the offset TILE_WIDTH * TILE_WIDTH?

You have this offset because in the first loading stage you have already filled the "first" TILE_WIDTHxTILE_WIDTH locations of the shared memory.
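
For instance, in the second stage the thread with flattened index threadIdx.y * TILE_WIDTH + threadIdx.x = 0 gets dest = 0 + 256 = 256, hence destY = 256 / 20 = 12 and destX = 256 % 20 = 16; the thread with flattened index 143 gets dest = 399, i.e. destY = destX = 19, the last shared memory location; threads with flattened index 144 or higher get dest >= 400, fail the destY < w test, and write nothing in this stage.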

EDIT

The picture below illustrates the correspondence between the flattened thread index dest and the shared memory locations. In the picture, the blue boxes represent the elements of the generic tile, while the red boxes represent the elements of the neighboring tiles. The union of the blue and red boxes corresponds to the overall shared memory locations. As you can see, all 256 threads of a thread block are involved in filling the upper part of the shared memory above the green line, while only 144 are involved in filling the lower part of the shared memory below the green line. Now you should also understand the TILE_WIDTH x TILE_WIDTH offset.

Please notice that you have at most 2 memory loads per thread only due to the particular choice of your parameters. For example, if you had TILE_WIDTH = 8, then a thread block would contain 64 threads, while the shared memory tile would have 12x12 = 144 elements; since 144/64 = 2.25, some threads would have to perform 3 shared memory writes and two loading stages would no longer be enough.

[Image: the w x w shared memory layout, with the tile elements in blue, the elements borrowed from neighboring tiles in red, and a green line separating the locations filled by the first loading stage from those filled by the second]
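
As a side note to the remark on the parameter choice above, the dependence on the parameters can be removed altogether: a single loop in which each thread strides through the shared tile by the block size handles any ratio of shared elements to threads. This is only a sketch of that alternative, not part of the original kernel; it assumes the same macros and the same surrounding per-channel loop over k:

// Replaces both batch-loading stages inside the kernel.
for (int dest = threadIdx.y * TILE_WIDTH + threadIdx.x; dest < w * w;
     dest += TILE_WIDTH * TILE_WIDTH) {
   int destY = dest / w, destX = dest % w;
   int srcY  = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
   int srcX  = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
   // Out-of-image positions are zero-padded, exactly as in the two-stage version.
   N_ds[destY][destX] = (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
                        ? I[(srcY * width + srcX) * channels + k]
                        : 0.0f;
}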

  • @JackOLantern thanks for your answer! I will come back with comments as soon as possible :) Thanks again. – Thoth Jan 30 '14 at 23:07
  • 1
    In `4)` you mention that each "thread move two elements at most form global memory". Could you please give an example of this? I am not sure I have understand it. In `1)` you mention "TILE_WIDTHxTILE_WIDTH elements from global memory and fills the uppermost part of the shared memory area", in the grid above which elements are those? In `3)` you mention "first" TILE_WIDTHxTILE_WIDTH" locations. Would you mind give an arithmetic example for this too? Your help is invaluable to understand this concepts!Thank you! – Thoth Jan 31 '14 at 20:30
  • Hi, the picture is missing:) – Thoth Feb 02 '14 at 22:11
  • @Thoth I apologize. I fixed the problem. – Vitality Feb 03 '14 at 08:43