Aligning GPU memory accesses of an image convolution (OpenCL/CUDA) kernel

Question

To understand how to make sure the alignment requirement is met, I read the following passage from the book Heterogeneous Computing with OpenCL (p. 157) several times. It shows how to pad the input for an image convolution problem (assuming a 16 x 16 workgroup size).

Aligning for Memory Accesses

Performance on both NVIDIA and AMD GPUs benefits from data alignment in global memory. Particularly for NVIDIA, aligning accesses on 128-byte boundaries and accessing 128-byte segments will map ideally to the memory hardware. However, in this example, the 16-wide workgroups will only be accessing 64-byte segments, so data should be aligned to 64-byte addresses. This means that the first column that each workgroup accesses should begin at a 64-byte aligned address. In this example, the choice to have the border pixels not produce values determines that the offset for all workgroups will be a multiple of the workgroup dimensions (i.e., for a 16 x 16 workgroup, workgroup will begin accessing data at column N*16). To ensure that each workgroup aligns properly, the only requirement then is to pad the input data with extra columns so that its width becomes a multiple of the X-dimension of the workgroup.

1. Can anybody help me understand how, after padding, the first column that each workgroup accesses begins at a 64-byte aligned address (the requirement mentioned in the above passage, right?)?

2. Also, with respect to the figure, is this statement correct: for a 16 x 16 workgroup, the workgroup will begin accessing data at column N*16?

If it is correct, workgroup 1,2 as shown in the figure should start accessing data at column 1x16, contrary to what is shown in the figure. I am totally confused!! :(

Update: Q2 is now clear to me. The workgroup shown in the figure is actually 2,1 (in the OpenCL convention, column first), so it is perfectly correct: 2x16=32 and not 1x16 as I was thinking.

But question no. 1 is still unanswered.

[Figure not reproduced]

score 8 · Accepted Answer · edited May 23 '17 at 12:28

For a convolution kernel, each region (e.g. region (0,1) or region (2,1) etc.) must also include a "halo" of data around it, so that when the convolution kernel is operating on a data element at the edge of the region, it has a suitable set of neighbors of that data element to compute the convolution at that data point. This means that for region (0,0), which has data element (0,0) in its upper-left corner, I need elements (-1, 0), (-2, 0) etc. in order to compute the convolution at element (0,0).

Now, if I store the image normally, so that element (0,0) is at memory location 0 (or some other 64-byte aligned address), then as I access elements prior to that point for the convolution, I will be accessing data outside of my 64-byte aligned region. Therefore we can "pad" the leftmost column of the image with additional data elements "to the left", i.e. at lower addresses, so that the convolution kernel picks up values that are all within the 64-byte aligned region, and I am not straddling a 64-byte boundary. Rather than starting the image storage at memory location 0, we start the halo storage at memory location 0, and the first image data element begins at location 0 + halo width. This padding can also have the effect of aligning the halo border of other regions, as indicated by the intersection of the red dotted lines in the image, assuming the x and y dimensions of the region are multiples of the workgroup x and y dimensions, as indicated in the diagram.

Now let's also assume that the image has some non-power-of-2 width (e.g. 1920 pixels wide for an HD image). If we were simply to include the halo width as padding at the right-hand side of the image (i.e. at the end of a pixel row), and then start the halo area of the next pixel row immediately after it, the next pixel row (including its halo) would be unlikely to start at a properly aligned address. Therefore we put additional padding at the end of each row (it is not touched by any convolution operation; it's just wasted space) so that the halo area of the next pixel row begins on a properly aligned address.
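As a rough illustration of the row padding described above, here is a minimal C sketch of the host-side arithmetic. The function names, the halo width of 2 and the 4-byte pixel size used in the note below are my assumptions for the example, not something taken from the book, and vertical halo rows are left out to keep it short.

```c
#include <stddef.h>

/* Round n up to the next multiple of m (m must be > 0). */
size_t round_up(size_t n, size_t m) {
    return ((n + m - 1) / m) * m;
}

/* Row pitch in elements: image width plus a halo on each side,
 * then padded so that every row starts on an aligned address.
 * align_elems is the alignment expressed in elements
 * (assumption: 16 elements of 4 bytes = 64 bytes). */
size_t padded_pitch(size_t width, size_t halo, size_t align_elems) {
    return round_up(width + 2 * halo, align_elems);
}

/* Linear index of image pixel (x, y) in the padded buffer.
 * Halo storage starts at index 0 of each row, so pixel (0, 0)
 * sits at index `halo`. Vertical halo rows are omitted here. */
size_t padded_index(size_t x, size_t y, size_t halo, size_t pitch) {
    return y * pitch + halo + x;
}
```

With width = 1920, halo = 2 and an alignment of 16 elements, padded_pitch gives 1936 elements. For 4-byte pixels that is 7744 bytes per row, a multiple of 64, so every row starts 64-byte aligned as long as the buffer itself does. The cost is 12 unused elements per row.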

This discussion and method (and question, I believe) is really only focused on making sure the starting address of each workgroup data access is aligned. As long as the starting address of the first workgroup is aligned (through suitable padding and adjustment of the image storage in memory), and the workgroups have appropriate dimensions (e.g. 16 wide, with 4 bytes per worker), then the starting address of the next workgroup will be aligned also. There will of course be overlap between data accesses of adjacent workgroups, as the halo region for adjacent workgroups is overlapping.
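To make that claim concrete, the following C sketch counts how many row accesses of 16-wide workgroups start off a 64-byte boundary. It assumes the buffer base is itself 64-byte aligned, that offsets are measured from that base, and that the first element a workgroup reads in a row is at column wg_x * wg_w of the padded row. The function names and parameters are hypothetical, chosen for this example only.

```c
#include <stddef.h>

/* Byte offset, relative to the buffer base, of the first element that
 * workgroup column wg_x reads in a given row of the padded buffer. */
size_t row_start_bytes(size_t wg_x, size_t row, size_t wg_w,
                       size_t pitch_elems, size_t elem_size) {
    return (row * pitch_elems + wg_x * wg_w) * elem_size;
}

/* Count the (row, workgroup column) pairs whose first access does not
 * fall on an `align`-byte boundary. Zero means every access is aligned. */
size_t count_misaligned(size_t groups_x, size_t rows, size_t wg_w,
                        size_t pitch_elems, size_t elem_size, size_t align) {
    size_t bad = 0;
    for (size_t row = 0; row < rows; ++row) {
        for (size_t gx = 0; gx < groups_x; ++gx) {
            if (row_start_bytes(gx, row, wg_w, pitch_elems, elem_size) % align != 0) {
                ++bad;
            }
        }
    }
    return bad;
}
```

For a 1920 x 1080 image with 4-byte pixels and 120 workgroup columns, a pitch of 1936 elements (a multiple of 16) gives a count of zero. A pitch of 1924 elements (width plus a halo of 2 on each side, with no extra row padding) gives a non-zero count, because most rows then start at an offset that is not a multiple of 64 bytes.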

Alignment as I'm using it here has a fairly simple definition. An address in memory is 2**n-byte aligned if the least significant n bits of the address are all zero. Therefore a 64-byte aligned region has a starting address whose 6 least significant bits are all zero. This is generally useful on these architectures to satisfy the memory load and store requirements of the architecture, and in particular of the DRAM subsystems they contain. Modern DRAM memory bank accesses always return multiple bytes, and so we make the most effective use of the transfer if we use all those bytes at the same time, in the same place in the code. For additional coverage of alignment and the effect it has on coalescing and on data access performance, you might be interested in this webinar (and slides) from the NVIDIA webinar page. For a quick look, slides 26-33 of this presentation cover the basic ideas.
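The definition above translates directly into a bit mask test. This is only a small C sketch with helper names I made up for illustration; n is assumed to be smaller than the bit width of uintptr_t.

```c
#include <stdint.h>

/* Returns 1 if addr is 2**n-byte aligned, i.e. its n least
 * significant bits are all zero, and 0 otherwise. */
int is_aligned_pow2(uintptr_t addr, unsigned n) {
    uintptr_t mask = ((uintptr_t)1 << n) - 1;  /* n low bits set */
    return (addr & mask) == 0;
}

/* Rounds addr up to the next 2**n-byte boundary
 * (returns addr unchanged if it is already aligned). */
uintptr_t align_up_pow2(uintptr_t addr, unsigned n) {
    uintptr_t mask = ((uintptr_t)1 << n) - 1;
    return (addr + mask) & ~mask;
}
```

For 64-byte alignment n is 6, so address 0x1040 passes the test (its low 6 bits are zero) while 0x1010 does not, and rounding 0x1001 up gives 0x1040.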

Thanks for your reply. What is 64-byte boundary and what is 64-byte alignment? — gpuguy, Nov 01 '12 at 08:16
For the sake of everyone's eyes, please edit some paragraphs into this answer! — talonmies, Nov 01 '12 at 09:31
I've edited the answer to add paragraph formatting and also address @gpuguy's question — Robert Crovella, Nov 01 '12 at 23:01
@RobertCrovella Nice explanation! That was exactly I wanted to understand — gpuguy, Dec 15 '12 at 08:16
