To understand how to make sure the alignment requirement is met, I read the following passage from the book Heterogeneous Computing with OpenCL (p. 157) several times. It shows how to pad the input for an image convolution problem (assuming a 16 x 16 workgroup size).
Aligning for Memory Accesses
Performance on both NVIDIA and AMD GPUs benefits from data alignment in global memory. Particularly for NVIDIA, aligning accesses on 128-byte boundaries and accessing 128-byte segments will map ideally to the memory hardware. However, in this example, the 16-wide workgroups will only be accessing 64-byte segments, so data should be aligned to 64-byte addresses. This means that the first column that each workgroup accesses should begin at a 64-byte aligned address. In this example, the choice to have the border pixels not produce values determines that the offset for all workgroups will be a multiple of the workgroup dimensions (i.e., for a 16 x 16 workgroup, workgroup will begin accessing data at column N*16). To ensure that each workgroup aligns properly, the only requirement then is to pad the input data with extra columns so that its width becomes a multiple of the X-dimension of the workgroup.
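To check my reading of the passage, I sketched the arithmetic in Python. This is only a sketch under assumptions of mine that the passage does not state outright: 4-byte (float) pixels, which is what would make 16 columns equal 64 bytes, a row-major buffer, and a buffer base address that is itself 64-byte aligned. The function names and the sample width of 1000 are hypothetical.

```python
WG_X = 16                        # workgroup X-dimension from the passage
BYTES_PER_PIXEL = 4              # assumption: one 32-bit float per pixel
ALIGN = WG_X * BYTES_PER_PIXEL   # 64-byte segments


def padded_width(width, wg_x=WG_X):
    """Round the image width up to the next multiple of the workgroup width."""
    return ((width + wg_x - 1) // wg_x) * wg_x


def first_access_offset(row, n, row_width):
    """Byte offset (relative to the buffer base) of column n*16 in `row`,
    for a row-major image that is `row_width` pixels wide."""
    return (row * row_width + n * WG_X) * BYTES_PER_PIXEL


width = 1000                     # hypothetical unpadded width in pixels
pw = padded_width(width)         # 1008, a multiple of 16

# With padding, the row stride is a multiple of 64 bytes, so column n*16
# of every row falls on a 64-byte boundary.
for row in (0, 1, 17):
    for n in (0, 1, 2):
        assert first_access_offset(row, n, pw) % ALIGN == 0

# Without padding, rows after the first drift off the boundary.
print(first_access_offset(1, 0, width) % ALIGN)   # 32
```

If this reading is right, padding matters because it makes the row stride (width times 4 bytes) a multiple of 64, so the N*16 column offset stays aligned in every row and not just in row 0. I am not certain this is the whole reasoning the book intends.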
1-Can anybody help me understand how, after padding, the first column that each workgroup accesses begins at a 64-byte aligned address (which is the requirement stated in the passage above, right)?
2-Also, with respect to the figure, is this statement correct: for a 16 x 16 workgroup, the workgroup will begin accessing data at column N*16?
If it is correct, then workgroup 1,2 as shown in the figure should start accessing data at column 1x16, contrary to what the figure shows. I am totally confused!! :(
Update: Q-2 is now clear to me. The workgroup shown in the figure is actually 2,1 (in OpenCL convention, column first), so the figure is correct: 2x16=32, not 1x16 as I was thinking.
But question no. 1 is still unanswered.