
I have a little confusion about bank conflicts, avoiding them using memory padding, and coalesced memory access. What I've read so far: coalesced access to global memory is optimal. If it isn't achievable, shared memory can be used to reorder the data needed by the current block, making coalesced access possible. However, when using shared memory one has to watch out for bank conflicts. One strategy to avoid bank conflicts is to pad the arrays stored in shared memory by one element per row. Consider the example from this blog post where each row of a 16x16 matrix is padded by 1, making it a 16x17 matrix in shared memory.
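To make the padding arithmetic concrete, here is a small Python sketch of the bank-index calculation (the `bank` helper is my own simplified model, assuming the usual layout of 32 banks of 4-byte words on current NVIDIA GPUs):

```python
# Shared memory bank of a 4-byte word at linear word index `idx`,
# assuming 32 banks of 4-byte words.
def bank(idx):
    return idx % 32

# Column access in a 16x16 tile: thread i reads tile[i][c],
# i.e. word i*16 + c. Stride 16 hits only two distinct banks.
unpadded = {bank(i * 16 + 0) for i in range(16)}

# Same column access in a padded 16x17 tile: word i*17 + c.
# Stride 17 is coprime to 32, so all 16 accesses hit distinct banks.
padded = {bank(i * 17 + 0) for i in range(16)}

print(len(unpadded), len(padded))  # 2 16
```

The padding only changes how the data is laid out across banks; it does not change which elements are stored.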

Now I understand that memory padding can avoid bank conflicts, but doesn't that also mean the memory is not aligned anymore? E.g. if I shift global memory by 1, thus misaligning it, one warp would need to access two memory segments instead of one, because the last number would not lie in the same segment as all the others. So to my understanding, coalesced memory access and memory padding are contradictory concepts, aren't they? Some clarification is appreciated very much!

SimonH
    Uncoalesced access to global memory is very expensive. In shared memory this is less of a problem (if at all) than bank conflicts. – paleonix Jan 19 '22 at 18:44
  • @PaulG. Thanks for your comment. Do you have any references for that? E.g. is it officially stated by nvidia or is there some kind of study? – SimonH Jan 20 '22 at 07:16
  • [This](https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#cuda-best-practices) is specifically for Ampere, but the documents for older architectures say the same. Coalescing is mentioned explicitly in the context of global memory. Other mentions are not as explicit, but I have not found (quick search) any which explicitly mention shared memory. – paleonix Jan 20 '22 at 09:27
  • Thanks! I also didn't find any explicit statements on this topic, that's why I'm asking this on SO. :) In the meantime I found another blog post which clarifies my mind a bit (but not fully), but that's too long for a comment so I'll make it an incomplete answer. – SimonH Jan 20 '22 at 11:03
  • BTW you can also correct the data layout with the warp shuffle instructions (which are kind of done by the shared memory unit, too, just without actually storing the data). You would read the data (probably more than one record) in a coalesced way and then reshuffle it among the threads as you actually need it. The reverse way for storing. – Sebastian Jan 20 '22 at 11:23
  • You can use Nsight Compute for getting definite answers about your code. For shared memory only the number of used lanes is important, whereby accessing the same element (and not only several elements in the same lane) counts only once. So neither alignment (except the 4 bytes for int/float) nor continuity of the accessed memory addresses is an issue with shared memory. – Sebastian Jan 20 '22 at 11:26
  • @Sebastian Thanks! I didn't know about the shuffle instructions yet, and will definitely take a look at them (especially whether they are available in OpenCL, too). I already used Nsight Compute, but of course knowing the details first-hand is better than drawing my own conclusions from a black-box test. :) Still a great tool of course. – SimonH Jan 20 '22 at 11:49
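To illustrate the shuffle idea from the comments, here is a toy Python model of what a warp-wide shuffle does (purely illustrative; on a real GPU this is the `__shfl_sync` intrinsic, and the helper names here are made up):

```python
# Toy model of a warp shuffle: each of the 32 lanes holds one register
# value, and a shuffle lets every lane read another lane's register
# without going through shared memory at all.
WARP_SIZE = 32

def shfl(vals, src_of):
    # vals[lane] is lane's register; src_of(lane) picks the source lane.
    return [vals[src_of(lane)] for lane in range(WARP_SIZE)]

# Example: reverse the data layout within a warp after a coalesced load.
regs = list(range(WARP_SIZE))
reversed_regs = shfl(regs, lambda lane: WARP_SIZE - 1 - lane)
assert reversed_regs == list(reversed(regs))
```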

1 Answer

Too long for a comment so I'm putting it here. Still not a complete answer though.

In the meantime I found this post by Mark Harris which demonstrates the use of shared memory to facilitate coalesced memory access. The important takeaway for this question seems to be:

The reason shared memory is used in this example is to facilitate global memory coalescing on older CUDA devices (Compute Capability 1.1 or earlier). Optimal global memory coalescing is achieved for both reads and writes because global memory is always accessed through the linear, aligned index t. The reversed index tr is only used to access shared memory, which does not have the sequential access restrictions of global memory for optimal performance. The only performance issue with shared memory is bank conflicts, which we will discuss later.

My initial understanding was that if coalesced access to global memory is not possible, the data is read uncoalesced and then reordered in shared memory so that further accesses from shared memory can be coalesced. But instead, the data is read from global memory in a contiguous fashion, and the actual data needed is then read from shared memory in a non-coalesced way. Harris also states that uncoalesced access from shared memory is not a problem, but unfortunately the post doesn't explain why.
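The access pattern Harris describes can be sketched without a GPU. The following Python model stages a block-wide array reversal through a "shared" buffer: both the global read and the global write use the linear index `t`, and only the shared-memory read uses the reversed index `tr`. The names `t`, `tr`, and `s` mirror the blog post; the rest is my own illustrative scaffolding:

```python
n = 64  # one block of n threads, as in the staticReverse example

def static_reverse(d):
    # Stage 1: every "thread" t loads d[t] into shared memory s,
    # using the linear, coalesced-friendly index t.
    s = [d[t] for t in range(n)]
    # (__syncthreads() would go here on a real device.)
    # Stage 2: the reversed index tr = n - t - 1 is only used on s;
    # the global write d[t] stays linear and coalesced.
    return [s[n - t - 1] for t in range(n)]

data = list(range(n))
assert static_reverse(data) == list(reversed(data))
```

The scrambled indexing is thereby confined to shared memory, where (apart from bank conflicts) the access order does not matter for performance.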

SimonH