CUDA: are half-warp accesses to consecutive bytes of the global memory guaranteed to be coalesced?

Question

I have a compute capability 1.3 GPU. Based on the documentation, when threads of the same half-warp access bytes from the same 32- 64- or 128-bytes memory segment depending on the word size, these memory accesses are coalesced into one.

However, in the case of a two-dimensional array allocated using cudaMallocPitch(), when threads of the same half-warp access consecutive bytes, is it guaranteed that these bytes reside to the same memory segment?

There is a similar question at CUDA coalesced access to global memory but does not cover compute capability 1.3 GPUs with 2D arrays.

score 2 · Answer 1 · edited Feb 21 '13 at 10:51

2

Yes - cudaMallocPitch() mainly exists to make sure that coalescing behaviors persist from one row to the next. The criteria for coalescing are per-warp, so they are much finer-grained and pertain to consecutive bytes within a row of a 2D array.

Note that the operand size must be at least 32 bits, or coalescing does not work at all.

edited Feb 21 '13 at 10:51

sgarizvi

16,623
9
64
98

answered Feb 16 '13 at 16:01

ArchaeaSoftware

4,332
16
21

"Note that the operand size must be at least 32 bits, or coalescing does not work at all." That sentence applies to compute capability 1.0 and 1.1, but not to newer GPUs. – tera Feb 22 '13 at 11:13
Tera, I guess it depends on how you define "coalescing." 8- and 16-bit accesses are much slower on every level of CUDA hardware. On my GK104, the disparity is 2.4x for 8-bit accesses and 40% (100GB/s versus 142GB/s for 64-bit accesses) for 16-bit accesses. https://github.com/ArchaeaSoftware/cudahandbook/blob/master/memory/globalRead.cu – ArchaeaSoftware Feb 25 '13 at 06:31
Of course 8- and 16-bit accesses are much slower, because on CC2.0+ they waste part of the cacheline (with the cache being too small to guarantee a 100% reuse) while on CC 1.2/1.3 they don't use memory bursts as efficiently (and in the case of 8-bit accesses waste half of the memory bandwidth because the amount of data to transfer is less than the minimum transaction width). Nevertheless the accesses of a half-warp still get coalesced into a single transaction (provided the address requirements are met). – tera Feb 26 '13 at 14:07

CUDA: are half-warp accesses to consecutive bytes of the global memory guaranteed to be coalesced?

1 Answers1