I have a compute capability 1.3 GPU. According to the documentation, when the threads of a half-warp access bytes within the same 32-, 64-, or 128-byte memory segment (the segment size depends on the word size), these accesses are coalesced into a single memory transaction.
However, for a two-dimensional array allocated with cudaMallocPitch(), when the threads of a half-warp access consecutive bytes within a row, is it guaranteed that these bytes reside in the same memory segment?
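To make the question concrete, here is a minimal host-side sketch in plain C++ (no GPU required) of the arithmetic I have in mind. It counts how many segments a half-warp touches when its 16 threads read consecutive words from one row. The helper name `segments_touched`, the base address, and the pitch values are my own assumptions for illustration; they are not what cudaMallocPitch() is documented to return.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper, not part of the CUDA API.
// Counts the distinct memory segments touched when the 16 threads of a
// half-warp each read one word, at consecutive columns of a single row
// of a pitched 2D array.
std::size_t segments_touched(std::size_t base,         // byte address of row 0 (assumed)
                             std::size_t pitch,        // row stride in bytes (assumed)
                             std::size_t row,          // row being accessed
                             std::size_t first_col,    // column read by thread 0
                             std::size_t word_size,    // bytes per element
                             std::size_t segment_size) // 32, 64 or 128 bytes
{
    const std::size_t half_warp = 16;
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < half_warp; ++t) {
        // Byte address read by thread t of the half-warp.
        std::size_t addr = base + row * pitch + (first_col + t) * word_size;
        // Segments are aligned to their own size, so integer division
        // gives the index of the segment containing this address.
        segments.insert(addr / segment_size);
    }
    return segments.size();
}
```

With 4-byte words, 128-byte segments, and a base address of 0, an assumed pitch of 512 bytes keeps the first half-warp of row 3 inside one segment (`segments_touched(0, 512, 3, 0, 4, 128)` is 1), whereas an unpadded pitch of 500 bytes splits the same access across two (`segments_touched(0, 500, 3, 0, 4, 128)` is 2). So the answer seems to hinge on which pitch and base alignment cudaMallocPitch() actually guarantees.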
There is a similar question, "CUDA coalesced access to global memory", but it does not cover 2D arrays on compute capability 1.3 GPUs.