Suppose many warps in a block of a CUDA kernel's grid are repeatedly updating a fair-sized number of shared memory locations.
In which of the following cases would such work complete faster?
- The case of intra-warp access locality, i.e. when the total number of memory positions accessed by each warp is small and most of them are accessed by multiple lanes (see the sketch after this list)
- The case of access anti-locality, where the lanes of each warp typically access distinct positions (perhaps with some effort made to avoid bank conflicts)?
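
To make the two cases concrete, here is a minimal CUDA sketch of what I mean. The kernel names, the slot count, and the indexing formulas are all illustrative assumptions rather than code from any real application; atomics are used so that concurrent updates of the same address are well-defined, but plain stores would raise the same bank-behavior question:

```cuda
#include <cuda_runtime.h>

constexpr int kNumSlots = 1024;  // the "fair-sized" shared-memory region (illustrative)

// Case 1: intra-warp locality -- all lanes of a warp repeatedly update
// the same few positions, so many lanes hit the same address at once.
__global__ void warp_local_updates(const int* __restrict__ in,
                                   int* __restrict__ out, int iters)
{
    __shared__ int slots[kNumSlots];
    for (int s = threadIdx.x; s < kNumSlots; s += blockDim.x) slots[s] = 0;
    __syncthreads();

    const int warp_id = threadIdx.x / 32;
    for (int i = 0; i < iters; ++i) {
        // Every lane of the warp targets one of the same 4 positions.
        int pos = (warp_id * 4 + (i & 3)) % kNumSlots;
        atomicAdd(&slots[pos], in[threadIdx.x]);
    }
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = slots[0];  // keep the work observable
}

// Case 2: anti-locality -- each lane updates its own position; consecutive
// int addresses fall in distinct banks, so lanes avoid bank conflicts.
__global__ void lane_distinct_updates(const int* __restrict__ in,
                                      int* __restrict__ out, int iters)
{
    __shared__ int slots[kNumSlots];
    for (int s = threadIdx.x; s < kNumSlots; s += blockDim.x) slots[s] = 0;
    __syncthreads();

    const int lane = threadIdx.x % 32;
    const int warp_id = threadIdx.x / 32;
    for (int i = 0; i < iters; ++i) {
        // Lane L of each warp always lands in bank L: distinct positions per lane.
        int pos = (warp_id * 32 + lane) % kNumSlots;
        atomicAdd(&slots[pos], in[threadIdx.x]);
    }
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = slots[0];
}
```
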
And, no less importantly: is the answer microarchitecture-dependent, or is the behavior essentially the same across all recent NVIDIA microarchitectures?