
I started with CUDA and wrote two kernels as an experiment. They both accept three pointers to arrays of n*n floats (emulating matrices) and n.

__global__
void th_single_row_add(float* a, float* b, float* c, int n) {
  // Each thread handles one full row: n consecutive elements.
  int idx = blockDim.x * blockIdx.x * n + threadIdx.x * n;
  for (int i = 0; i < n; i++) {
    if (idx + i >= n * n) return;
    c[idx + i] = a[idx + i] + b[idx + i];
  }
}

__global__
void th_single_col_add(float* a, float* b, float* c, int n) {
  // Each thread handles one column: n elements spaced n apart.
  int idx = blockDim.x * blockIdx.x + threadIdx.x;
  for (int i = 0; i < n; i++) {
    int idx2 = idx + i * n;
    if (idx2 >= n * n) return;
    c[idx2] = a[idx2] + b[idx2];
  }
}

In th_single_row_add each thread sums a row of n elements; in th_single_col_add each thread sums a column. Here is the profile for n = 1000 (1,000,000 elements):

986.29us  th_single_row_add(float*, float*, float*, int)
372.96us  th_single_col_add(float*, float*, float*, int)

As you can see, the column sum is almost three times faster. I thought that because in the column variant all the indexes in the loop are far apart from each other it should be slower; where am I wrong?


1 Answer


Threads in CUDA don't act individually, they are grouped together in warps of 32 threads. Those 32 threads execute in lockstep (usually). An instruction issued to one thread is issued to all 32 at the same time, in the same clock cycle.

If that instruction reads memory, for example, then up to 32 independent reads may be requested. The exact pattern of addresses needed to satisfy these read operations is determined by the code you write. If those addresses are all "adjacent" in memory, that will be an efficient read. If those addresses are somehow "scattered" in memory, that will be an inefficient read, and will be slower.

This basic concept just described is called "coalesced" access in CUDA. Your column-summing case allows for coalesced access across a warp, because the addresses generated by each thread in the warp are in adjacent columns, and the locations are adjacent in memory. Your row summing case breaks this. The addresses generated by each thread in the warp are not adjacent (they are "columnar", separated from each other by the width of your array) and are therefore not "coalesced".

The difference in performance is due to this difference in memory access efficiency.

You can read more about coalescing behavior in CUDA in an introductory treatment of CUDA optimization, such as here, especially slides 44-54.

Robert Crovella
  • As I understand it, this pattern `|1|2|3|1|2|3|1|2|3|` is faster than `|1|1|1|2|2|2|3|3|3|` because on every iteration of the loop the warp requests elements that are adjacent in memory? (for this example let n = 3, warp size = 3; the numbers are threadIdx.x) – Егор Лебедев Nov 09 '19 at 20:43
  • 1
    Compute the value of `idx+i` in the first `single_row_add` case, across the warp, and compare it to the calculation for `idx2` across the warp, in the second `single_col_add` case. Considering the first iteration of the loop (`i=0`), the pattern in the first case is `0*n|1*n|2*n|3*n|...`. In the second case it is `0|1|2|3|...`. Which of those indexing patterns selects adjacent locations in memory? The second one does, the first one does not (unless `n=1`). – Robert Crovella Nov 09 '19 at 20:51
  • 3
    There's a simple rule that allows you to look at any complex indexing calculation and quickly determine if it will result in coalesced access or not. Assuming the indexing calculation involves `threadIdx.x`, if there is any multiplicative term on `threadIdx.x` (other than 1) then that is a bad pattern. We want `threadIdx.x` to be included in the indexing calculation only as an additive term, by itself. Your first example breaks this rule by multiplying `threadIdx.x` by `n`. The second example includes `threadIdx.x` by itself as an additive term, so it is a good pattern. – Robert Crovella Nov 09 '19 at 20:59