I started with CUDA and wrote two kernels as an experiment. They both accept three pointers to arrays of n*n floats (emulating an n-by-n matrix) and n.
__global__
void th_single_row_add(float* a, float* b, float* c, int n) {
    // Each thread starts at the first element of its own row.
    int idx = (blockDim.x * blockIdx.x + threadIdx.x) * n;
    for (int i = 0; i < n; i++) {
        if (idx + i >= n * n) return; // past the end of the array
        c[idx + i] = a[idx + i] + b[idx + i]; // consecutive elements of the row
    }
}
__global__
void th_single_col_add(float* a, float* b, float* c, int n) {
    // Each thread starts at the top element of its own column.
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = 0; i < n; i++) {
        int idx2 = idx + i * n; // step down the column, stride n
        if (idx2 >= n * n) return; // past the end of the array
        c[idx2] = a[idx2] + b[idx2];
    }
}
In th_single_row_add each thread adds one row of n elements; in th_single_col_add each thread adds one column of n elements.
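Both kernels use one thread per row (or column), so n threads in total are enough. For completeness, a minimal launch looks roughly like this (the 256-thread block size is just an example, not necessarily what I profiled with):

    int n = 1000;
    size_t bytes = (size_t)n * n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc((void**)&a, bytes);
    cudaMalloc((void**)&b, bytes);
    cudaMalloc((void**)&c, bytes);
    // One thread per row/column: n threads, rounded up to whole blocks.
    int block = 256;
    int grid = (n + block - 1) / block;
    th_single_row_add<<<grid, block>>>(a, b, c, n);
    th_single_col_add<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();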
Here is the profile for n = 1000 (1,000,000 elements):
986.29us th_single_row_add(float*, float*, float*, int)
372.96us th_single_col_add(float*, float*, float*, int)
As you can see, the column sum is almost three times faster. I expected the opposite: in the column variant the indexes a thread touches in the loop are far from each other in memory, so I thought it should be slower. Where am I wrong?
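To make "far from each other" concrete, this is the index sequence each thread t (t = blockDim.x * blockIdx.x + threadIdx.x) walks through, read straight off the code above:

    // th_single_row_add: t*n, t*n + 1, t*n + 2, ..., t*n + (n-1)  -> consecutive
    // th_single_col_add: t,   t + n,   t + 2*n,  ..., t + (n-1)*n  -> stride n

So within one thread the row kernel reads consecutive addresses while the column kernel jumps n floats per iteration, which is why I expected the row kernel to win.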