The index calculations are much nicer if you make use of 3D blocks and a 3D grid:
__global__ void AddKernel(float* a, float* b,
                          int first_dim, int second_dim, int third_dim,
                          int feature_dim) {
    // first_dim is 60000, second_dim is 20, third_dim is 64
    const int idz = blockIdx.z; // * blockDim.z + threadIdx.z;
    const int idy = blockIdx.y * blockDim.y + threadIdx.y;
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idz >= first_dim || idy >= second_dim || idx >= third_dim) return;
    const int ida = (idz * second_dim + idy) * third_dim + idx;
    const int idb = idz * feature_dim + idx;
    // idx indexing the linear dimension is important for coalescing
    // 60000 * 20 * 64 + 60000 * 1 * 64
    a[ida] += b[idb];
}
int main() {
    ...
    // x-component should be multiple of warp size (32) for coalescing
    // ideal value for y-dimension should be tuned through benchmarking
    dim3 block_dims{32, 16, 1};
    dim3 grid_dims{(third_dim + block_dims.x - 1) / block_dims.x, // 2
                   (second_dim + block_dims.y - 1) / block_dims.y, // 2
                   first_dim};
    AddKernel<<<grid_dims, block_dims>>>(...);
    ...
}
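For reference, here is a minimal sketch of what the elided host-side setup could look like. The buffer names, the initialization, and feature_dim == 64 (i.e. b stored as 60000 x 64 and broadcast along second_dim) are my assumptions, not part of the original code:

#include <cuda_runtime.h>

int main() {
    // sizes from the comments above; feature_dim == 64 is an assumption
    const int first_dim = 60000, second_dim = 20, third_dim = 64, feature_dim = 64;
    const size_t a_elems = size_t(first_dim) * second_dim * third_dim;
    const size_t b_elems = size_t(first_dim) * feature_dim;

    float *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, a_elems * sizeof(float));
    cudaMalloc(&d_b, b_elems * sizeof(float));
    // ... fill d_a and d_b, e.g. via cudaMemcpy from host buffers ...

    dim3 block_dims{32, 16, 1};
    dim3 grid_dims{(third_dim + block_dims.x - 1) / block_dims.x,
                   (second_dim + block_dims.y - 1) / block_dims.y,
                   unsigned(first_dim)};
    AddKernel<<<grid_dims, block_dims>>>(d_a, d_b, first_dim, second_dim,
                                         third_dim, feature_dim);
    cudaDeviceSynchronize();

    cudaFree(d_a);
    cudaFree(d_b);
}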
Block Dimensions, Caching and Occupancy
block_dims.y == 20 would be ideal for L1-caching of values from b when second_dim == 20, but it might negatively impact occupancy, i.e. the number of blocks that can be scheduled to SMs in parallel. That is why I would recommend trying out a few launch configurations yourself. A look at the occupancy calculator in the Nsight Compute profiler might also be informative.
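If you want a quick number before firing up the profiler, the runtime API can also report the theoretical occupancy for a candidate block size. Below is a minimal sketch, assuming the AddKernel defined above is in the same translation unit; the candidate y-dimensions are just examples:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void AddKernel(float*, float*, int, int, int, int); // kernel from above

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int candidate_y[] = {4, 8, 16, 20, 32}; // x stays fixed at the warp size
    for (int block_y : candidate_y) {
        const int threads_per_block = 32 * block_y;
        int blocks_per_sm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, AddKernel, threads_per_block, /*dynamicSmemSize=*/0);
        const float occupancy = float(blocks_per_sm * threads_per_block) /
                                float(prop.maxThreadsPerMultiProcessor);
        std::printf("32 x %2d: %d blocks/SM, theoretical occupancy %.2f\n",
                    block_y, blocks_per_sm, occupancy);
    }
}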
Parallelism and Computational Intensity
If there is enough parallelism (which seems to be the case here), doing a little more sequential work per thread can improve performance by increasing computational intensity, as more global loads are queued per thread:
__global__ void AddKernel(float* a, float* b,
                          int first_dim, int second_dim, int third_dim,
                          int feature_dim) {
    // first_dim is 60000, second_dim is 20, third_dim is 64
    const int idz = blockIdx.z; // * blockDim.z + threadIdx.z;
    const int idy = (blockIdx.y * blockDim.y + threadIdx.y) * 2;
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // idy can overshoot second_dim because each thread handles two y-elements,
    // so the y-bounds check is still needed
    if (idz >= first_dim || idy >= second_dim || idx >= third_dim) return;
    const int ida = (idz * second_dim + idy) * third_dim + idx;
    const int idb = idz * feature_dim + idx;
    // idx indexing the linear dimension is important for coalescing
    // 60000 * 20 * 64 + 60000 * 1 * 64
    const float b_val = b[idb];
    a[ida] += b_val;
    if (idy + 1 < second_dim) { // guard the second element in case second_dim is odd
        a[ida + third_dim] += b_val;
    }
}
int main() {
    ...
    // x-component should be multiple of warp size (32) for coalescing
    dim3 block_dims{32, 16, 1};
    dim3 grid_dims{(third_dim + block_dims.x - 1) / block_dims.x, // 2
                   (second_dim + 2 * block_dims.y - 1) / (block_dims.y * 2), // 1
                   first_dim};
    AddKernel<<<grid_dims, block_dims>>>(...);
    ...
}
Now we can also nicely have all accesses to the same element of b in one single block, without needing to worry much about occupancy (given second_dim == 20).
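Whichever variant you pick, it is worth measuring on your actual sizes. Here is a rough timing sketch using CUDA events; the helper name and iteration count are arbitrary, and d_a/d_b are assumed to be already-populated device buffers:

#include <cuda_runtime.h>

// Averages the runtime of one launch configuration over several iterations.
// Note: a keeps accumulating b across iterations, which is fine for timing
// but means the final contents of a are not meaningful.
float TimeAddKernel(float* d_a, float* d_b, dim3 grid, dim3 block,
                    int first_dim, int second_dim, int third_dim,
                    int feature_dim, int iterations = 100) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // warm-up launch so one-time overhead is not measured
    AddKernel<<<grid, block>>>(d_a, d_b, first_dim, second_dim,
                               third_dim, feature_dim);

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i) {
        AddKernel<<<grid, block>>>(d_a, d_b, first_dim, second_dim,
                                   third_dim, feature_dim);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iterations; // average milliseconds per launch
}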