I need to implement an in-place "broadcasting add along an intermediate dimension" operation in CUDA. For example, a (60000*20*64) + b (60000*1*64) = c (60000*20*64), where the result c is written back into a. Currently, I have the following baseline kernel:


__global__ void AddKernel(float* a, float* b, int second_dim, int third_dim,
                          int feature_dim, unsigned int N){

        //N is 60000*20*64, second_dim is 20, third_dim is 64
        //feature_dim is 64: the stride of b along its first dimension
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        if(idx > N) return;

        //60000 * 20 * 64 + 60000 * 1 * 64
        int tmp = idx / (second_dim * third_dim);
        int mod = idx % third_dim;
        tmp = tmp * feature_dim + mod;
        a[idx] += b[tmp];
}
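
For context, a minimal host-side launch of this kernel could look like the following (a sketch: `d_a`/`d_b` are device pointers assumed to be allocated and filled already, and the 256-thread block size is an arbitrary choice):

// minimal launch sketch; d_a and d_b are device pointers,
// 256 threads per block is an arbitrary placeholder choice
unsigned int N = 60000u * 20u * 64u;
int threads = 256;
int blocks = (N + threads - 1) / threads; // round up to cover all N elements
AddKernel<<<blocks, threads>>>(d_a, d_b, 20, 64, 64, N);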

It works, but is there a more efficient way of achieving this?

  • You should really avoid the modulo by `third_dim`; it will strongly decrease performance. Could `third_dim` be a compile-time constant? The same applies to the division by `second_dim * third_dim`: division and modulo are among the slowest operations. Besides this, the memory access pattern will likely be what slows the kernel down, and since that pattern depends on the actual values it is hard to predict. – Jérôme Richard Jun 23 '23 at 16:10
  • @JérômeRichard On the GPU the cost of division/modulo might well be hidden behind memory access latencies and therefore not matter much; i.e., [memory access coalescing](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory) is the more important point. – paleonix Jun 23 '23 at 20:38
  • Is there a reason for only using 1D blocks and grid? Are you trying to generalize to more than 3D tensors? – paleonix Jun 23 '23 at 20:41
  • With zero-based indexing, `if(idx > N) return;` seems like a bug, should be `if(idx >= N) return;` – paleonix Jun 23 '23 at 20:55
  • @paleonix Yes, assuming the memory accesses are not contiguous (so no coalescing), but for relatively contiguous accesses the memory latency is mostly hidden and the throughput can be very high, enough for the modulo to be a problem due to its generally lower throughput (at least, I remember seeing this on a few GPUs, including an Nv-1660S and an Nv-2070-RTX). – Jérôme Richard Jun 23 '23 at 21:50
  • @JérômeRichard As you can see in my answer, I'm not advocating for keeping the division/modulo. I'm just saying that memory access should be the top priority for bandwidth-limited kernels (and I'm assuming this one is bandwidth-limited). – paleonix Jun 23 '23 at 21:59
  • @JérômeRichard Although I have to admit that, given the right block dimensions, the OP's memory access is already nicely coalesced (which is somewhat concealed by the index calculations). So in a way my first solution mainly improves on the OP's in terms of removing the division/remainder operations. – paleonix Jun 23 '23 at 22:12
  • @JérômeRichard They are sadly not compile-time constants, since I need to add tensors of different shapes. – user9875189 Jun 24 '23 at 04:37

1 Answer

The index calculations are much nicer if you make use of 3D blocks and a 3D grid:

__global__ void AddKernel(float* a, float* b,
                          int first_dim, int second_dim, int third_dim,
                          int feature_dim){

        // first_dim is 60000, second_dim is 20, third_dim is 64
        const int idz = blockIdx.z; // * blockDim.z + threadIdx.z;
        const int idy = blockIdx.y * blockDim.y + threadIdx.y;
        const int idx = blockIdx.x * blockDim.x + threadIdx.x;

        if (idz >= first_dim || idy >= second_dim || idx >= third_dim) return;

        const int ida = (idz * second_dim + idy) * third_dim + idx;
        const int idb = idz * feature_dim + idx;
        // idx indexing the linear dimension is important for coalescing

        //60000 * 20 * 64 + 60000 * 1 * 64
        a[ida] += b[idb];
}

int main() {
        ...
        // x-component should be multiple of warp size (32) for coalescing
        // ideal value for y-dimension should be tuned through benchmarking
        dim3 block_dims{32, 16, 1};

        dim3 grid_dims{(third_dim + block_dims.x - 1) / block_dims.x, // 2
                       (second_dim + block_dims.y - 1) / block_dims.y, // 2
                       first_dim};
        AddKernel<<<grid_dims, block_dims>>>(...);
        ...
}
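
To rule out launch configuration problems while tuning, a quick error check right after the launch is worthwhile (a minimal sketch; see the canonical CUDA error-checking Q&A for a more complete macro):

cudaError_t err = cudaGetLastError();  // catches invalid launch configurations
if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();         // catches errors during kernel execution
if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));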

Block Dimensions, Caching and Occupancy

block_dims.y == 20 would be ideal for L1-caching of values from b when second_dim == 20, but it might negatively impact occupancy, i.e. the number of blocks that can be scheduled to SMs in parallel. That is why I would recommend trying out a few launch configurations yourself. A look at the occupancy calculator in the Nsight Compute profiler might also be informative.
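
If you prefer a programmatic estimate over the profiler, the occupancy API gives a quick upper bound (a sketch; `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports how many blocks of a given size can be resident on one SM at once):

int blocks_per_sm = 0;
const int block_size = 32 * 16; // threads per block for block_dims{32, 16, 1}
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, AddKernel, block_size, /*dynamicSMemSize=*/0);
printf("resident blocks per SM at %d threads/block: %d\n",
       block_size, blocks_per_sm);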

Parallelism and Computational Intensity

If there is enough parallelism (which seems to be the case here), doing a little more sequential work can improve performance by increasing computational intensity, as there are more global loads queued per thread:

__global__ void AddKernel(float* a, float* b,
                          int first_dim, int second_dim, int third_dim,
                          int feature_dim){

        // first_dim is 60000, second_dim is 20, third_dim is 64
        const int idz = blockIdx.z; // * blockDim.z + threadIdx.z;
        const int idy = (blockIdx.y * blockDim.y + threadIdx.y) * 2;
        const int idx = blockIdx.x * blockDim.x + threadIdx.x;

        if (idz >= first_dim || idy >= second_dim || idx >= third_dim) return;

        const int ida = (idz * second_dim + idy) * third_dim + idx;
        const int idb = idz * feature_dim + idx;
        // idx indexing the linear dimension is important for coalescing

        // 60000 * 20 * 64 + 60000 * 1 * 64
        const float b_val = b[idb];
        a[ida] += b_val;
        if (idy + 1 < second_dim) // guard the second element against out-of-bounds rows
                a[ida + third_dim] += b_val;
}

int main() {
        ...
        // x-component should be multiple of warp size (32) for coalescing
        dim3 block_dims{32, 16, 1};

        dim3 grid_dims{(third_dim + block_dims.x - 1) / block_dims.x, // 2
                       (second_dim + 2 * block_dims.y - 1) / (block_dims.y * 2), // 1
                       first_dim};
        AddKernel<<<grid_dims, block_dims>>>(...);
        ...
}

Now all accesses to the same element of b can also nicely happen inside one single block (given second_dim == 20, a block's 16 threads in y cover 16 * 2 = 32 rows, i.e. all of them) without needing to worry much about occupancy.
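
The same idea generalizes beyond a factor of two. Here is an untested sketch with a compile-time unroll factor (the bounds check inside the loop handles a second_dim that is not a multiple of the factor; the grid's y-dimension has to be divided by ELEMS_PER_THREAD accordingly):

template <int ELEMS_PER_THREAD>
__global__ void AddKernelUnrolled(float* a, float* b,
                                  int first_dim, int second_dim, int third_dim,
                                  int feature_dim){

        const int idz = blockIdx.z;
        const int idy = (blockIdx.y * blockDim.y + threadIdx.y) * ELEMS_PER_THREAD;
        const int idx = blockIdx.x * blockDim.x + threadIdx.x;

        if (idz >= first_dim || idy >= second_dim || idx >= third_dim) return;

        // one load from b is reused for all rows handled by this thread
        const float b_val = b[idz * feature_dim + idx];

        #pragma unroll
        for (int i = 0; i < ELEMS_PER_THREAD; ++i)
                if (idy + i < second_dim) // tail handling
                        a[(idz * second_dim + idy + i) * third_dim + idx] += b_val;
}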

  • Why do we need the line `a[ida + third_dim] += b_val;` at the end of the second kernel to improve computational intensity? – user9875189 Jun 24 '23 at 04:33
  • @user9875189 Basically the compiler will issue all three global loads one after another before the first one returns, i.e. most if not all of the latency of the additional load from `a` is hidden behind the first two loads (and the addition, but that one is very fast/insignificant). You basically give the compiler a little more to work with on the thread level, in addition to the latency hiding happening on the SM level by context-switching between warps. That being said, I have not benchmarked this and would test these hypotheses if I were you. – paleonix Jun 24 '23 at 11:39
  • @user9875189 Could it be that the kernel isn't launched with 1024 threads due to some kind of resource limit (e.g. number of registers)? You should check e.g. `cudaGetLastError()` directly after launch. See [What is the canonical way to check for errors using the CUDA runtime API?](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) – paleonix Jun 24 '23 at 11:41
  • But `a[ida + third_dim] += b_val;` can't produce correct results. – user9875189 Jun 24 '23 at 12:31
  • I'm happy to correct bugs in that code, but I won't take the time to write the corresponding `main()` function. If you include what you used to come to this conclusion in your question, i.e. an MRE, I can take a look. – paleonix Jun 24 '23 at 14:43