
I recently implemented a reduction algorithm following the great tutorial from Mark Harris.

After the first reduction step described in the tutorial, I implemented the second pass (the reduction of the per-block results) with an additional kernel. This kernel is launched on a grid of one block, with a block size (number of threads) equal to the number of blocks used in the previous step (actually half that number, since I perform the first add during load, as mentioned in the tutorial).
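To make this concrete, here is a stripped-down sketch of what the second pass looks like (the identifiers are illustrative, not my exact code, and the block size is assumed to be a power of two):

```cuda
// Pure sum reduction over the per-block partial sums from the first pass.
// Launched with a single block whose size is half the first-pass grid size,
// because the first add is performed during the load.
__global__ void reduceSumSecondPass(const float* __restrict__ blockSums,
                                    float* __restrict__ out,
                                    int numBlockSums)
{
    extern __shared__ float sdata[];  // blockDim.x floats of dynamic shared memory

    unsigned int tid = threadIdx.x;

    // First add during load: each thread sums two of the per-block results.
    float a = (tid              < numBlockSums) ? blockSums[tid]              : 0.0f;
    float b = (tid + blockDim.x < numBlockSums) ? blockSums[tid + blockDim.x] : 0.0f;
    sdata[tid] = a + b;
    __syncthreads();

    // Shared-memory tree reduction within the single block.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        *out = sdata[0];
}

// Host side: one block, block size = first-pass grid size / 2.
// reduceSumSecondPass<<<1, firstPassBlocks / 2,
//                       (firstPassBlocks / 2) * sizeof(float)>>>(d_blockSums, d_out, firstPassBlocks);
```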

The reduction I am performing is essentially a sum reduction, but I calculate the terms to be added inside the first-pass kernel, during the first add, to avoid having to run a separate kernel beforehand. This way I avoid storing the calculated terms in global memory. As a consequence, the kernel for the first-pass reduction (term calculation + sum reduction) is slightly different from the kernel I use for the second pass (pure sum reduction).
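For reference, a sketch of the first-pass kernel (again with illustrative names; `computeTerm` is a placeholder for the actual per-element calculation I do during the load):

```cuda
// Placeholder for the real per-element term calculation.
__device__ float computeTerm(float x) { return x * x; }

// First pass: compute the terms during the load (first add during load),
// reduce them within each block in shared memory, and write one partial
// sum per block to global memory.
__global__ void reduceTermsFirstPass(const float* __restrict__ in,
                                     float* __restrict__ blockSums,
                                     int n)
{
    extern __shared__ float sdata[];  // blockDim.x floats of dynamic shared memory

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // The terms are computed here instead of in a separate kernel,
    // so they never have to be stored in global memory.
    float term0 = (i              < n) ? computeTerm(in[i])              : 0.0f;
    float term1 = (i + blockDim.x < n) ? computeTerm(in[i + blockDim.x]) : 0.0f;
    sdata[tid] = term0 + term1;
    __syncthreads();

    // Shared-memory tree reduction (block size assumed to be a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One partial sum per block; these are the inputs of the second pass.
    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}
```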

The two-step reduction works fine, but it has the following issues:

  • I need a second reduction kernel for the "pure sum" second pass
  • If the grid size of the first-pass kernel is larger than 1024 (actually 2048, see above), the second-pass kernel cannot run, because the block size is limited to 1024 threads

I found out about an alternative using atomic functions, as described here. In this case I can have a single kernel, and I don't need to worry about the block size of a second-pass kernel. All in all, I like this approach much more. I am presently reducing 2D matrices with sizes from 128x128 to 2048x2048. Probably also because of the calculations I'm performing in the kernel, I don't notice any performance downside when using the atomic sum instead of the additional kernel.
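For comparison, a sketch of the single-kernel atomic variant I mean (same illustrative names as above; `result` must be zeroed before the launch):

```cuda
// Placeholder for the real per-element term calculation (as in the first-pass sketch).
__device__ float computeTerm(float x) { return x * x; }

// Single-kernel variant: each block reduces its terms in shared memory as
// before, but thread 0 adds the block's partial sum directly into the final
// result with atomicAdd instead of writing it to a per-block array.
__global__ void reduceTermsAtomic(const float* __restrict__ in,
                                  float* __restrict__ result,  // must be zeroed before launch
                                  int n)
{
    extern __shared__ float sdata[];  // blockDim.x floats of dynamic shared memory

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    float term0 = (i              < n) ? computeTerm(in[i])              : 0.0f;
    float term1 = (i + blockDim.x < n) ? computeTerm(in[i + blockDim.x]) : 0.0f;
    sdata[tid] = term0 + term1;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // The order in which the blocks reach this atomicAdd is not deterministic,
    // so the final sum can differ slightly from run to run.
    if (tid == 0)
        atomicAdd(result, sdata[0]);
}
```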

When comparing the two approaches, the results of the atomic sum are almost the same as those of the second-pass kernel, but they continuously change over time. In my tests the changes are very small and cyclical, so I could definitely live with them, but I was wondering why these changes happen and whether I should worry about them, even though they seem negligible in my tests. I would really like to adopt the atomic sum approach, but not if it might hinder the accuracy of the results.

I guess the changes I'm seeing could be related to different orders of execution of the final atomic adds, but I wanted to hear the opinion of somebody with more experience than me.

Can I safely assume that the changes over time I notice in the results of the atomic-add approach are just "numerical issues" that will not impact the accuracy of the results in any significant way?

  • Yes, what you are seeing should just be floating point error from reordering operations as they are not associative. – paleonix Aug 11 '23 at 11:01
  • You can find three different implementations without atomics in the official CUDA samples: [`2_Concepts_and_Techniques/reduction`](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/2_Concepts_and_Techniques/reduction), [`2_Concepts_and_Techniques/threadFenceReduction`](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/2_Concepts_and_Techniques/threadFenceReduction), [`2_Concepts_and_Techniques/reductionMultiBlockCG`](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/2_Concepts_and_Techniques/reductionMultiBlockCG). The first one is the classic multi-launch version. – paleonix Aug 11 '23 at 11:04
  • It just keeps launching kernels until the reduction is complete. There is no reason to limit yourself to two launches other than observed performance. Atomics are generally said not to scale well, in addition to their non-deterministic floating point results. I still have seen them used in a good number of codes. – paleonix Aug 11 '23 at 11:08
  • For production (instead of education) I would use the CUB library coming with the CUDA Toolkit, i.e. [`cub::DeviceReduce`](https://nvlabs.github.io/cub/structcub_1_1_device_reduce.html) (can use a `transform_iterator` for the first pass calculations) or, if that isn't flexible enough, on the block level [`cub::BlockReduce`](https://nvlabs.github.io/cub/classcub_1_1_block_reduce.html). – paleonix Aug 11 '23 at 11:13
  • @paleonix Thanks a lot for the clarification and the very useful examples! I will surely take a look at all of them and see which one suits my application best. I did not know about CUB; it does look interesting for production code. – donkaio Aug 11 '23 at 13:00
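A minimal sketch of the `cub::DeviceReduce` approach suggested in the comments above, with a placeholder functor standing in for the per-element term calculation:

```cuda
#include <cub/cub.cuh>
#include <thrust/iterator/transform_iterator.h>

// Placeholder for the per-element term calculation.
struct ComputeTerm
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

// Sum-reduce the transformed input without ever storing the terms in global memory.
void reduceWithCub(const float* d_in, float* d_out, int n)
{
    auto in = thrust::make_transform_iterator(d_in, ComputeTerm{});

    // First call only queries the required temporary storage size.
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, in, d_out, n);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, in, d_out, n);
    cudaFree(d_temp_storage);
}
```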

0 Answers