I recently implemented a reduction algorithm following the great tutorial from Mark Harris.
After the first reduction step described in the tutorial, I implemented the second pass (the reduction of the per-block results) as an additional kernel, launched on a grid of a single block whose block size (number of threads) equals the number of blocks used in the previous step (actually half that number, since I perform the first add during the load, as suggested in the tutorial).
The reduction I am performing is essentially a sum reduction, but I calculate the terms to be added inside the first-pass kernel, during the first add on load, so that I don't need to call a separate kernel beforehand and don't have to store the calculated terms in global memory. As a consequence, the kernel for the first-pass reduction (term calculation + sum reduction) is slightly different from the kernel I use for the second pass (pure sum reduction).
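To make the setup concrete, here is a simplified sketch of how the two passes are structured (the names, the launch dimensions and the `computeTerm()` placeholder are illustrative, not my actual code; it also assumes power-of-two block sizes, as in the tutorial):

```
// Placeholder for the per-element calculation I do during the first add on load.
__device__ float computeTerm(const float* mat, int idx)
{
    return mat[idx] * mat[idx];   // just an example
}

// First pass: term calculation + block-level sum reduction (first add during load).
__global__ void reduceFirstPass(const float* mat, float* blockSums, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    float sum = 0.0f;
    if (i < n)              sum  = computeTerm(mat, i);
    if (i + blockDim.x < n) sum += computeTerm(mat, i + blockDim.x);
    sdata[tid] = sum;
    __syncthreads();

    // Shared-memory tree reduction (assumes blockDim.x is a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];
}

// Second pass: pure sum reduction of the per-block results,
// launched with 1 block of (numBlocks / 2) threads.
__global__ void reduceSecondPass(const float* blockSums, float* result, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    float sum = 0.0f;
    if (tid < n)              sum  = blockSums[tid];
    if (tid + blockDim.x < n) sum += blockSums[tid + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) *result = sdata[0];
}

// Host side (sketch):
//   reduceFirstPass <<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_mat, d_blockSums, n);
//   reduceSecondPass<<<1, numBlocks / 2, (numBlocks / 2) * sizeof(float)>>>(d_blockSums, d_result, numBlocks);
```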
The two-step reduction works fine, but it has the following issues:
- I need a second reduction kernel for the "pure sum" second pass
- If the grid size of the first-pass kernel exceeds 1024 blocks (actually 2048, see above), the second-pass kernel cannot be launched, because the block size is limited to 1024 threads
I then found an alternative that uses atomic functions, as described here. With it, I can use a single kernel and no longer need to worry about the block size of the second-pass kernel; all in all, I like this approach much better. I am currently reducing 2D matrices with sizes from 128x128 to 2048x2048. Probably also because of the calculations I perform inside the kernel, I don't notice any performance downside when using the atomic sum instead of the additional kernel.
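For reference, the single-kernel variant I'm testing looks roughly like this (again a simplified sketch, not my actual code; `computeTerm()` is the same placeholder as above, and `*result` has to be zeroed, e.g. with `cudaMemset`, before the launch):

```
// Same placeholder as in the first sketch.
__device__ float computeTerm(const float* mat, int idx)
{
    return mat[idx] * mat[idx];
}

// Single kernel: term calculation + block-level reduction + atomicAdd of the block result.
__global__ void reduceAtomic(const float* mat, float* result, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    float sum = 0.0f;
    if (i < n)              sum  = computeTerm(mat, i);
    if (i + blockDim.x < n) sum += computeTerm(mat, i + blockDim.x);
    sdata[tid] = sum;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Each block adds its partial sum directly to the final result.
    // *result must have been zeroed before the launch.
    if (tid == 0) atomicAdd(result, sdata[0]);
}
```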
When comparing the two approaches, the results of the atomic sum are almost the same as those of the second-pass kernel, but they keep changing slightly over time. In the tests I did, the changes are very small and cyclical, so I could definitely live with them, but I was wondering why these changes happen and whether I should worry about them, even though in my tests they seem negligible. I would really like to adopt the atomic-sum approach, but not if it might hurt the accuracy of the results.
I suspect the changes I'm seeing are related to the varying execution order of the final atomic adds, but I wanted to hear the opinion of somebody with more experience than me.
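For what it's worth, here is a tiny host-side example (my own, just to illustrate the point to myself) showing that single-precision addition is not associative, so a different order of the final adds can indeed change the last bits of the result:

```
#include <cstdio>

int main()
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    printf("%g\n", (a + b) + c);  // prints 1
    printf("%g\n", a + (b + c));  // prints 0: the 1.0f is absorbed when added to -1.0e8f
    return 0;
}
```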
Can I safely assume that the changes over time I notice in the results of the atomic-add approach are just "numerical issues" that will not affect the accuracy of the results in any significant way?