Context
`thrust::reduce` uses `cub::DeviceReduce` in the backend. The documentation says that

> DeviceReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

When Thrust allocates scratch space for `DeviceReduce` with `cudaMalloc` in device code, it probably runs into size limitations:

> `cudaMalloc()` and `cudaFree()` have distinct semantics between the host and device environments. [...] When invoked from the device runtime these functions map to device-side `malloc()` and `free()`. This implies that within the device environment the total allocatable memory is limited to the device `malloc()` heap size, which may be smaller than the available unused device memory.
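For illustration, here is a minimal sketch of the kind of device-side call that runs into this limit (the `reduce_rows` kernel, the row-major layout and the one-reduction-per-thread launch are assumptions made up for the example; compilation with `-rdc=true` is needed for CDP):

```cpp
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

// One reduction per thread: each call to thrust::reduce with thrust::device
// launches a child kernel (CDP) and draws its scratch space from the
// device-side malloc() heap.
__global__ void reduce_rows(const float* data, float* sums, int nrows, int ncols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nrows)
        sums[row] = thrust::reduce(thrust::device,
                                   data + row * ncols,
                                   data + (row + 1) * ncols);
}
```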
Solution Candidate 1: Increase `cudaLimitMallocHeapSize`
This limit can be set with `cudaDeviceSetLimit(cudaLimitMallocHeapSize, value)`, which should enable bigger reductions.
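A minimal sketch of what this could look like on the host before launching the `reduce_rows` kernel from above (the 256 MiB value is an arbitrary example, not a tuned recommendation):

```cpp
#include <cuda_runtime.h>

int main()
{
    // Must be done before launching any kernel that uses device-side malloc()/free(),
    // i.e. before the first kernel containing the device-side thrust::reduce.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256ull * 1024 * 1024); // 256 MiB

    // ... allocate data and sums with cudaMalloc, then e.g.
    // reduce_rows<<<(nrows + 127) / 128, 128>>>(data, sums, nrows, ncols);
    return 0;
}
```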
Solution Candidate 2: Use Custom Allocators (untested)
When calling Thrust algorithms on the host, one can avoid temporary allocations for scratch space by passing a custom allocator to the execution policy, as demonstrated in `examples/cuda/custom_temporary_allocation.cu`. I have never tried to do this in device code, so I can't guarantee that it is implemented there, but one could try to do the allocations on the host, wrap them in an allocator on the device, and pass them to `thrust::device`. If this works, doing a single big allocation on the host should be much more efficient than doing many (`N`) small allocations on the device.
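An untested sketch of what that could look like, mirroring the allocator interface from `custom_temporary_allocation.cu` (the `PreallocatedAllocator` type, the per-row slicing of the scratch buffer, and in particular whether the device-side execution path actually routes its temporary allocations through such an allocator are all unverified assumptions):

```cpp
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <cstddef>

// Hands out pieces of a scratch buffer that was cudaMalloc'ed on the host.
struct PreallocatedAllocator
{
    typedef char value_type;

    char*       slice;    // this thread's part of the host-allocated buffer
    std::size_t capacity; // bytes available in the slice
    std::size_t used;     // bytes already handed out

    __device__ char* allocate(std::ptrdiff_t num_bytes)
    {
        char* p = slice + used;
        used += num_bytes;
        return used <= capacity ? p : nullptr;
    }

    __device__ void deallocate(char*, std::size_t)
    {
        // no-op: the whole buffer is freed on the host after the kernel
    }
};

__global__ void reduce_rows(const float* data, float* sums, int nrows, int ncols,
                            char* scratch, std::size_t bytes_per_row)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;
    PreallocatedAllocator alloc{scratch + row * bytes_per_row, bytes_per_row, 0};
    sums[row] = thrust::reduce(thrust::device(alloc),
                               data + row * ncols,
                               data + (row + 1) * ncols);
}
```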
Solution Candidate 3: Avoid Using CUDA Dynamic Parallelism (preferred)
You can find multiple alternative ways of implementing this reduction in my answers to How to do a reduction over one dimension of 2D data in Thrust and Parallelization of a for loop consisting of Thrust Transforms. In short, I would recommend using `cub::DeviceSegmentedReduce` (from the host).
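A sketch of how a row-wise sum of a dense, row-major `nrows` x `ncols` matrix could look with `cub::DeviceSegmentedReduce::Sum` called from the host (the transform-iterator way of describing the row offsets and the `row_sums` wrapper are just one possible setup):

```cpp
#include <cub/device/device_segmented_reduce.cuh>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cstddef>

// Maps segment index i to the offset of row i in the flattened matrix.
struct RowOffset
{
    int ncols;
    __host__ __device__ int operator()(int i) const { return i * ncols; }
};

void row_sums(const float* d_in, float* d_out, int nrows, int ncols)
{
    auto offsets = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), RowOffset{ncols});

    // First call only queries how many bytes of scratch space CUB needs.
    void*       d_temp     = nullptr;
    std::size_t temp_bytes = 0;
    cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                    nrows, offsets, offsets + 1);

    // One scratch allocation on the host side, then the actual reduction.
    thrust::device_vector<char> temp(temp_bytes);
    d_temp = thrust::raw_pointer_cast(temp.data());
    cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                    nrows, offsets, offsets + 1);
}
```

This way a single temp-storage query and one allocation on the host replace the many device-side allocations from the context above.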
I prefer avoiding CUDA Dynamic Parallelism (CDP) for this problem not only because it is deprecated* in Thrust >= 1.15, but also because I do not expect it to give good performance here: CDP only pays off for problems where the amount of parallelism is data dependent**. The very regular problem of reducing one dimension of a dense matrix does not need dynamic parallelism, so performance should not have to suffer from CDP overheads.
*: CUDA 12 introduced a new CDP API that disallows `cudaDeviceSynchronize()` in device code, which Thrust relies on to keep usage consistent between host and device code. The Thrust CDP deprecation notice says:

> A future version of Thrust will remove support for CUDA Dynamic Parallelism (CDP). This will only affect calls to Thrust algorithms made from CUDA device-side code that currently launches a kernel; such calls will instead execute sequentially on the calling GPU thread instead of launching a device-wide kernel.
I.e. `thrust::device` will just behave the same as `thrust::seq` when used in device code. This would probably solve the issue, as a sequential reduction does not need the additional scratch space, but performance would be bad due to the reduced amount of parallelism.
**: I.e. the ideal number of threads not being known at kernel launch on the host.