Context
`thrust::reduce` uses `cub::DeviceReduce` in the backend. The documentation says that

> DeviceReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

When Thrust allocates scratch space for `DeviceReduce` with `cudaMalloc` in device code, it probably runs into size limitations:

> `cudaMalloc()` and `cudaFree()` have distinct semantics between the host and device environments. [...] When invoked from the device runtime these functions map to device-side `malloc()` and `free()`. This implies that within the device environment the total allocatable memory is limited to the device `malloc()` heap size, which may be smaller than the available unused device memory.
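For illustration, here is a minimal sketch of the kind of device-side call that runs into this limit (the `reduce_rows` kernel, the row-major layout and the one-reduction-per-thread launch are assumptions made up for the example; compilation with `-rdc=true` is needed for CDP):

```cpp
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

// One reduction per thread: each call to thrust::reduce with thrust::device
// launches a child kernel (CDP) and draws its scratch space from the
// device-side malloc() heap.
__global__ void reduce_rows(const float* data, float* sums, int nrows, int ncols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nrows)
        sums[row] = thrust::reduce(thrust::device,
                                   data + row * ncols,
                                   data + (row + 1) * ncols);
}
```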
Solution Candidate 1: Increase `cudaLimitMallocHeapSize`
This limit can be set with `cudaDeviceSetLimit(cudaLimitMallocHeapSize, value)`, which should enable bigger reductions.
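A minimal sketch of what this could look like on the host before launching the `reduce_rows` kernel from above (the 256 MiB value is an arbitrary example, not a tuned recommendation):

```cpp
#include <cuda_runtime.h>

int main()
{
    // Must be done before launching any kernel that uses device-side malloc()/free(),
    // i.e. before the first kernel containing the device-side thrust::reduce.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256ull * 1024 * 1024); // 256 MiB

    // ... allocate data and sums with cudaMalloc, then e.g.
    // reduce_rows<<<(nrows + 127) / 128, 128>>>(data, sums, nrows, ncols);
    return 0;
}
```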
Solution Candidate 2: Use Custom Allocators (untested)
When calling Thrust algorithms on the host, one can avoid temporary allocations for scratch space by passing a custom allocator to the execution policy, as demonstrated in `examples/cuda/custom_temporary_allocation.cu`. I have never tried to do this in device code, so I can't guarantee that it is implemented there, but one could try to do the allocations on the host, wrap them in an allocator on the device, and pass them to `thrust::device`. If this works, doing a single big allocation on the host should be much more efficient than doing many (`N`) small allocations on the device.
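An untested sketch of what that could look like, mirroring the allocator interface from `custom_temporary_allocation.cu` (the `PreallocatedAllocator` type, the per-row slicing of the scratch buffer, and in particular whether the device-side execution path actually routes its temporary allocations through such an allocator are all unverified assumptions):

```cpp
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <cstddef>

// Hands out pieces of a scratch buffer that was cudaMalloc'ed on the host.
struct PreallocatedAllocator
{
    typedef char value_type;

    char*       slice;    // this thread's part of the host-allocated buffer
    std::size_t capacity; // bytes available in the slice
    std::size_t used;     // bytes already handed out

    __device__ char* allocate(std::ptrdiff_t num_bytes)
    {
        char* p = slice + used;
        used += num_bytes;
        return used <= capacity ? p : nullptr;
    }

    __device__ void deallocate(char*, std::size_t)
    {
        // no-op: the whole buffer is freed on the host after the kernel
    }
};

__global__ void reduce_rows(const float* data, float* sums, int nrows, int ncols,
                            char* scratch, std::size_t bytes_per_row)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;
    PreallocatedAllocator alloc{scratch + row * bytes_per_row, bytes_per_row, 0};
    sums[row] = thrust::reduce(thrust::device(alloc),
                               data + row * ncols,
                               data + (row + 1) * ncols);
}
```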
Solution Candidate 3: Avoid Using CUDA Dynamic Parallelism (preferred)
You can find multiple alternative ways of implementing this reduction in my answers to How to do a reduction over one dimension of 2D data in Thrust and Parallelization of a for loop consisting of Thrust Transforms. In short, I would recommend using `cub::DeviceSegmentedReduce` (from the host).
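A sketch of how a row-wise sum of a dense, row-major `nrows` x `ncols` matrix could look with `cub::DeviceSegmentedReduce::Sum` called from the host (the transform-iterator way of describing the row offsets and the `row_sums` wrapper are just one possible setup):

```cpp
#include <cub/device/device_segmented_reduce.cuh>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cstddef>

// Maps segment index i to the offset of row i in the flattened matrix.
struct RowOffset
{
    int ncols;
    __host__ __device__ int operator()(int i) const { return i * ncols; }
};

void row_sums(const float* d_in, float* d_out, int nrows, int ncols)
{
    auto offsets = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), RowOffset{ncols});

    // First call only queries how many bytes of scratch space CUB needs.
    void*       d_temp     = nullptr;
    std::size_t temp_bytes = 0;
    cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                    nrows, offsets, offsets + 1);

    // One scratch allocation on the host side, then the actual reduction.
    thrust::device_vector<char> temp(temp_bytes);
    d_temp = thrust::raw_pointer_cast(temp.data());
    cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                    nrows, offsets, offsets + 1);
}
```

This way a single temp-storage query and one allocation on the host replace the many device-side allocations from the context above.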
I prefer avoiding CUDA Dynamic Parallelism (CDP) for this problem not only because it is deprecated* in Thrust >= 1.15, but also because I do not expect it to give good performance here: CDP only pays off for problems where the amount of parallelism is data dependent**. The very regular problem of reducing one dimension of a dense matrix does not need dynamic parallelism, so performance should not have to suffer from CDP overheads.
*: CUDA 12 introduced a new CDP API that disallows `cudaDeviceSynchronize()` in device code, which Thrust relies on to keep usage consistent between host and device code. The Thrust CDP deprecation notice says:

> A future version of Thrust will remove support for CUDA Dynamic Parallelism (CDP). This will only affect calls to Thrust algorithms made from CUDA device-side code that currently launches a kernel; such calls will instead execute sequentially on the calling GPU thread instead of launching a device-wide kernel.
I.e. `thrust::device` will just behave the same as `thrust::seq` when used in device code. This would probably solve the issue, as a sequential reduction does not need the additional scratch space, but performance would be bad due to the reduced amount of parallelism.
**: I.e. the ideal number of threads not being known at kernel launch on the host.