
I am working with CUDA 10.2 on Ubuntu 18.04. My GPU is a Tesla T4, which has 16 GB of memory, and no other programs are running on it. The short piece of code is as follows:

#include <iostream>
#include <algorithm>
#include <random>
#include <vector>
#include <numeric>
#include <chrono>

#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <thrust/execution_policy.h>


// Functor that sorts the idx-th length-`stride` segment of `data` in place.
struct sort_functor {

    thrust::device_ptr<float> data;
    int stride = 1;
    __host__ __device__
    void operator()(int idx) {
        thrust::sort(thrust::device,
                data + idx * stride,
                data + (idx + 1) * stride);
    }
};


int main() {
    std::random_device rd;
    std::mt19937 engine;
    engine.seed(rd());
    std::uniform_real_distribution<float> u(0, 90.);

    int M = 8;             // number of segments to sort
    int N = 8 * 384 * 300; // elements per segment

    std::vector<float> v(M * N);
    std::generate(v.begin(), v.end(), [&](){return u(engine);});
    thrust::host_vector<float> hv(v.begin(), v.end());
    thrust::device_vector<float> dv = hv;

    thrust::device_vector<float> res(dv.begin(), dv.end());

    thrust::device_vector<int> index(M);
    thrust::sequence(thrust::device, index.begin(), index.end(), 0, 1);

    // Launch one device-side sort per segment index.
    thrust::for_each(thrust::device, index.begin(), index.end(),
            sort_functor{res.data(), N});
    cudaDeviceSynchronize();

    return 0;
}

The error message is:

temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
temporary_buffer::allocate: get_temporary_buffer failed
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
Aborted (core dumped)

How can I solve this problem?

CoinCheung
  • Don't use execution policies with device vectors – talonmies Oct 20 '20 at 09:57
  • @talonmies thanks for telling me this, would you write an answer and explain why? – CoinCheung Oct 20 '20 at 11:23
  • That isn't a solution to your problem, it is an observation. Thrust automatically knows how to handle the execution policy for host and device vectors in host code. The only time you would need an explicit execution policy is when you pass raw pointers (which have no tag-based metadata) to a Thrust call; then the execution policy is needed for the compiler to dispatch the call to the correct backend. – talonmies Oct 21 '20 at 05:20
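
To illustrate the dispatch rule described in the comment above, here is a minimal sketch of my own (not part of the original thread): iterators from a device_vector carry tag metadata, so Thrust picks the CUDA backend by itself, while a raw pointer needs an explicit policy.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

int main() {
    thrust::device_vector<float> d(1000, 1.0f);

    // device_vector iterators are tagged, so no execution policy is needed:
    thrust::sort(d.begin(), d.end());

    // a raw pointer has no tag, so the policy tells Thrust which backend to use:
    float* raw = thrust::raw_pointer_cast(d.data());
    thrust::sort(thrust::device, raw, raw + d.size());

    return 0;
}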

2 Answers


thrust::sort requires O(N) temporary memory. When you call it from device code (i.e. from your functor), that temporary allocation (one per call, i.e. one for each of your 8 calls) is made on the device, using new or malloc under the hood, and it comes out of the "device heap" space. The device heap is limited to 8 MB by default, but you can change this. You are hitting this limit.
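
As an aside, here is a small sketch of my own (not from the original answer) showing the mechanism: an in-kernel allocation draws from this heap and simply fails once the request exceeds the current cudaLimitMallocHeapSize.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void alloc_probe(size_t bytes) {
    void* p = malloc(bytes);  // in-kernel malloc draws from the device heap
    printf("alloc of %llu bytes %s\n", (unsigned long long)bytes,
           p ? "succeeded" : "failed");
    if (p) free(p);
}

int main() {
    alloc_probe<<<1, 1>>>(16ULL << 20);  // 16 MB > 8 MB default -> fails
    cudaDeviceSynchronize();
    return 0;
}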

If you add the following at the top of your main routine:

cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1048576ULL*1024);

With that change, your code runs without any runtime errors for me.

I'm not suggesting that I calculated the 1 GB value above carefully; I simply picked a value much larger than 8 MB but much smaller than 16 GB, and it seemed to work. In the general case, you should carefully estimate the amount of temporary allocation you will need.
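
As a rough illustration of such an estimate (my own sketch; the 2x slack factor is an assumption, not a documented requirement of thrust::sort): with M concurrent sorts over segments of N floats, size the heap from the problem dimensions instead of guessing.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int M = 8;
    const int N = 8 * 384 * 300;

    size_t defaultHeap = 0;
    cudaDeviceGetLimit(&defaultHeap, cudaLimitMallocHeapSize);
    printf("default device heap: %llu bytes\n", (unsigned long long)defaultHeap);

    // O(N) temporary storage per sort, M sorts in flight, 2x slack (assumption):
    size_t estimate = 2ULL * M * N * sizeof(float);
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, estimate);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}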

Robert Crovella

While Robert Crovella's answer technically solves the problem, nesting Thrust algorithms, and therefore relying on CUDA Dynamic Parallelism (CDP), is inefficient in the first place when the problem size is known up front. You can find my argument on this matter, and how Thrust 1.15 deprecated CDP, detailed in my answer to Why the iterating range of thrust::reduce is limited to 2048 double?.

For a batched sort there is cub::DeviceSegmentedSort or, since you are sorting floats, cub::DeviceSegmentedRadixSort. CUB is used in the back end of Thrust and is therefore always available when Thrust (with the CUDA backend) is available. These algorithms arrived with CUB 1.15 in October 2021, i.e. a year too late for the OP.
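
A minimal sketch of how this could look for the equal-length segments in the question (my own example; the offset setup and the two-phase temp-storage pattern are standard CUB usage, but the specific values are assumptions):

#include <cub/device/device_segmented_radix_sort.cuh>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

int main() {
    const int M = 8;               // number of segments
    const int N = 8 * 384 * 300;   // elements per segment

    thrust::device_vector<float> keys_in(M * N, 1.0f);  // fill with real data
    thrust::device_vector<float> keys_out(M * N);

    // Segment i spans [offsets[i], offsets[i+1]); stride N for every segment.
    thrust::device_vector<int> offsets(M + 1);
    thrust::sequence(offsets.begin(), offsets.end(), 0, N);

    const float* d_in  = thrust::raw_pointer_cast(keys_in.data());
    float*       d_out = thrust::raw_pointer_cast(keys_out.data());
    const int*   d_off = thrust::raw_pointer_cast(offsets.data());

    // First call only queries the required temporary storage size...
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSegmentedRadixSort::SortKeys(
        d_temp, temp_bytes, d_in, d_out, M * N, M, d_off, d_off + 1);

    // ...then allocate that storage and perform the actual sort.
    thrust::device_vector<char> temp(temp_bytes);
    d_temp = thrust::raw_pointer_cast(temp.data());
    cub::DeviceSegmentedRadixSort::SortKeys(
        d_temp, temp_bytes, d_in, d_out, M * N, M, d_off, d_off + 1);

    cudaDeviceSynchronize();
    return 0;
}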

paleonix