#include <iostream>
#include <math.h>
#include <vector>
#include <assert.h>
#include <fstream>
#include <map>
#include <algorithm>
#include <sstream>
#include <chrono>   // std::chrono, used by MyTimer
#include <cstring>  // memcpy
#include <cuda_runtime_api.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <cub/cub.cuh>
using namespace std;
typedef float real;

int MAX_N = 10000000;
int N;
real* a, *b;
real* d_a;
real* h_res1, *h_res2;
volatile real v_res = 0;

class MyTimer {
    std::chrono::time_point<std::chrono::system_clock> start;

public:
    void startCounter() {
        start = std::chrono::system_clock::now();
    }

    int64_t getCounterNs() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start).count();
    }

    int64_t getCounterMs() {
        return std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start).count();
    }

    double getCounterMsPrecise() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start).count()
                / 1000000.0;
    }
};

void genData()
{
    N = 100000;    
    for (int i = 0; i < N; i++) a[i] = float(rand() % 1000) / (rand() % 1000 + 1);
}

void __attribute__((noinline)) testCpu(real* arr, real* res, int N) 
{
    std::sort(arr, arr + N);
    v_res = arr[rand() % N];
    memcpy(res, arr, N * sizeof(real));
}

__global__
void sort_kernel(float* a, int N)
{
    if (blockIdx.x==0 && threadIdx.x==0)
        thrust::sort(thrust::device, a, a + N);
    __syncthreads();
}

void __attribute__((noinline)) testGpu(real* arr, real* res, int N)
{
    MyTimer timer;

    timer.startCounter();
    cudaMemcpy(d_a, arr, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    cout << "Copy H2D cost = " << timer.getCounterMsPrecise() << "\n";

    timer.startCounter();
    //thrust::sort(thrust::device, d_a, d_a + N);
    sort_kernel<<<1,1>>>(d_a, N);
    cudaDeviceSynchronize();
    cout << "Thrust sort cost = " << timer.getCounterMsPrecise() << "\n";

    timer.startCounter();
    cudaMemcpy(res, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    cout << "Copy D2H cost = " << timer.getCounterMsPrecise() << "\n";

    v_res = res[rand() % N];
}

void __attribute__((noinline)) deepCopy(real* a, real* b, int N) 
{
    for (int i = 0; i < N; i++) b[i] = a[i];
}

void testOne(int t, bool record = true)
{
    MyTimer timer;

    genData();
    deepCopy(a, b, N);

    timer.startCounter();
    testCpu(a, h_res1, N);
    cout << "CPU cost = " << timer.getCounterMsPrecise() << "\n";

    timer.startCounter();
    testGpu(b, h_res2, N);
    cout << "GPU cost = " << timer.getCounterMsPrecise() << "\n";

    for (int i = 0; i < N; i++) {
        if (h_res1[i] != h_res2[i]) {
            cout << "ERROR " << i << " " << h_res1[i] << " " << h_res2[i] << "\n";
            exit(1);
        }
    }

    cout << "-----------------\n";
}


int main()
{
    a = new real[MAX_N];
    b = new real[MAX_N];
    cudaMalloc(&d_a, MAX_N * sizeof(float));
    cudaMallocHost(&h_res1, MAX_N * sizeof(float));
    cudaMallocHost(&h_res2, MAX_N * sizeof(float));

    testOne(0, 0);
    for (int i = 1; i <= 50; i++) testOne(i);
}

For legacy code reasons, I have to perform the sort entirely inside a kernel. Basically, I need:

__global__ void mainKernel(float** a, int N, float* global_pad)
{
    int x;
    ...
    cooperative_groups::grid_group g = cooperative_groups::this_grid();
    sortFunc(a[x], N); // this can be a kernel. Then only 1 thread in the grid will call it
    g.sync();
    ...
}

I tried to use thrust::sort, but it's extremely slow. For example, with N = 100000, the benchmark results are:

CPU cost = 5.82228
Copy H2D cost = 0.088908
Thrust sort from CPU cost = 0.391211 (running the line thrust::sort(thrust::device, d_a, d_a + N);)
Thrust sort inside kernel cost = 116 (running the line sort_kernel<<<1,1>>>(d_a, N);)
Copy D2H cost = 0.067639

Why is thrust::sort so slow in this case? I want to find an implementation of sortFunc that is as fast as possible (global_pad can be used as temporary memory).
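
For reference, this is roughly how I picture global_pad being used as the temporary buffer, sketched here with CUB's host-side radix sort (d_out and GLOBAL_PAD_BYTES are just placeholders for this sketch):

size_t temp_bytes = 0;
// Size query: with a null temp-storage pointer, CUB only reports how many bytes it needs
cub::DeviceRadixSort::SortKeys(nullptr, temp_bytes, d_a, d_out, N);
assert(temp_bytes <= GLOBAL_PAD_BYTES); // global_pad must be at least this large
// Reuse the pre-allocated pad instead of letting the library allocate on every call
cub::DeviceRadixSort::SortKeys(global_pad, temp_bytes, d_a, d_out, N); // sorted keys end up in d_out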

Edit: I'm using 2080ti and CUDA 11.4. The compile command I use is nvcc -o main main.cu -O3 -std=c++17

Huy Le
  • Your "inside the kernel" case is doing the sort with a single thread on the GPU. Why is it surprising that it is much slower than using the whole GPU to perform the sort, as happens with using the host side sort API? – talonmies Dec 22 '22 at 10:35
  • There's a mistake in the code, it's `thrust::device` not `thrust::seq`. Does it not launch a child kernel? – Huy Le Dec 22 '22 at 10:41
  • Basically, I need `sortFunc()` to perform the same as calling `thrust::sort` from host. Temporary memory is already allocated in `global_pad` – Huy Le Dec 22 '22 at 10:47
  • CUDA Dynamic Parallelism like this has a big overhead. CUDA 12 has a new version of dynamic parallelism, but it doesn't allow accessing the results of child-kernels. I don't think Thrust was adapted to this yet. Thrust deprecated calling the non-sequential algorithms inside kernels due to the coming change in CUDA. – paleonix Dec 22 '22 at 12:30
  • I'm not quite sure how to interpret the deprecation note in the [Thrust 1.16 release notes](https://github.com/NVIDIA/thrust/releases/tag/1.16.0). It might be that you are already getting `seq` behavior when calling with `device` execution policy in device code. What CUDA and Thrust version are you using? – paleonix Dec 22 '22 at 12:34
  • Your posted code doesn't create the printout you have shown. There is nothing in your posted code that prints out "Thrust sort from CPU cost". When I compile and run your code normally on a V100, I get a CPU cost of 23 and a thrust sort cost of 0.6. So I think you should provide a very complete test case, including the GPU you are running on, the CUDA version, the OS, and the compile command line. Ordinarily to take advantage of CDP with legacy thrust, you would need to compile for CDP, including `-rdc=true`. But even without that, I don't get anything like your numbers. – Robert Crovella Dec 22 '22 at 15:50
  • @RobertCrovella the 2 results correspond to commenting out one of the 2 lines `thrust::sort(thrust::device, d_a, d_a + N);` and `//sort_kernel<<<1,1>>>(d_a, N);`. Could you run it again, this time running the line `sort_kernel...`? – Huy Le Dec 23 '22 at 02:39
  • @paleonix I'm using 2080ti and CUDA 11.4. The command line I use is `nvcc -o main main.cu -O3 -std=c++17` – Huy Le Dec 23 '22 at 02:40
  • Ok, then your Thrust version is certainly lower than 1.16 and it should have nothing to do with the deprecation. – paleonix Dec 23 '22 at 12:02

1 Answer


You need to turn on dynamic parallelism in the compile command.

Use -rdc=true: nvcc -o main main.cu -O3 -std=c++17 -rdc=true.
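
If the device runtime is not linked automatically by your toolchain, add it explicitly as well (newer nvcc versions do this for you, see the comments below):

nvcc -o main main.cu -O3 -std=c++17 -rdc=true -lcudadevrt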

Then the two code blocks below are equivalent:

__global__
void sort_kernel(float* a, int N)
{
    if (blockIdx.x==0 && threadIdx.x==0)
        thrust::sort(thrust::device, a, a + N);
    __syncthreads();
}
...
sort_kernel<<<1,1>>>(d_a, N);

and

thrust::sort(thrust::device, d_a, d_a + N);
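
If you also want to avoid the temporary allocation thrust performs inside the kernel, a possible alternative is CUB's device-wide radix sort, which accepts your own scratch buffer and (depending on the CUB version shipped with your toolkit) can also be invoked from device code when compiled with -rdc=true. An untested sketch, where d_out and pad_bytes are assumptions and not part of the code above:

__global__
void sort_kernel_cub(float* a, float* d_out, void* global_pad, size_t pad_bytes, int N)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // pad_bytes must be at least the temp-storage size CUB reports for N floats;
        // global_pad is used only as scratch space, the sorted keys are written to d_out.
        size_t temp_bytes = pad_bytes;
        cub::DeviceRadixSort::SortKeys(global_pad, temp_bytes, a, d_out, N);
        // Like the thrust version, this launches child kernels; their results are only
        // guaranteed to be visible after the children complete (e.g. at kernel exit).
    }
}
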
Huy Le
  • `-rdc` does not turn on dynamic parallelism AFAIK; it results in relocatable device code, which must then be (device-)linked before it can be used. That's used to compile different parts of your kernel separately. – einpoklum Dec 23 '22 at 10:12
  • Oh, that's strange. I just know that if I include `-rdc` then it works – Huy Le Dec 23 '22 at 10:32
  • The [documentation](https://docs.nvidia.com/cuda/archive/11.4.4/cuda-c-programming-guide/index.html#compiling-and-linking) says that you need to link with `-lcudadevrt` (in addition to using `-rdc=true`). But I guess if the linker doesn't give you an error, it just did that automatically? – paleonix Dec 23 '22 at 12:09
  • @paleonix: nvcc might be doing that for you I suppose. – einpoklum Dec 23 '22 at 13:09
  • @einpoklum It has to based on OPs observation, although I wonder if this works independent of the CUDA version. If you put it into a Makefile, I would certainly prefer the explicit version. – paleonix Dec 23 '22 at 13:11
  • Yes, newer versions of nvcc include `-lcudadevrt` automatically, as needed. The complete enablement of CDP required: 1. an arch specification that supports CDP (no longer necessary with newer `nvcc` versions, as the default suffices); 2. enablement of relocatable device code with device linking; 3. addition of the device runtime library (no longer necessary with newer `nvcc` versions). Once the proper environment is enabled, legacy thrust will detect this and use CDP automatically where it makes sense, when thrust algorithms are used in device code with an appropriate execution policy. – Robert Crovella Dec 23 '22 at 14:38