
This is my CUDA code:


#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <chrono>
#include <cuda.h>

// Each thread squares its own value 100 million times; every iteration
// depends on the previous one (a loop-carried dependency chain).
__global__ void test(int base, int* out)
{
    int curTh = threadIdx.x + blockIdx.x * blockDim.x;

    int tmp = base * curTh;
#pragma unroll
    for (int i = 0; i < 1000 * 1000 * 100; ++i) {
        tmp *= tmp;
    }
    out[curTh] = tmp;
}

typedef std::chrono::high_resolution_clock Clock;

int main(int argc, char *argv[])
{
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    int data = rand();
    int* d_out;
    void* va_args[10] = {&data, &d_out};  // kernel argument pointers
    int nth = 10;                         // total number of threads

    if (argc > 1) {
        nth = atoi(argv[1]);
    }

    int NTHREADS = 128;                   // threads per block
    printf("nth: %d\n", nth);
    cudaMalloc(&d_out, nth * sizeof(int));

    for (int i = 0; i < 10; ++i) {
        auto start = Clock::now();

        // nth is assumed to be a multiple of NTHREADS when it is larger
        cudaLaunchKernel((const void*) test,
                nth > NTHREADS ? nth / NTHREADS : 1,
                nth > NTHREADS ? NTHREADS : nth, va_args, 0, stream);

        cudaStreamSynchronize(stream);

        // print elapsed wall-clock time in milliseconds
        printf("use :%ldms\n",
               (long) std::chrono::duration_cast<std::chrono::milliseconds>(
                       Clock::now() - start).count());
    }

    cudaDeviceReset();
    printf("host Hello World from CPU!\n");

    return 0;
}

I compiled the code and ran it on a 2080 Ti. I found the elapsed time stays around 214 ms even though the thread count is 3 times the number of GPU cores (the 2080 Ti has 4352):

root@d114:~# ./cutest 1 
nth: 1
use :255ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms


root@d114:~# ./cutest 13056
nth: 13056
use :272ms
use :223ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms

root@d114:~# ./cutest 21760
nth: 21760
use :472ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :428ms


So my question is: why does the elapsed time stay the same while the number of threads increases to 3 times the number of GPU cores?

Does that mean the NVIDIA GPU's computing power is 3 times its core count?

sorfkc
  • Your kernel involves a loop-carried dependency chain. Is that intentional? If not, that would be your explanation for why the code is limited not by instruction throughput but by latency. See for example [here](https://stackoverflow.com/questions/28840271/calculating-flops/28863930#28863930). Any code with long latencies profits from the GPU's approach of using more threads than CUDA cores. That's why you won't see an increase in overall latency for a while: the extra work is hidden in the overlapping dependency chains per thread (a sketch of what breaking that chain looks like follows these comments). – Homer512 Jan 18 '23 at 07:26
  • OT: [Don't use `std::chrono::high_resolution_clock`](https://stackoverflow.com/a/37440647/10107454), use `steady_clock` instead. – paleonix Jan 18 '23 at 13:49
  • Yes, using GPU "core" counts for performance estimation is not a universally correct metric. The situation is more complicated than that. You may wish to study unit 3 (at least) of the online training series [here](https://www.olcf.ornl.gov/cuda-training-series/). – Robert Crovella Jan 18 '23 at 15:49
  • Why are you trying to unroll a 100-million-iteration loop? – einpoklum Jan 18 '23 at 22:52
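To make the first comment concrete, here is a minimal, hypothetical variant of the kernel (it is not from the question or the comments) that splits the same 100 million multiplications over four independent accumulators. Under the latency-bound explanation given there, a version like this would be expected to run noticeably faster at low thread counts, because each thread exposes instruction-level parallelism instead of one long serial dependency chain:

// Hypothetical sketch only: same total work, split over 4 independent chains.
__global__ void test_ilp(int base, int* out)
{
    int curTh = threadIdx.x + blockIdx.x * blockDim.x;

    int t0 = base * curTh, t1 = t0 + 1, t2 = t0 + 2, t3 = t0 + 3;
    for (int i = 0; i < 1000 * 1000 * 25; ++i) {  // 4 chains x 25M iterations = 100M multiplies
        t0 *= t0;   // the four chains do not depend on each other,
        t1 *= t1;   // so the scheduler can issue them back to back
        t2 *= t2;
        t3 *= t3;
    }
    out[curTh] = t0 ^ t1 ^ t2 ^ t3;  // combine results so nothing is optimized away
}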

1 Answer


Even though the GPU pipeline can issue a new instruction only at a rate of one per cycle, it can overlap the instructions of multiple threads, at least 3-4 deep for simple math operations, so an increased number of threads only adds a few cycles of extra latency per thread. But as is visible at nth=21760, adding more of the same instruction eventually fills the pipeline completely and threads start waiting.

21760 / 13056 = 1.667
424 ms / 214 ms = 1.98

This difference between the ratios most likely comes from the tail effect. When the pipelines are fully filled, adding even a small amount of extra work doubles the latency, because the new work is computed as a second wave of computation that starts only after all the others have completed; since all threads execute exactly the same instructions, there is no divergence between them and they finish together like a block of waiting threads. You could add some more threads and the time should stay at 424 ms until you reach a third wave of waiting threads.
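As a rough check (assuming the roughly 3 overlapped chains per core mentioned above are what saturates the 4352 cores), the question's own numbers line up with the wave picture:

13056 threads = 3 x 4352  ->  one full wave      ->  ~214 ms
21760 threads = 5 x 4352  ->  spills into a second wave  ->  ~2 x 214 ms ≈ 424 ms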

The loop iterating 100 million times over a complete dependency chain also limits memory accesses: with only one memory operation per 100 million iterations, bandwidth consumption on the card's memory is negligible.
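To put a rough number on that (assuming the 2080 Ti's roughly 600 GB/s of peak memory bandwidth):

21760 threads x 4 bytes / 0.424 s ≈ 0.2 MB/s

which is far less than a millionth of what the memory system can deliver.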

So the kernel is neither compute nor memory bottlenecked (if you don't count the integer multiplication chain, which has no latency hiding within its own thread, as the computation). Because of this, all SM units of the GPU run with nearly the same timings, apart from a thread-launch latency that is invisible next to the 100M-iteration loop and grows linearly with more threads.

When the algorithm is a real-world one that uses multiple parts of the pipeline (not just integer multiplication), the SM unit can find more threads to overlap in the pipeline. For example, if an SM supports 1024 threads per block (and at most 2 blocks in flight) but has only 128 pipelines, then there are at least 2048 / 128 = 16 slots in which to overlap operations such as reading main memory, floating-point multiplication/addition, reading the constant cache, shuffling registers, etc., and this lets it complete a task more quickly.
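A hypothetical sketch (not part of the answer) of a kernel that touches several of those pipelines, so that threads waiting on one unit can be overlapped with threads using another:

// Hypothetical illustration: each thread uses the memory, FP32 and INT32
// pipelines, giving the warp scheduler different kinds of latency to hide.
__global__ void mixed(const float* in, float* out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;

    float v = in[i];              // global memory load (long latency)
    v = v * 1.0001f + 0.5f;       // FP32 fused multiply-add
    int bucket = i & 15;          // INT32 work, independent of the load
    out[i] = v + (float)bucket;   // more FP32, then the store
}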

huseyin tugrul buyukisik
  • Is there some document describing the details of the GPU pipeline? It might be a good way to find optimizations. – sorfkc Jan 18 '23 at 10:12
  • It's only been reverse-engineered by some people, but I don't remember where. Some tutorials and blogs explain the tail effect and are easily found on Google. – huseyin tugrul buyukisik Jan 18 '23 at 15:18
  • Regarding reverse-engineered details of the pipeline, see for example [Zhe Jia et al.: Dissecting the Nvidia Turing T4 GPU via Microbenchmarking](https://arxiv.org/pdf/1903.07486.pdf) – Homer512 Jan 18 '23 at 15:56