
I am currently experimenting with CUDA in C++.
I have a general understanding of how CPUs and GPUs work and how they are expected to perform.

I wrote a test program that does vector addition on big arrays.
On the CPU, this finishes in approximately 18866700 ns (about 19 ms).

Running this with a single thread and a single block on the GPU takes only 51300 ns.
How can this be? A single GPU thread should be slower than one CPU core, not ~370x faster, shouldn't it?

Here is my code:

#include <chrono>
#include <cuda_runtime.h>

#define COUNT 10000000

// 18866700ns runtime
void vector_add(float* out, float* a, float* b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}


// 51300ns runtime
__global__ void vector_add_cuda(float* out, float* a, float* b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}


int main()
{
    float* out = new float[COUNT], * a = new float[COUNT], * b = new float[COUNT];
    for (int i = 0; i < COUNT; i++)
    {
        a[i] = 2 * i;
        b[i] = COUNT - i;
    }

    float* d_out, * d_a, * d_b;
    cudaMalloc(&d_out, sizeof(float) * COUNT);
    cudaMalloc(&d_a, sizeof(float) * COUNT);
    cudaMalloc(&d_b, sizeof(float) * COUNT);

    cudaMemcpy(d_a, a, sizeof(float) * COUNT, cudaMemcpyDefault);
    cudaMemcpy(d_b, b, sizeof(float) * COUNT, cudaMemcpyDefault);

    auto start = std::chrono::high_resolution_clock::now();

    //only one thread and block here!
    vector_add_cuda<<<1, 1>>>(d_out, d_a, d_b, COUNT);

    //vector_add(out, a, b, COUNT);

    auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();

    cudaMemcpy(out, d_out, sizeof(float) * COUNT, cudaMemcpyDefault);

    cudaFree(d_out);
    cudaFree(d_a);
    cudaFree(d_b);
}

I have no idea why this happens and I can't find anything about it online.
Can someone explain it to me?

RD4
  • At this point I'm not yet sure you're measuring what you'd like to measure. The compiler can reorder _many_ things. Have you made sure that the data you pass to the GPU is actually sent, processed and sent back? No, simply writing the calls in that order does not prevent reordering. – lorro Dec 20 '22 at 15:34
  • Does this answer your question? [Are cuda kernel calls synchronous or asynchronous](https://stackoverflow.com/questions/8473617/are-cuda-kernel-calls-synchronous-or-asynchronous) By the time `elapsed` is calculated, the CUDA kernel may not have finished its computation yet. – kotatsuyaki Dec 20 '22 at 15:35
  • See [`cudaDeviceSynchronize`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g10e20b05a95f638a4071a655503df25d). – molbdnilo Dec 20 '22 at 15:45
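
A minimal sketch of the timing with the synchronization suggested in the comments (assuming the same kernel and buffers as in the question): the launch returns immediately, so the host has to wait explicitly before reading the clock again.

    auto start = std::chrono::high_resolution_clock::now();

    vector_add_cuda<<<1, 1>>>(d_out, d_a, d_b, COUNT);

    // The launch is asynchronous; block until the kernel has actually finished.
    cudaDeviceSynchronize();

    auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::high_resolution_clock::now() - start).count();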

1 Answer


I moved the `cudaMemcpy(out, d_out, sizeof(float) * COUNT, cudaMemcpyDefault);` call to before the point where `elapsed` is calculated, and that fixed it. The measured time is now 8152790100 ns (about 8.15 s). I assume this is because the kernel launch is asynchronous, so the computation isn't finished yet when `elapsed` is calculated; the blocking copy forces the host to wait for the result.
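
For reference, a minimal sketch of the reordered measurement described above (same kernel and buffers as in the question). The device-to-host cudaMemcpy blocks until the kernel has finished, so the clock is only read again after the computation (and the copy) is complete:

    auto start = std::chrono::high_resolution_clock::now();

    vector_add_cuda<<<1, 1>>>(d_out, d_a, d_b, COUNT);

    // This copy blocks until the kernel is done, so the measurement now covers
    // the actual computation (plus the device-to-host transfer).
    cudaMemcpy(out, d_out, sizeof(float) * COUNT, cudaMemcpyDefault);

    auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::high_resolution_clock::now() - start).count();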

RD4