
I am trying to understand vectorized memory access and implemented a simple example to evaluate its performance. But I found that the vectorized kernel is slower than the naive one. Why?

In the vectorized kernel, I recast the int pointer to an int2 pointer and then do the copy through it.

This is the code I used:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>  // rand()
#include <chrono>
#include <iostream>

void initData_int(int *p, int size){
    for (int t=0; t<size; t++){
        p[t] = (int)(rand()&0xff);
    }
}

__global__ void naiveCopy(int *d_in, int *d_out, int size)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    for (int i = tid; i < size; i += blockDim.x*gridDim.x)
    {
        d_out[i] = d_in[i];
    }
}

__global__ void vecCopy(int *d_in, int *d_out, int size)
{
    int2* in = (int2*)d_in;
    int2* out = (int2*)d_out;
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    for (int i = tid; i < size/2; i += blockDim.x*gridDim.x)
    {
        out[i] = in[i];
    }

    if(tid==size/2 && size%2==1)
        d_out[size-1] = d_in[size-1];
}

int main(int argc, char **argv)
{
    int size = 1<<24;
    //int size = 128;
    int nBytes = size*sizeof(int);
    int *d_h;
    cudaMallocHost((int**)&d_h, nBytes);
    initData_int(d_h, size);

    //printData(d_h, size);

    int *res = (int*)malloc(nBytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int *d_in, *d_out;
    dim3 block(128, 1);
    dim3 grid((size-1)/block.x+1, 1);
    cudaMalloc((int**)&d_in, nBytes);
    cudaMalloc((int**)&d_out, nBytes);

    cudaMemcpyAsync(d_in, d_h, nBytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
    auto s_0 = std::chrono::system_clock::now();
    naiveCopy<<<grid, block, 0, stream>>>(d_in, d_out, size);
    cudaStreamSynchronize(stream);
    auto e_0 = std::chrono::system_clock::now();
    std::chrono::duration<double> diff = e_0 - s_0;
    printf("Naive Kernel time cost is: %2f.\n", diff.count());
    
    memset(res, 0, nBytes);
    cudaMemset(d_out, 0, nBytes);
    //vectorized access:
    cudaStreamSynchronize(stream);
    s_0 = std::chrono::system_clock::now();
    vecCopy<<<grid, block, 0, stream>>>(d_in, d_out, size);
    cudaStreamSynchronize(stream);
    e_0 = std::chrono::system_clock::now();
    diff = e_0 - s_0;
    printf("Vectorized kernel time cost is: %2f.\n", diff.count());

    cudaStreamDestroy(stream);
    cudaFree(d_h);
    cudaFree(d_in);
    cudaFree(d_out);
    free(res);

    return 0;
} 

This is the data from nvprof:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   89.28%  5.5024ms         1  5.5024ms  5.5024ms  5.5024ms  [CUDA memcpy HtoD]
                    4.82%  296.94us         1  296.94us  296.94us  296.94us  vecCopy(int*, int*, int)
                    3.99%  246.19us         1  246.19us  246.19us  246.19us  naiveCopy(int*, int*, int)

Could you please explain what causes the performance degradation?

kingwales
    Do you really believe that 50 us between a single measurement of two kernels represents a "performance degradation"? – talonmies Sep 06 '21 at 00:27
  • I measured the duration 10 times and took the average ```diff.count```, which was 0.000269 and 0.000413, respectively. That's about 2 times slower. – kingwales Sep 06 '21 at 00:52
    You show the result of a single kernel run for each. Nothing more. I can only comment on what you show me. And you are using the wrong strategy for block sizes for both kernels. This is a flawed experiment and commenting on flawed experiments means discussing the flaws, not the results – talonmies Sep 06 '21 at 01:35
  • Why do you say the block configuration is the wrong strategy? I followed the [example](https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/) from this blog to configure it. What's your suggestion... – kingwales Sep 06 '21 at 04:54
    The point of the grid stride loop is that you launch as many blocks as will fully occupy your GPU and no more, so that every thread performs multiple operations. This amortizes the cost of scheduling latency over many transactions, rather than just one: https://developer.nvidia.com/blog/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/ – talonmies Sep 06 '21 at 05:30
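
The occupancy-based launch talonmies describes can be sketched like this (a hedged sketch: the `launchCopy` helper and its name are illustrative, not from the thread; `cudaOccupancyMaxPotentialBlockSize` is a real CUDA runtime API):

```cuda
#include <cuda_runtime.h>

__global__ void vecCopy(int *d_in, int *d_out, int size);  // kernel from the question

// Illustrative helper: pick a block size that maximizes occupancy, then
// launch only as many blocks as can be resident on the device at once,
// letting the grid-stride loop cover the rest of the data.
void launchCopy(int *d_in, int *d_out, int size, cudaStream_t stream)
{
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vecCopy,
                                       0 /*dynamic smem*/, 0 /*no block limit*/);
    // minGridSize is the smallest grid that fully occupies the device;
    // launching many more blocks than this mostly adds scheduling overhead
    // for a memory-bound copy like this one.
    vecCopy<<<minGridSize, blockSize, 0, stream>>>(d_in, d_out, size);
}
```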

1 Answer


You are not doing a good job of grid sizing. Your grid dimensions might be sensible for the naive kernel:

dim3 grid((size-1)/block.x+1, 1);

But they are twice as large as they need to be for the vectorized copy kernel, because each thread there copies an int2 (two ints).

When I cut the grid size in half for the vectorized kernel (to match the methodology for the naive kernel):

dim3 grid2((size/2+block.x-1)/block.x);

then according to my testing, the vectorized copy kernel becomes faster:

                3.88%  233.99us         1  233.99us  233.99us  233.99us  naiveCopy(int*, int*, int)
                2.84%  171.33us         1  171.33us  171.33us  171.33us  vecCopy(int*, int*, int)

Notes:

  1. cudaFree is not the correct API to use with cudaMallocHost. The correct API is cudaFreeHost.

  2. We can probably do a better job of grid sizing, as was mentioned in the comments, by sizing the grid to match the GPU you are running on. However, we don't need to take this step in order to demonstrate the improvement here.
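
A sketch of the corrected cleanup from note 1 (the rest of the question's main() unchanged):

```cuda
cudaStreamDestroy(stream);
cudaFreeHost(d_h);   // pinned memory from cudaMallocHost pairs with cudaFreeHost
cudaFree(d_in);      // cudaMalloc'd device memory still uses cudaFree
cudaFree(d_out);
free(res);           // plain malloc pairs with free
```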

Robert Crovella
  • Hi Robert, is the performance of vectorized access related to hardware? When I ran the code on 2080ti, vectorized one is still worse. But when I ran it on V100, i got the same perf data as yours. – kingwales Sep 07 '21 at 02:26
  • There can be performance differences from one GPU arch to the next for any number of reasons. I don't have a 2080ti to work with, and furthermore saying "one is still worse" doesn't provide much to work with. If the difference is a few microseconds or a few percent, I would consider that to be "in the noise". I don't think we'll solve this in the comments, so I'm unlikely to respond to further inquiries of this type in the comments. I think to do a good job here, one would want to size the grid appropriately for each device, which has already been suggested and is a best practice. – Robert Crovella Sep 07 '21 at 18:46