
I have some doubts about how CUDA threads are processed by the SMs. The following propositions are inferred from what I have been reading. My GPU is a GTX 650 Ti.

  1. The thread count in a block must ALWAYS be a multiple of the warp size. So each SM can process blocks of 32 threads (warpSize).
  2. The maximum thread count my SM can compute at the same time is 2048 (maxThreadsPerMultiProcessor).
  3. Since 2048 threads can be computed at the same time in each SM and the warpSize is 32, 64 blocks can be computed at the same time.
  4. Since my GPU has 4 SMs, 64 x 4 = 256 blocks of threads can be executed at the same time.
  5. Therefore, the kernel launch may have the following launch parameters: <<<256, 32>>>, and each kernel launch will invoke 8192 threads.

Is that right?

So if I have a vector of 10M elements to process in my kernel, does that mean I have to segment it into 1221 jobs (kernel launches) of 8192 elements each?

This question arose because I am comparing the time performance of a sequential program against my CUDA program, and all I can see is that the CPU beats the GPU. I also tried the maximum launch parameters, e.g. <<<65535, 1024>>>. The results are very similar.

So, what am I doing or configuring wrong?

This is the code I'm using:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <math.h>
#include <time.h>
#include "C:\cdev.h"
#include <thrust/device_vector.h>

using namespace thrust;
using namespace std;

#define N (1024 * 16384)

cdev devices;

__global__ void eucliDist(double *c, double *a, double *b)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        c[i] = sqrt(pow(a[i], 2) + pow(b[i], 2));
}

int main()
{
    clock_t start, end;
    double elapsed;
    static double A[N];
    static double B[N];
    for (int i = 0; i < N; i++)
    {
        A[i] = double(i);
        B[i] = double(i);
    }   
    static double C[N];

    // Sequential execution of F(x,y) = sqrt((x^2 + y^2))
    start = clock();
    for (int i = 0; i < N; i++)
        C[i] = sqrt(pow(A[i], 2) + pow(B[i], 2));
    end = clock();
    elapsed = double(end - start) / CLOCKS_PER_SEC;
    cout << "Elapsed time for sequential processing is: " << elapsed << " seconds." << endl;

    // CUDA Initialization
    unsigned int threadNum;
    unsigned int blockNum;
    cudaError_t cudaStatus;
    threadNum = devices.ID[0].maxThreadsPerBlock;
    blockNum = ceil(double(N) / double(threadNum));
    // Parallel execution with Thrust of F(x,y) = sqrt((x^2 + y^2))
    vector<double> vectorA(N);
    vector<double> vectorB(N);
    for (int i = 0; i < N; i++)
    {
        vectorA[i] = double(i);
        vectorB[i] = double(i);
    }
    vector<double> vectorC(N);
    start = clock();
    device_vector<double> thrustA(N);
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
    {
        cerr << "Device vector allocation failed: " << cudaGetErrorString(cudaStatus) << " (thrustA)" << endl;
        cin.get();
        return 1;
    }
    device_vector<double> thrustB(N);
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
    {
        cerr << "Device vector allocation failed: " << cudaGetErrorString(cudaStatus) << " (thrustB)" << endl;
        cin.get();
        return 1;
    }
    device_vector<double> thrustC(N);
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
    {
        cerr << "Device vector allocation failed: " << cudaGetErrorString(cudaStatus) << " (thrustC)" << endl;
        cin.get();
        return 1;
    }
    thrustA = vectorA;
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
    {
        cerr << "Host to device copy failed (Thrust): " << cudaGetErrorString(cudaStatus) << " (vectorA -> thrustA)" << endl;
        cin.get();
        return 1;
    }
    thrustB = vectorB;
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
    {
        cerr << "Host to device copy failed (Thrust): " << cudaGetErrorString(cudaStatus) << " (vectorB -> thrustB)" << endl;
        cin.get();
        return 1;
    }
    eucliDist <<<blockNum, threadNum>>>(raw_pointer_cast(thrustC.data()), raw_pointer_cast(thrustA.data()), raw_pointer_cast(thrustB.data()));
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
    {
        cerr << "Kernel launch failed (Thrust): " << cudaGetErrorString(cudaStatus) << " (euclidDist)" << endl;
        cin.get();
        return 1;
    }
    thrust::copy(thrustC.begin(), thrustC.end(), vectorC.begin());
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
    {
        cerr << "Device to host copy failed: " << cudaGetErrorString(cudaStatus) << " (thrustC -> vectorC)" << endl;
        cin.get();
        return 1;
    }
    end = clock();
    elapsed = double(end - start) / CLOCKS_PER_SEC;
    cout << "Elapsed time parallel processing is (Thrust): " << elapsed << " seconds." << endl;

    cin.get();
    return 0;
}

Suggestions will be appreciated.

Vitrion
  • Run your program in a profiler and see where the time is spent. Don't get hung up on hardware details; just launch the kernel once with as many blocks as are needed to cover your entire grid of 10M elements. Start with a textbook "add vector" sample to learn the technique. And the GPU is not always faster than the CPU; it depends on the task. – void_ptr Nov 17 '15 at 17:57
  • I would say everything in your list is incorrect after the first sentence of point 1... – talonmies Nov 17 '15 at 18:06
  • I can justify that. Please refer to the book CUDA Programming: a developer's guide to parallel computing with GPUs by Shane Cook, page 83, section GRIDS, paragraph 2, line 1: "The number of threads in a block should always be a multiple of the Warp size, which is currently defined as 32." Maybe I exaggerated when I said MUST; this is only a recommendation, meant to avoid using conditional statements for elements beyond the vector size. – Vitrion Nov 17 '15 at 18:20
  • You're confused on quite a few points and effectively asking for an extensive tutorial on a number of CUDA topics in this question. As @talonmies has said, nearly all of your conjecture is wrong. Where to start? Perhaps you should study a simple code like vectorAdd and observe that it can process a large vector in a single kernel launch. You absolutely do not need 1221 kernel launches to process a 10M element vector. – Robert Crovella Nov 17 '15 at 18:55
  • @void_ptr: please see the following screenshot: http://1drv.ms/1PMzRPs. This is what I got with the NSight tool. I infer that only a single SM (SM0: 83.2%) is working on my kernel; the other 3 are not doing anything. Could that be the reason for this delay? – Vitrion Nov 17 '15 at 18:55
  • Thank you @talonmies and Robert Crovella. Now I know that I don't have to divide the job across several kernel launches. But if I launch a kernel with the maximum launch parameters, i.e. <<<65535, 1024>>>, I can request 67,107,840 threads, more than the 10M threads I require. So why is it slower than the CPU, no matter which launch configuration I use? – Vitrion Nov 17 '15 at 19:06
  • @Vitrion: But why should it be automatically faster? You have provided absolutely no information at all about your code, so it is impossible to say what the performance issues with your code might be or what you might do to fix them. – talonmies Nov 17 '15 at 19:11
  • I added my code. Thank you for spending time on my question. I hope my problem is clearer now. – Vitrion Nov 18 '15 at 15:56
  • If you have a question about timing data you're seeing, you should put that data in your question. –  Nov 18 '15 at 16:20

3 Answers


Let's start by correcting a lot of what you posted in your question:

  1. The thread count in a block should always be a multiple of the warp size. Each SM can process blocks whose size is a multiple of 32 threads (warpSize), up to a maximum of 1024 threads per block (cudaDevAttrMaxThreadsPerBlock).
  2. The maximum thread count each SM can compute at the same time is 2048 (cudaDevAttrMaxThreadsPerMultiProcessor).
  3. Up to 16 blocks can be resident on an SM at the same time.
  4. Because my GPU has 4 SMs, up to 16 x 4 = 64 blocks of threads can be executed at the same time.
  5. The kernel launch parameters can be anything up to the architecture maximum, subject to the resource limits summarised here. The maximum resident number of threads on a device is 4 x 2048 = 8192 threads.
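
All of these limits can also be queried at runtime rather than hard-coded. A minimal sketch using cudaDeviceGetAttribute for device 0 (error checking omitted):

int warpSz = 0, maxThreadsPerBlock = 0, maxThreadsPerSM = 0, smCount = 0;
cudaDeviceGetAttribute(&warpSz, cudaDevAttrWarpSize, 0);
cudaDeviceGetAttribute(&maxThreadsPerBlock, cudaDevAttrMaxThreadsPerBlock, 0);
cudaDeviceGetAttribute(&maxThreadsPerSM, cudaDevAttrMaxThreadsPerMultiProcessor, 0);
cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);
int maxResidentThreads = maxThreadsPerSM * smCount;  // 4 x 2048 = 8192 on a GTX 650 Ti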

So if I have a vector of 10M elements to process in my kernel, does that mean I have to segment it into 1221 jobs (kernel launches) of 8192 elements each?

No, you would launch 9766 blocks of 1024 threads in a single kernel launch. Or launch enough blocks to occupy your GPU fully (so up to 64, depending on resources), and have each thread process multiple elements of the input vector.
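
The second option is usually written as a grid-stride loop. Here is a minimal sketch of your kernel in that form (the element count is passed as a parameter instead of using the N macro):

__global__ void eucliDist(double *c, const double *a, const double *b, int n)
{
    // Each thread starts at its global index and strides by the total number
    // of launched threads, so any grid size covers all n elements.
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        c[i] = sqrt(a[i] * a[i] + b[i] * b[i]);
}

Launched with, say, <<<64, 256>>>, this still processes the whole 10M element vector in a single kernel launch.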

talonmies

You should break down the timings per operation; you might be doing so little work per element that you spend most of your time copying memory back and forth between the host and device.
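
For example, CUDA events can time each stage on its own. A rough sketch, reusing the vector names from your code (error checking omitted):

cudaEvent_t evStart, evStop;
float ms = 0.0f;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

// Time only the host -> device copies.
cudaEventRecord(evStart);
thrustA = vectorA;
thrustB = vectorB;
cudaEventRecord(evStop);
cudaEventSynchronize(evStop);
cudaEventElapsedTime(&ms, evStart, evStop);
cout << "H2D copies: " << ms << " ms" << endl;

// Repeat the same record/synchronize/elapsed pattern around the kernel
// launch and around the device -> host copy to see where the time goes.
cudaEventDestroy(evStart);
cudaEventDestroy(evStop);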

If compute really is the problem, it is probably down to the operations you're using. pow(x, 2) is not a particularly efficient way to square a number. That is bad on a CPU, but it's especially bad on a GPU, since it may force the work onto the special function units; there aren't many of those, so they become a bottleneck when that call makes up the bulk of your calculation.

(aside: single-precision (reciprocal) square roots are handled in a different functional unit that has more throughput available for those)

To make matters worse, you are using double-precision floating point numbers; GPUs are designed to work best with single-precision floating point. They can do double precision, but at much lower throughput.

Thus, you should

  • Compute squares with x*x rather than pow(x,2)
  • Use float instead of double (if adequate for your application)
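
For example, a minimal sketch of the kernel with both changes applied (assuming single precision is adequate for your data):

__global__ void eucliDist(float *c, const float *a, const float *b, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        c[i] = sqrtf(a[i] * a[i] + b[i] * b[i]);  // sqrtf: single-precision square root
}

The CUDA math library also provides hypotf(a[i], b[i]), which computes the same distance while guarding against intermediate overflow.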
  • Yes, I'll try it. But one thing: I tried removing the pow and sqrt functions and performing only a simple sum. The result is the same: the CPU beats the GPU. Running it step by step, I realized that the problem occurs when copying from the host to the device and vice versa. Do you know why this happens? I suspect that the use of thrust::copy() slows down the performance. – Vitrion Nov 18 '15 at 16:32
  • @Vitrion: Then compute is irrelevant: your benchmark is simply measuring the memory bandwidth. There is more bandwidth between host memory and the CPU than there is between host memory and device memory. –  Nov 18 '15 at 16:36
  • @Vitrion - When you have `double`, `sqrt` and `pow` all in one place, you get low GPU compute throughput. When you do no or trivial compute (e.g. a simple add), you waste time sending data between host and device. You need _lots_ of _high throughput_ compute on the GPU in order to outweigh the PCIe transfer overhead. As already said, GPUs are not just automatically faster than CPUs. – void_ptr Nov 18 '15 at 16:52

My application uses thrust::device_vector to allocate memory on the device, which is why I was trying to make Thrust work in my program. I finally found the problem and a solution for improving the GPU performance over the CPU. This may be useful for other users who decided to use device_vectors instead of raw arrays.

As @Hurkyl pointed out, I was really measuring the latency of the copies between the host and the device. That latency came from the following two instructions:

  1. thrustA = vectorA for copying from the host to the device. This assignment looks clear and elegant, but take care.
  2. thrust::copy for copying from the device to the host. This function is used much like std::copy with std::vector.

These two operations are the bottleneck in my code.

Consider the variables:

vector<double> A;
device_vector<double> thrustA;

The solution is very simple. I just replaced these two instructions with the well-known cudaMemcpy() function, i.e.

For copying from the host to the device:

cudaMemcpy(raw_pointer_cast(thrustA.data()), raw_pointer_cast(A.data()), A.size() * sizeof(double), cudaMemcpyHostToDevice); // the size argument is in bytes

And for copying from the device to the host:

cudaMemcpy(raw_pointer_cast(A.data()), raw_pointer_cast(thrustA.data()), A.size() * sizeof(double), cudaMemcpyDeviceToHost); // the size argument is in bytes

Thank you to everyone who spent time on my question. Your comments were very enriching and helped me understand CUDA much better.

Vitrion