
How can I modify this code to get 100% load of my GPU?

#include <iostream>
#include <cstdio>        // printf
#include <cstdlib>       // system
#include <cuda_runtime.h>

using namespace std;

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    // Get the unique ID of this kernel instance
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        y[i] = a*x[i] + y[i];
    }
}

int main(int argc, char const *argv[])
{
    // Tensors length
    int const n = 100;

    // Define tensors
    float x[n], y[n];

    for (int i = 0; i < n; ++i)
    {
        x[i] = 1.0f*i;
        y[i] = 1.0f*i;
    }

    // Device pointers
    float *d_x, *d_y;

    cudaMalloc(&d_x, n*sizeof(float));
    cudaMalloc(&d_y, n*sizeof(float));

    if (cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice) != cudaSuccess)
    {
        printf("Memory Error!\n");
        return 0;
    }

    if (cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice) != cudaSuccess)
    {
        printf("Memory Error!\n");
        return 0;
    }

    // Run the kernel
    saxpy_parallel<<<4096, 512>>>(n, 2.0, d_x, d_y);

    // Retrieve results from the device memory
    cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_y);
    cudaFree(d_x);

    printf("%f\n", y[0]);

    system("PAUSE");
    return 0;
}
Aurelius
    Define "100% GPU usage" - what do you mean? – talonmies Jan 08 '14 at 13:51
  • @talonmies I'm sorry. I mean this: http://www.legitreviews.com/images/reviews/1688/GPUzLoad.png Take a look at the voice "GPU Load" – Aurelius Jan 08 '14 at 13:53
  • I'm afraid that the sensor parameter "GPU load" you are mentioning has nothing to do with CUDA programming. – Vitality Jan 08 '14 at 14:26
  • @FormlessCloud: unless you can explain or provide a definition of exactly what that quantity measures, I don't think anyone here can help you. – talonmies Jan 08 '14 at 14:36
  • @JackOLantern Please, can you explain to me what you mean? For example, if I run this CUDA program: https://code.google.com/p/lbm-c/ I get 99% on my GPU load sensor... My question is kind of similar to this one: http://stackoverflow.com/questions/9244481/how-to-get-100-cpu-usage-from-a-c-program but for the GPU/CUDA side. – Aurelius Jan 08 '14 at 22:53
  • I mean exactly what @talonmies meant in his comment. I couldn't find a definition of _GPU load_ in either the CUDA C Programming Guide or the CUDA C Best Practices Guide. So I think _you_ have to explain something to this community. How can we maximize your _GPU load_ parameter if we don't know its definition? Voting to close. – Vitality Jan 09 '14 at 06:32
  • @JackOLantern Ok, I got it. One more question: why did Mysticial (in this question: http://stackoverflow.com/questions/9244481/how-to-get-100-cpu-usage-from-a-c-program) not ask for a definition of 100% CPU load? What is the difference with 100% GPU load? I don't understand why 100% CPU load is clear and 100% GPU load is not. Thanks. – Aurelius Jan 12 '14 at 14:23
  • I haven't said that the question you linked to is clear. The definition of GPU usage seems not to be clear even to you, as you haven't been able to provide it yet. It may be unpleasant to hear, but this should make you think that the problem you are dealing with is not yet formulated in a satisfactory way. – Vitality Jan 12 '14 at 19:57

1 Answer


OK, let's set aside the 100% GPU load goal, since it's unrealistic and not easily measurable. Let's assume instead that you want to optimize this code to run faster. What are the levers to pull? Your algorithm is very simple, so it doesn't lend itself to many opportunities, but I can see the following targets:

1) Block size

saxpy_parallel<<<4096, 512>>>

Is 512 the best number? I would start at 32 or 64 and double it as you tune the kernel launch, measuring each time, to find the best value for this parameter.
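One way to do that sweep empirically is to time each launch with CUDA events. This is a minimal sketch, assuming the `d_x`/`d_y` device buffers from the question are already allocated and filled:

```cuda
// Sweep candidate block sizes and time each kernel launch with CUDA events.
// Assumes d_x and d_y are device buffers of length n, as in the question.
for (int block = 32; block <= 1024; block *= 2)
{
    int grid = (n + block - 1) / block;   // enough blocks to cover n elements

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    saxpy_parallel<<<grid, block>>>(n, 2.0f, d_x, d_y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("block=%4d grid=%6d time=%.3f ms\n", block, grid, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Note that for n = 100 a single launch is far too short to time meaningfully; in practice you would launch each configuration in a loop and average.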

2) Remove unnecessary code

if( i < n )

The if statement can be dropped if i is guaranteed to always be less than n, i.e. if the grid launches exactly as many threads as there are elements. This can be controlled externally to the kernel. It may be necessary to pad an odd-sized array up to a multiple of the block size to make this work.
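For example (a sketch under the assumption above: buffers allocated and zero-padded to the rounded-up length, grid sized to cover it exactly):

```cuda
// Kernel without the bounds check: only safe when the grid covers
// exactly paddedN elements and the buffers are allocated to paddedN.
__global__ void saxpy_nocheck(float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = a*x[i] + y[i];
}

// Host side: round n up to a multiple of the block size.
int block   = 256;                               // chosen block size
int paddedN = ((n + block - 1) / block) * block; // e.g. n=100 -> paddedN=256
// Allocate d_x and d_y with paddedN*sizeof(float), zero the tail,
// then launch exactly paddedN threads:
saxpy_nocheck<<<paddedN / block, block>>>(2.0f, d_x, d_y);
```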

3) Explore the use of vector types

CUDA has float2 and float4 vector types. Rework the code to use one of these, in the hope of faster memory access through fewer, wider loads and stores, with the arithmetic on the components still occurring in parallel.
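A float4 version might look like the sketch below. It assumes n is a multiple of 4 and that the buffers are suitably aligned so they can be reinterpreted as float4*:

```cuda
// SAXPY over float4: each thread loads and stores one 16-byte vector,
// doing four multiply-adds per pair of memory transactions.
__global__ void saxpy_float4(int n4, float a, float4 *x, float4 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
    {
        float4 xv = x[i];
        float4 yv = y[i];
        yv.x = a*xv.x + yv.x;
        yv.y = a*xv.y + yv.y;
        yv.z = a*xv.z + yv.z;
        yv.w = a*xv.w + yv.w;
        y[i] = yv;
    }
}

// Launch over n/4 vector elements (assumes n % 4 == 0):
// saxpy_float4<<<(n/4 + 255)/256, 256>>>(n/4, 2.0f,
//                                        (float4*)d_x, (float4*)d_y);
```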

4) Unjam the Loop

Each thread currently processes one element of x and y. Try processing 2, 4, or 8 elements per thread:

 int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
 y[i]   = a*x[i]   + y[i];
 y[i+1] = a*x[i+1] + y[i+1];
 y[i+2] = a*x[i+2] + y[i+2];
 y[i+3] = a*x[i+3] + y[i+3];

This needs fewer threads, but each thread does more work. Try unjamming with 2, 4, or 8 values, and remember to shrink the grid accordingly.

5) Store the result into a different variable

Pass an extra parameter in for the result. Then re-code

__global__ void saxpy_parallel(int n, float a, float *x, float *y, float * b)

...

  b[i] = a*x[i] + y[i];

This trades extra memory for avoiding reads and writes to the same location.

You won't be able to determine what works without trying each of these approaches independently and measuring the effect before and after. Some combinations may then work better or worse together.

Try it out and have fun and let us know!

BlackBear
Tim Child