
I am looking for the most concise code possible that can be written both for a CPU (using g++) and a GPU (using nvcc), and for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.

To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc), for which the GPU outperforms the CPU. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.

Chris Redford
  • This ain't code, but does show in real world terms how GPUs will beat up CPUs: http://www.nvidia.com/object/nvision08_gpu_v_cpu.html – Marc B Oct 05 '11 at 14:59
  • I hope my "To clarify" clause wards off those "close" votes. I think it is pretty clear what I am asking for at this point. – Chris Redford Oct 05 '11 at 15:43
  • GPUs are inherently high-latency, high-bandwidth. As such, to meaningfully compare performance, you should be comparing running times for code that takes a while to run... that being said, maybe my answer? – Patrick87 Oct 05 '11 at 15:52
  • Not really sure what the close votes are for - it's a clear, valid question. – 3Dave Oct 05 '11 at 15:56
  • How about forcing some 3D game to use software rendering? – harold Oct 05 '11 at 16:07
  • @harold any game worth its salt is going to use a lot of CPU resources even when using hardware acceleration, so that's not really a fair comparison. – 3Dave Oct 05 '11 at 16:11
  • @DavidLively Warcraft 3 doesn't, it uses practically 0% CPU under normal circumstances - not worth its salt I guess? – harold Oct 05 '11 at 16:15
  • @harold I'm not interested in an argument, but unless it's not loading anything from disk, using host RAM, networking, or anything else, it IS using the CPU. Any game developer will tell you that an idle CPU is a valuable resource that's not being taken advantage of, and I doubt Warcraft falls under that category. – 3Dave Oct 05 '11 at 16:27
  • @DavidLively sure, it uses the CPU, otherwise it wouldn't be doing anything - but seriously it uses very little (perhaps because I force v-sync), so the unfairness would be minimal – harold Oct 05 '11 at 16:31

4 Answers


First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU for a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two foot race, simply because it takes some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.

Use something like this in C++:

  #include <cstdio>

  #define N (1024*1024)
  #define M (1000000)

  int main()
  {
     static float data[N];   // static: a 4 MB array would overflow the stack on some platforms
     for(int i = 0; i < N; i++)
     {
        data[i] = 1.0f * i / N;
        for(int j = 0; j < M; j++)
        {
           data[i] = data[i] * data[i] - 0.25f;
        }
     }
     int sel;
     printf("Enter an index: ");
     scanf("%d", &sel);
     printf("data[%d] = %f\n", sel, data[sel]);
     return 0;
  }

Use something like this in CUDA/C:

  #include <cstdio>
  #include <cuda_runtime.h>

  #define N (1024*1024)
  #define M (1000000)

  __global__ void cudakernel(float *buf)
  {
     int i = threadIdx.x + blockIdx.x * blockDim.x;
     if (i >= N) return;   // guard in case the grid is larger than N

     buf[i] = 1.0f * i / N;
     for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
  }

  int main()
  {
     static float data[N];   // static: a 4 MB array would overflow the stack on some platforms
     float *d_data;
     cudaMalloc(&d_data, N * sizeof(float));
     cudakernel<<<N/256, 256>>>(d_data);
     cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
     cudaFree(d_data);

     int sel;
     printf("Enter an index: ");
     scanf("%d", &sel);
     printf("data[%d] = %f\n", sel, data[sel]);
     return 0;
  }

If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.
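
If you want an actual number for the kernel alone (excluding the PCIe transfer), you can wrap the launch in CUDA events. The fragment below is only a sketch: it assumes the cudakernel, N and d_data from the example above, and omits error checking.

  cudaEvent_t start, stop;               // events mark points on the GPU timeline
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  cudakernel<<<N/256, 256>>>(d_data);    // the launch being measured
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);            // block the CPU until the kernel has finished

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("kernel time: %f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);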

Patrick87
  • The Ferrari analogy was good. Thanks for thinking that explanation out and for the code. I'll try this code and see how it works. – Chris Redford Oct 05 '11 at 18:22
  • Will this run as-is? I'm getting an error for `identifier "cudakernel" is undefined`. I'm very new to CUDA, so I will need explicitly runnable code, please. Otherwise, I'm basically forced to learn from scratch, which is what asking this question was meant to avoid. – Chris Redford Oct 05 '11 at 23:14
  • Damn. Still not quite working. The j and the i are mixed up in the cudakernel. I'm getting closer to understanding how to debug it myself but I really want to make sure the operations match up. Do you mind giving the CUDA code a compile to make sure it's what you expect? – Chris Redford Oct 06 '11 at 00:02
  • Hey, my bad. Not at a computer right now. Change "idx" to "i", and that should work. If not, I'll take a look tomorrow. – Patrick87 Oct 06 '11 at 00:10
  • Alright. With the idx fix, that does it. Here are some notes, though: the best value for M on my machine is 1000. It is impressive that CUDA can perform the large value you listed in about 16 seconds, while the CPU doesn't finish even after 5 minutes. HOWEVER: the value you use also crashed my video card after testing it 3 times ;) Perhaps some memory management is missing from your CUDA code? Anyhow, I'm accepting your answer. Thank you. Please make sure it is able to compile as-is when you get to a computer and check the memory management, for posterity. – Chris Redford Oct 06 '11 at 00:48
  • Yeah, you might need a cudaFree(d_data); at the end of main()... I'll test it out tomorrow and see how it does with the fix. Glad that this helped, though. – Patrick87 Oct 06 '11 at 01:33

For reference, I made a similar example with time measurements. On a GTX 660 the GPU came out about 24x faster, with the GPU timing including the host-device data transfers as well as the actual computation.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <time.h>

#define N (1024*1024)
#define M (10000)
#define THREADS_PER_BLOCK 1024

void serial_add(double *a, double *b, double *c, int n, int m)
{
    for(int index=0;index<n;index++)
    {
        for(int j=0;j<m;j++)
        {
            c[index] = a[index]*a[index] + b[index]*b[index];
        }
    }
}

__global__ void vector_add(double *a, double *b, double *c)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    for(int j=0;j<M;j++)
    {
        c[index] = a[index]*a[index] + b[index]*b[index];
    }
}

int main()
{
    clock_t start,end;

    double *a, *b, *c;
    int size = N * sizeof( double );

    a = (double *)malloc( size );
    b = (double *)malloc( size );
    c = (double *)malloc( size );

    for( int i = 0; i < N; i++ )
    {
        a[i] = b[i] = i;
        c[i] = 0;
    }

    start = clock();
    serial_add(a, b, c, N, M);

    printf( "c[0] = %d\n",0,c[0] );
    printf( "c[%d] = %d\n",N-1, c[N-1] );

    end = clock();

    float time1 = ((float)(end-start))/CLOCKS_PER_SEC;
    printf("Serial: %f seconds\n",time1);

    start = clock();
    double *d_a, *d_b, *d_c;


    cudaMalloc( (void **) &d_a, size );
    cudaMalloc( (void **) &d_b, size );
    cudaMalloc( (void **) &d_c, size );


    cudaMemcpy( d_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( d_b, b, size, cudaMemcpyHostToDevice );

    vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );

    cudaMemcpy( c, d_c, size, cudaMemcpyDeviceToHost );


    printf( "c[0] = %d\n",0,c[0] );
    printf( "c[%d] = %d\n",N-1, c[N-1] );


    free(a);
    free(b);
    free(c);
    cudaFree( d_a );
    cudaFree( d_b );
    cudaFree( d_c );

    end = clock();
    float time2 = ((float)(end-start))/CLOCKS_PER_SEC;
    printf("CUDA: %f seconds, Speedup: %f\n",time2, time1/time2);

    return 0;
} 
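
The listing above has no CUDA error checking, so a failed allocation or launch would go unnoticed. A small checking helper is worth adding; the macro below is only a sketch built on the standard cudaError_t / cudaGetErrorString API.

// Sketch only: report CUDA runtime failures instead of silently ignoring them.
#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

/* Example usage with the code above:
     CUDA_CHECK( cudaMalloc( (void **) &d_a, size ) );
     vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );
     CUDA_CHECK( cudaGetLastError() );        // catches launch configuration errors
     CUDA_CHECK( cudaDeviceSynchronize() );   // catches errors raised while the kernel runs */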
Tae-Sung Shin

A very, very simple method would be to calculate the squares for, say, the first 100,000 integers, or a large matrix operation. It's easy to implement and lends itself to the GPU's strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ a while back and got some pretty astonishing results. (A 2GB GTX460 achieved about 40x the performance of a dual core CPU.)

Are you looking for example code, or just ideas?

Edit

The 40x was vs a dual core CPU, not a quad core.

Some pointers:

  • Make sure you're not running, say, Crysis while running your benchmarks.
  • Shut down all unnecessary apps and services that might be stealing CPU time.
  • Make sure your kid doesn't start watching a movie on your PC while the benchmarks are running. Hardware MPEG decoding tends to influence the outcome. (Autoplay let my two year old start Despicable Me by inserting the disk. Yay.)

As I said in my comment response to @Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.

(These are probably pretty obvious in retrospect.)
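
For reference, here is roughly what the squares idea looks like as a bare-bones CUDA program. My comparison was done in OpenCL, so treat this only as an illustrative sketch of the same workload (the array size and launch configuration are arbitrary), not the code I benchmarked:

// Sketch only: square the first n integers on the GPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void squares(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (float)i * (float)i;
}

int main()
{
    const int n = 100000;                      // "the first 100,000 integers"
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up so every element is covered
    squares<<<blocks, threads>>>(d_out, n);

    float first[4];
    cudaMemcpy(first, d_out, sizeof(first), cudaMemcpyDeviceToHost);
    printf("%f %f %f %f\n", first[0], first[1], first[2], first[3]);

    cudaFree(d_out);
    return 0;
}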

3Dave
  • Were you using all 4 cores of the CPU or just 1 ? – Paul R Oct 05 '11 at 15:09
  • All four. The nice thing about OpenCL vs CUDA is that it'll run on just about any device (CPU, GPU, Cell, etc), so it's pretty easy to compare results. The C++ implementation was comparable to the OpenCL version running on the CPU on a single core. I used OpenCL on the CPU to use all four cores for the final comparison. – 3Dave Oct 05 '11 at 15:16
  • @Paul R - correction, the 40x was on a dual-core CPU using both cores. On the quad-core I got about 30x (not 20 as I would have expected). – 3Dave Oct 05 '11 at 15:21
  • OK - thanks - and are you including the time taken to copy the results from the GPU back to the host memory ? – Paul R Oct 05 '11 at 15:22
  • Yep. That was running on a minimal Windows 7 install with everything disabled that wasn't required for the simulation. – 3Dave Oct 05 '11 at 15:26
  • Thanks for the clarification - it would be interesting to optimise the CPU version, e.g. using SSE, to see what the ratio is in the case of highly optimised CPU code versus GPU. – Paul R Oct 05 '11 at 15:35
  • Hi David. Thanks for the info. Yes, I would really appreciate some actual code. OpenCL code is fine. Also, I am just looking for a simple example that could execute on the scale of seconds or milliseconds, showing the GPU outperform. – Chris Redford Oct 05 '11 at 15:40
  • @ChrisRedford I'll see what I can dig up. – 3Dave Oct 05 '11 at 15:54
  • @PaulR The OpenCL CPU code is compiled by vendor-specific drivers (in this case, by the Intel compiler) at run-time. I seem to remember that it uses SSE, performs most standard optimizations, etc., but I could be wrong. – 3Dave Oct 05 '11 at 15:55
  • OpenCL is probably only fair if you're using ATI or some other graphics brand besides NVIDIA. I believe that CUDA kills OpenCL in performance comparisons on NVIDIA devices... I know that was true at one point, but maybe this has changed. – Patrick87 Oct 05 '11 at 16:10
  • CUDA had a big head start relative to OpenCL so it's likely that OpenCL still has some catching up to do. – Paul R Oct 05 '11 at 16:35

I agree with David's comments about OpenCL being a great way to test this, because of how easy it is to switch between running code on the CPU vs. GPU. If you're able to work on a Mac, Apple has a nice bit of sample code that does an N-body simulation using OpenCL, with kernels running on the CPU, GPU, or both. You can switch between them in real time, and the FPS count is displayed onscreen.

For a much simpler case, they have a "hello world" OpenCL command line application that calculates squares in a manner similar to what David describes. That could probably be ported to non-Mac platforms without much effort. To switch between GPU and CPU usage, I believe you just need to change the

int gpu = 1;

line in the hello.c source file to 0 for CPU, 1 for GPU.

Apple has some more OpenCL example code in their main Mac source code listing.

Dr. David Gohara had an example of OpenCL's GPU speedup when performing molecular dynamics calculations at the very end of this introductory video session on the topic (around minute 34). In his calculation, he sees a roughly 27X speedup by going from a parallel implementation running on 8 CPU cores to a single GPU. Again, it's not the simplest of examples, but it shows a real-world application and the advantage of running certain calculations on the GPU.

I've also done some tinkering in the mobile space using OpenGL ES shaders to perform rudimentary calculations. I found that a simple color thresholding shader run across an image was roughly 14-28X faster when run as a shader on the GPU than the same calculation performed on the CPU for this particular device.

Brad Larson
  • dead link "N-body simulation using OpenCL" >>> https://developer.apple.com/library/content/samplecode/OpenCL_NBody_Simulation/Introduction/Intro.html? – Dave Engineer Jan 29 '18 at 13:36
  • @DaveEngineer - Apple always moves those around. I believe I've found the new locations for them. – Brad Larson Jan 29 '18 at 15:00