
A wave simulator I've been working on with C# + Cudafy (a C# -> CUDA or OpenCL translator) works great, except that the OpenCL CPU version (Intel driver, 15" MacBook Pro Retina, i7 2.7 GHz, GeForce 650M (Kepler, 384 cores)) runs roughly four times as fast as the GPU version.

(This happens whether I use the CL or CUDA GPU backend. The OpenCL GPU and CUDA versions perform nearly identically.)

To clarify, for a sample problem:

  • OpenCL CPU: 1200 Hz
  • OpenCL GPU: 320 Hz
  • CUDA GPU: ~330 Hz

I'm at a loss to explain why the CPU version would be faster than the GPU. In this case, the kernel code that's executing (in the CL case) on the CPU and GPU is identical. I select either the CPU or GPU device during initialization, but beyond that, everything is identical.
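
For reference, the device selection at startup looks roughly like the sketch below. (This is illustrative rather than my actual initialization code; the `useOpenCl` flag and `deviceIndex` are placeholders.)

    // Simplified sketch of device selection with Cudafy. The useOpenCl flag
    // and deviceIndex are illustrative placeholders, not the app's real code.
    using Cudafy;
    using Cudafy.Host;
    using Cudafy.Translator;

    public static class DeviceSetup
    {
        public static GPGPU InitDevice(bool useOpenCl, int deviceIndex)
        {
            // Translate the [Cudafy]-decorated kernels for the chosen backend.
            CudafyTranslator.Language = useOpenCl ? eLanguage.OpenCL : eLanguage.Cuda;
            CudafyModule module = CudafyTranslator.Cudafy();

            // With eGPUType.OpenCL, deviceIndex picks between the Intel CPU
            // driver and the GeForce 650M; eGPUType.Cuda targets the GeForce only.
            GPGPU gpu = CudafyHost.GetDevice(
                useOpenCl ? eGPUType.OpenCL : eGPUType.Cuda,
                deviceIndex);

            gpu.LoadModule(module);
            return gpu;
        }
    }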

Edit

Here's the C# code that launches one of the kernels. (The others are very similar.)

    public override void UpdateEz(Source source, float Time, float ca, float cb)
    {
        var blockSize = new dim3(1);
        var gridSize = new dim3(_gpuEz.Field.GetLength(0),_gpuEz.Field.GetLength(1));

        Gpu.Launch(gridSize, blockSize)
            .CudaUpdateEz(
                Time
                , ca
                , cb
                , source.Position.X
                , source.Position.Y
                , source.Value
                , _gpuHx.Field
                , _gpuHy.Field
                , _gpuEz.Field
            );

    }

And, here's the relevant CUDA kernel function generated by Cudafy:

extern "C" __global__ void CudaUpdateEz(float time, float ca, float cb, int sourceX, int sourceY, float sourceValue,  float* hx, int hxLen0, int hxLen1,  float* hy, int hyLen0, int hyLen1,  float* ez, int ezLen0, int ezLen1)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    if (x > 0 && x < ezLen0 - 1 && y > 0 && y < ezLen1 - 1)
    {
        ez[(x) * ezLen1 + ( y)] = ca * ez[(x) * ezLen1 + ( y)] + cb * (hy[(x) * hyLen1 + ( y)] - hy[(x - 1) * hyLen1 + ( y)]) - cb * (hx[(x) * hxLen1 + ( y)] - hx[(x) * hxLen1 + ( y - 1)]);
    }
    if (x == sourceX && y == sourceY)
    {
        ez[(x) * ezLen1 + ( y)] += sourceValue;
    }
}

Just for completeness, here's the C# that is used to generate the CUDA:

    [Cudafy]
    public static void CudaUpdateEz(
        GThread thread
        , float time
        , float ca
        , float cb
        , int sourceX
        , int sourceY
        , float sourceValue
        , float[,] hx
        , float[,] hy
        , float[,] ez
        )
    {
        var i = thread.blockIdx.x;
        var j = thread.blockIdx.y;

        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
            ez[i, j] =
                ca * ez[i, j]
                +
                cb * (hy[i, j] - hy[i - 1, j])
                -
                cb * (hx[i, j] - hx[i, j - 1])
                ;

        if (i == sourceX && j == sourceY)
            ez[i, j] += sourceValue;
    }

Obviously, the if in this kernel is bad, but even the resulting pipeline stall shouldn't cause such an extreme performance delta.

The only other thing that jumps out at me is that I'm using a lame grid/block allocation scheme - i.e., the grid is the size of the array to be updated, and each block is one thread. I'm sure this has some impact on performance, but I can't see it dropping the GPU to a quarter of the speed of the CL code running on the CPU. ARGH!
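
To make that concrete, for a hypothetical field size (not my actual grid), the current launch works out to one block per cell:

    // Hypothetical field size, purely for illustration.
    int width = 1024, height = 1024;

    // Current scheme: one block per cell, one thread per block, so this
    // launch configuration creates 1,048,576 blocks of a single thread each.
    var blockSize = new dim3(1);
    var gridSize = new dim3(width, height);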

3Dave
  • Do you have some code sample you can share? – Eric Bainville May 07 '13 at 23:13
  • @EricBainville Sure - do you want the C#, the CUDA or CL kernels, or what? (It's a semi-mid-sized app. I don't want to paste 20k lines of code into SO) – 3Dave May 07 '13 at 23:16
  • I don't see any indication that the cuda kernel is using more than 1 thread per block (there is no use of `threadIdx.x` or `threadIdx.y`). Furthermore, the launch is specifying 1 thread per block. That means approximately 97% of the GPU capability is unused. I don't know much about Cudafy, so I don't know if you have control over this, but I'm not at all surprised that the CUDA code doesn't run impressively fast. – Robert Crovella May 08 '13 at 00:08
  • When posting questions about performance, please either post a reproducible example or annotate the source with gridDim, blockDim, and the number of times all loops are executed. Launching blocks of 1 thread is unlikely to allow the CPU implementation to vectorize the code. On NVIDIA GPUs you will be executing at much less than 1/32 compute efficiency, and on AMD GPUs you will be executing at less than 1/64 compute efficiency. I recommend you profile the GPU code. – Greg Smith May 08 '13 at 00:14
  • @RobertCrovella As a naive test I set the block size to array width x 1, and the grid size to 1 x array height, and got about a 300% speedup. I'm adding a feedback loop to optimize those parameters. Obviously there is a LOT of room for improvement here. If y'all want to convert your comments to answers, I'll upvote and accept. – 3Dave May 08 '13 at 19:02
  • Obviously, if you launch a grid of X threads and only use 1, you will lose performance. But I think that in order to get optimal results you should convert your code to 2D code and actively use the grid capabilities. You will then get coalesced memory accesses inside each grid, giving you even bigger speedups. – DarkZeros May 15 '13 at 12:41
  • @DarkZeros Thanks for the insight. However, that's not at all "obvious" unless you are very familiar with the GPU architecture. – 3Dave May 28 '13 at 02:07
  • @GregSmith Could you please expand on what you mean by vectorizing the code? I'm currently reading "CUDA by Example" and just did a search in it, but nothing has shown up. Why is there such a big performance hit with blocks of 1 thread? – Hans Rudel Jun 04 '13 at 16:16
  • @HansRudel For NVIDIA GPUs, see [SIMT Architecture](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation) in the CUDA C Programming Guide for more information on the SIMT architecture and execution model. On compute capability 1.0-3.5 devices the WARP_SIZE is 32 threads. Specifying a block with 1 thread will result in the hardware executing 1 warp with 1 active thread and 31 inactive threads, resulting in 3% efficiency. AMD GPUs manage and execute instructions in groups of 64 threads called wavefronts. – Greg Smith Jun 05 '13 at 00:12
  • If I may ask, how did you set up the environment to develop C# CUDA programs on OS X? – Marin Mar 22 '15 at 23:56

1 Answer


Answering this to get it off the unanswered list.

The code posted indicates that the kernel launch is specifying a threadblock of 1 (active) thread. This is not the way to write fast GPU code, as it will leave most of the GPU capability idle.

Typical threadblock sizes should be at least 128 threads per block (in multiples of 32), and larger is often better, up to the hardware limit of 512 or 1024 threads per block depending on the GPU.

The GPU "likes" to hide latency by having a lot of parallel work "available". Specifying more threads per block assists with this goal. (Having a reasonably large number of threadblocks in the grid may also help.)

Furthermore, the GPU executes threads in groups of 32 (warps). Specifying only 1 thread per block, or a non-multiple of 32, will leave idle execution slots in every threadblock that gets executed; 1 thread per block is particularly bad.
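
As a rough sketch of what that looks like with the Cudafy code from the question (the 16x16 tile is just a reasonable starting point, not a tuned value), the kernel indexes the field with threadIdx as well as blockIdx, and the launch covers the field with 256-thread blocks:

    [Cudafy]
    public static void CudaUpdateEz(
        GThread thread, float time, float ca, float cb,
        int sourceX, int sourceY, float sourceValue,
        float[,] hx, float[,] hy, float[,] ez)
    {
        // Map each thread (not each block) to one cell of the field.
        int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        int j = thread.blockIdx.y * thread.blockDim.y + thread.threadIdx.y;

        // The existing bounds check also discards any extra threads in the
        // partially filled blocks at the edges of the field.
        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
            ez[i, j] = ca * ez[i, j]
                     + cb * (hy[i, j] - hy[i - 1, j])
                     - cb * (hx[i, j] - hx[i, j - 1]);

        if (i == sourceX && j == sourceY)
            ez[i, j] += sourceValue;
    }

    // Launch: 16x16 = 256 threads per block (a multiple of the 32-thread warp),
    // with enough blocks to cover the whole field.
    var blockSize = new dim3(16, 16);
    var gridSize = new dim3(
        (_gpuEz.Field.GetLength(0) + 15) / 16,
        (_gpuEz.Field.GetLength(1) + 15) / 16);

    Gpu.Launch(gridSize, blockSize).CudaUpdateEz(
        Time, ca, cb,
        source.Position.X, source.Position.Y, source.Value,
        _gpuHx.Field, _gpuHy.Field, _gpuEz.Field);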

Robert Crovella