OpenCL: Why does the performance differ so greatly between these two cases?

Question

Here are two pieces of code from an OpenCL kernel I'm working on; their run times differ enormously.

The code is rather complicated, so I've simplified it right down.

This version runs in under one second:

for (int ii=0; ii<someNumber;ii++)
{
    for (int jj=0; jj<someNumber2;jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a,b,c);
        double nothing = value1;
    }
}

and this version takes around 38 seconds to run:

for (int ii=0; ii<someNumber;ii++)
{
    for (int jj=0; jj<someNumber2;jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a,b,c);
    }
    double nothing = value1;
}

As I say, the code is somewhat more complicated than this (there's lots of other things going on in the loops), but the variable "nothing" really does move from immediately before to immediately after the brace.

I'm very new to OpenCL, and I can't work out what is going on, much less how to fix it. Needless to say, the slow case is actually what I need in my implementation. I've tried messing around with address spaces (all variables here are in __private).

I can only imagine that for some reason the GPU is pushing the variable "value1" off into slower memory when the brace closes. Is this a likely explanation? What can I do?

Thanks in advance!

UPDATE: This also runs in under one second, but uncommenting either line brings back the extreme slowness. No other changes were made to the loops, and value1 is still declared in the same place as before.

for (int ii=0; ii<someNumber;ii++)
{
    for (int jj=0; jj<someNumber2;jj++)
    {
//        value1 = value2 + value3;
//        value1 = value1 * someFunction(a,b,c);
    }
    double nothing = value1;
}

UPDATE 2: The code was actually nested in another loop like this, with the declaration of value1 as shown:

double value1=0;
for (int kk=0; kk<someNumber3;kk++)
{
    for (int ii=0; ii<someNumber;ii++)
    {
        for (int jj=0; jj<someNumber2;jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * someFunction(a,b,c);
        }
        double nothing = value1;
    }
}

Moving where value1 is declared also gets us back to the speedy case:

for (int kk=0; kk<someNumber3;kk++)
{
    double value1=0;
    for (int ii=0; ii<someNumber;ii++)
    {
        for (int jj=0; jj<someNumber2;jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * someFunction(a,b,c);
        }
        double nothing = value1;
    }
}

It seems OpenCL is an exceedingly tricky art! I still don't really understand what is going on, but at least I know how to fix it now!

That is pretty strange. Are you sure you need to use the slower version? From these snippets they look functionally identical. — Chriszuma, Oct 07 '11 at 16:13
Thanks for your reply. Yeah I'm sure, but you're right that the examples I've given are functionally identical. The code in the inner braces ought to have a +=. — carthurs, Oct 07 '11 at 16:22
I don't see any reason the second one should be slower based on those code snippets. I would guess that moving the assignment must have side effects somewhere, such as increased branching (one work unit executes the `if`, the next executes the `else`), which can really slow down the GPU. — Steve Blackwell, Oct 07 '11 at 16:26
Also, I know you said that all the vars are `__private`, but if you're syncing with global memory at all, you might be breaking coalesced access to memory. Optimizing OpenCL memory access can be tricky: http://stackoverflow.com/questions/3841877/what-is-a-bank-conflict-doing-cuda-opencl-programming/3842483#3842483 Just throwing out some ideas. :) — Steve Blackwell, Oct 07 '11 at 16:35
This is giving me plenty to think about. I've updated my question with something that may shed more light. The code has no `if`s. I suspect this will turn out to be a coalescing issue. — carthurs, Oct 07 '11 at 16:50
Now that I think about it, there may be some implicit `if`s, such as in the second clause of the `for` loops. As long as `someNumber*` is the same for all threads, however, I don't think that comparison should become a problem. — Steve Blackwell, Oct 07 '11 at 20:47
Also, as a rule of thumb for debugging, I'd suggest starting with small, simple kernels. Once they're loaded to the GPU, kernels are designed to start fast. I've had problems like this before where some odd line of code, which itself was not slow, somehow slowed down the whole thing tremendously. But it's easier to get a sense of the state of the compute nodes when the GPU codes are simpler and self-contained. It can be really hard to tell what the hardware is doing with those triple-nested `for` loops. HTH — Steve Blackwell, Oct 07 '11 at 20:49
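The point about loop bounds being the same for all threads can be illustrated with a deliberately simplified model, written here in plain C rather than OpenCL C. It assumes that the lanes of one hardware group step through a loop together, so the group keeps iterating until its slowest lane is done. The function names and the trip-count arrays are hypothetical and are not taken from the question.

```c
#include <stddef.h>

/* Simplified model (an assumption, not a hardware specification):
   lanes in one group execute a loop in lockstep, so the group runs
   for as many iterations as its longest lane needs. Lanes that
   finish early sit idle until then. */
int lockstep_iterations(const int *trip_counts, size_t lanes)
{
    int longest = 0;
    for (size_t i = 0; i < lanes; i++) {
        if (trip_counts[i] > longest) {
            longest = trip_counts[i];
        }
    }
    return longest;
}

/* Total lane-iterations that perform useful work. */
int useful_iterations(const int *trip_counts, size_t lanes)
{
    int total = 0;
    for (size_t i = 0; i < lanes; i++) {
        total += trip_counts[i];
    }
    return total;
}
```

With four lanes and uniform bounds of 8, the group runs 8 steps and all 32 lane-iterations are useful. With bounds of 1, 1, 1 and 8, the group still runs 8 steps but only 11 lane-iterations do work, which is why per-thread loop bounds are worth ruling out when a kernel is unexpectedly slow.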

score 4 · Accepted Answer · answered Oct 08 '11 at 17:06

What implementation are you using? I would expect the "double nothing = value1;" to be eliminated as dead code in any of the cases by any reasonable compiler.

arsenm

I think I've found the problem, thanks to your post. In case 1 (the first box from my question), I think the compiler optimises by "eliminating as dead code" the inner loop. In case 2, it realises that the variable `value1` is needed outside the inner loop, so it runs it. The function `someFunction(a,b,c)` is very slow, so this causes the slowdown. FYI the implementation is AMD's SDK for Linux. Thanks for the help everyone! – carthurs Oct 11 '11 at 14:18
You're saying that because value1 is unused, the compiler optimizes away the call to someFunction. How can it be sure someFunction doesn't have a side effect? – vocaro Oct 11 '11 at 23:38
Because 'nothing' is unused. I wasn't talking about value1. – arsenm Oct 12 '11 at 00:26
@vocaro I'm not certain, but it's the best explanation I could find. It's actually a very simple function, it just hits the memory really hard (my own implementation of sparse matrix access). I don't know anything about compilers, but since it's really just a complicated de-reference, perhaps the compiler notices? Of course, it could be something completely different, but I have no idea what! – carthurs Oct 25 '11 at 21:09
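The dead-code explanation discussed in these comments can be sketched in plain C. This is not the asker's kernel: `some_function` is a hypothetical stand-in for `someFunction`, assumed to be a pure lookup with no side effects, and the two `accumulate_*` functions are made-up names. Whether a given compiler actually removes the first loop depends on the implementation and optimization settings.

```c
#include <stddef.h>

/* Hypothetical stand-in for someFunction: a pure lookup with no side
   effects, so calls whose result is unused may legally be removed. */
static double some_function(const double *table, size_t n, size_t i)
{
    return table[i % n];
}

/* The result never leaves the function, so an optimizer is free to
   delete the loops entirely. Timing this measures very little. */
void accumulate_discarded(const double *table, size_t n,
                          int outer, int inner)
{
    for (int ii = 0; ii < outer; ii++) {
        double value1 = 0.0;
        for (int jj = 0; jj < inner; jj++) {
            value1 += some_function(table, n, (size_t)jj);
            double nothing = value1;  /* dead store */
            (void)nothing;
        }
    }
}

/* The result is written to caller-visible memory, so the work has to
   be performed. In a kernel this would be a store to an output buffer. */
void accumulate_stored(const double *table, size_t n,
                       int outer, int inner, double *out)
{
    for (int ii = 0; ii < outer; ii++) {
        double value1 = 0.0;
        for (int jj = 0; jj < inner; jj++) {
            value1 += some_function(table, n, (size_t)jj);
        }
        out[ii] = value1;
    }
}
```

When comparing timings of two kernel variants, making each one store its result somewhere observable is one way to check that both are doing the same amount of work.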

DarkZeros · Answer 2 · 2011-10-08T16:54:52.897

The first case is effectively a single loop (after compiler optimizations), but the second is a loop with a nested loop. That is the big issue: lots of checks against global/local variables. (Are you sure they are private? Did you declare all of them inside the kernel?)

I recommend saving someNumber and someNumber2 to private variables before starting the loop, so that each loop check reads private data. In my experience, every variable used in the condition of an OpenCL loop should be private. This can save up to 80% of global memory accesses, especially if the loop is very short or simple.

As an example, this should run fast:

int c_someNumber = someNumber;
for (int ii=0; ii<c_someNumber;ii++)
{
    int c_someNumber2 = someNumber2;
    for (int jj=0; jj<c_someNumber2;jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a,b,c);
    }
    double nothing = value1;
}

EDIT: Also, value1 SHOULD be cached in private memory. (As you did in your last edit)

Yep, everything is saved as private. See my explanation for the problem as a comment on the other answer! — carthurs, Oct 11 '11 at 14:20
