Here are two pieces of code from an OpenCL kernel I'm working on; they have vastly different run times.
The code is rather complicated, so I've simplified it right down.
This version runs in under one second:
for (int ii = 0; ii < someNumber; ii++)
{
    for (int jj = 0; jj < someNumber2; jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a, b, c);
        double nothing = value1;
    }
}
and this version takes around 38 seconds to run:
for (int ii = 0; ii < someNumber; ii++)
{
    for (int jj = 0; jj < someNumber2; jj++)
    {
        value1 = value2 + value3;
        value1 = value1 * someFunction(a, b, c);
    }
    double nothing = value1;
}
As I say, the real code is somewhat more complicated than this (there are lots of other things going on in the loops), but the only difference between the two versions really is that the declaration of "nothing" moves from immediately before to immediately after the inner loop's closing brace.
I'm very new to OpenCL, and I can't work out what is going on, much less how to fix it. Needless to say, the slow case is actually what I need in my implementation. I've tried messing around with address spaces (all variables here are in __private).
I can only imagine that for some reason the GPU is pushing the variable "value1" off into slower memory when the brace closes. Is this a likely explanation? What can I do?
Thanks in advance!
UPDATE: This also runs in under one second, but uncommenting either line makes it extremely slow again. No other changes were made to the loops, and value1 is still declared in the same place as before.
for (int ii = 0; ii < someNumber; ii++)
{
    for (int jj = 0; jj < someNumber2; jj++)
    {
        // value1 = value2 + value3;
        // value1 = value1 * someFunction(a, b, c);
    }
    double nothing = value1;
}
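One thing that might be worth ruling out is that the fast versions are fast only because the compiler has removed work whose result is never used; a compiler is allowed to do that, though I have not confirmed it is what the OpenCL compiler does here. Below is a minimal sketch in plain C (not the actual kernel) with the same loop shape, where the final value is stored to an output array so it cannot be discarded. The names some_function, run_loops and out are hypothetical stand-ins, and the body of some_function is invented purely for illustration.

```c
/* Hypothetical stand-in for someFunction; the real one is not shown. */
static double some_function(double a, double b, double c)
{
    return a * b + c;
}

/* Same loop shape as the snippets above, but the value computed in the
   inner loop is written to `out`, so the result is observable and the
   loop body cannot be treated as dead code. */
void run_loops(int some_number, int some_number2,
               double value2, double value3,
               double a, double b, double c,
               double *out)
{
    double value1 = 0.0;
    for (int ii = 0; ii < some_number; ii++)
    {
        for (int jj = 0; jj < some_number2; jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * some_function(a, b, c);
        }
        out[ii] = value1; /* takes the place of `double nothing = value1;` */
    }
}
```

If the timings of the two arrangements converge once the result is stored like this, the difference was probably down to eliminated work rather than slower memory.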
UPDATE 2: The code was actually nested in another loop like this, with the declaration of value1 as shown:
double value1 = 0;
for (int kk = 0; kk < someNumber3; kk++)
{
    for (int ii = 0; ii < someNumber; ii++)
    {
        for (int jj = 0; jj < someNumber2; jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * someFunction(a, b, c);
        }
        double nothing = value1;
    }
}
Moving the declaration of value1 inside the outer loop also gets us back to the speedy case:
for (int kk = 0; kk < someNumber3; kk++)
{
    double value1 = 0;
    for (int ii = 0; ii < someNumber; ii++)
    {
        for (int jj = 0; jj < someNumber2; jj++)
        {
            value1 = value2 + value3;
            value1 = value1 * someFunction(a, b, c);
        }
        double nothing = value1;
    }
}
It seems OpenCL is an exceedingly tricky art! I still don't really understand what is going on, but at least I know how to fix it now!
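For anyone who wants to experiment with the arrangement that worked for me, here is a self-contained sketch in plain C rather than OpenCL C. It only illustrates where value1 is declared; it makes no claim about why the OpenCL compiler treats the two placements differently. The names some_function, run_nested and total are hypothetical, and the accumulator is an assumption added only so the function returns something that can be checked.

```c
/* Hypothetical stand-in for someFunction; the real one is not shown. */
static double some_function(double a, double b, double c)
{
    return a + b + c;
}

/* value1 is declared inside the kk loop, so each kk iteration gets a
   fresh variable whose lifetime ends with that iteration. */
double run_nested(int some_number3, int some_number, int some_number2,
                  double value2, double value3,
                  double a, double b, double c)
{
    double total = 0.0; /* assumption: keeps the results observable */
    for (int kk = 0; kk < some_number3; kk++)
    {
        double value1 = 0.0;
        for (int ii = 0; ii < some_number; ii++)
        {
            for (int jj = 0; jj < some_number2; jj++)
            {
                value1 = value2 + value3;
                value1 = value1 * some_function(a, b, c);
            }
            total += value1; /* takes the place of `double nothing = value1;` */
        }
    }
    return total;
}
```

Narrowing the scope like this limits how long the variable has to be kept alive, which is at least consistent with the speed-up I saw.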