How to optimize SYCL kernel

Question

I'm studying SYCL at university and I have a question about the performance of some code. In particular, I have this C/C++ code:

I need to translate it into a parallel SYCL kernel, and this is what I wrote:

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;

constexpr int size = 131072; // 2^17

int main(int argc, char** argv) {
    // Create a vector with size elements and initialize them to 1
    std::vector<float> dA(size, 1.0f);
    try {
        queue gpuQueue{ gpu_selector{} };
        // The buffer copies its contents back to dA when it goes out of scope
        buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
        gpuQueue.submit([&](handler& cgh) {
            accessor inA{ bufA, cgh };
            // One work-item per element: add 2 in place
            cgh.parallel_for(range<1>(size),
                [=](id<1> i) { inA[i] = inA[i] + 2; }
            );
        });
        gpuQueue.wait_and_throw();
    }
    catch (const std::exception& e) {
        // Report the error instead of rethrowing a sliced copy
        std::cerr << e.what() << '\n';
        return 1;
    }
    return 0;
}

So my question is about the value c. Here I use the literal 2 directly: will that affect performance when I run the code? Do I need to create a variable instead, or is this approach correct and the performance good?

Thanks in advance for the help!

score 1 · Accepted Answer · answered Jul 01 '22 at 08:02

Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel - this is as efficient as it gets, I think! There's the slight complication that you have an implicit cast from int to float. My guess is that you'll probably end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or cast at runtime or anything like that, it just lives in the instruction.
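To make the int-to-float point concrete, here is a minimal host-only sketch in plain C++ (no SYCL needed). It only illustrates the language rule, not what any particular device compiler emits; the function names are made up for the illustration.

```cpp
#include <cassert>
#include <type_traits>

// Same arithmetic as the kernel body, written with an int and a float literal.
// (Hypothetical helper names, not part of the question's code.)
inline float add_int_literal(float x)   { return x + 2; }
inline float add_float_literal(float x) { return x + 2.0f; }

// `float + int` is a float expression: the int operand is converted first.
static_assert(std::is_same<decltype(1.0f + 2), float>::value,
              "float + int literal has type float");

// Hypothetical helper: checks that both spellings give the same value.
inline void check_literals() {
    assert(add_int_literal(1.0f) == 3.0f);
    assert(add_int_literal(1.0f) == add_float_literal(1.0f));
}
```

Since 2 is exactly representable as a float, writing 2 or 2.0f makes no difference to the result.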

Equally, if you had:

constexpr int c = 2;
// the rest of your code
[=](id<1> i) { inA[i] = inA[i] + c; }
// etc

The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.
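As a rough host-side check of the same idea, the sketch below runs the literal version and the constexpr version over a std::vector and compares the results. The apply helper is a made-up stand-in for parallel_for, so this says nothing about the generated device code; it only shows that the two spellings are the same computation.

```cpp
#include <cassert>
#include <vector>

constexpr int c = 2;  // same constant as in the snippet above

// The constant folds at compile time: 1.0f + c is already 3.0f here.
static_assert(1.0f + c == 3.0f, "c converts to the float value 2.0");

// Hypothetical host stand-in for the kernel launch: applies op to each element.
template <typename Op>
std::vector<float> apply(std::vector<float> v, Op op) {
    for (float& x : v) x = op(x);
    return v;
}

// Kernel body written with the literal 2.
inline std::vector<float> with_literal(const std::vector<float>& v) {
    return apply(v, [=](float x) { return x + 2; });
}

// Kernel body written with the constexpr variable c.
inline std::vector<float> with_constant(const std::vector<float>& v) {
    return apply(v, [=](float x) { return x + c; });
}
```

Calling with_literal and with_constant on the same input should give element-wise identical vectors.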

I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:

  %5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
  %add.i = fadd float %5, 2.000000e+00
  store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17

This shows a float load and a float store at the same address, with an 'add 2.0' instruction in between. If I modify the code to use the variable c as shown above, I get the same LLVM IR.

Conclusion: you've already achieved maximum efficiency, and compilers are smart!

Thanks for the answer and for the explanation! I didn't understand the difference between the two versions we can write, without or with the constant, but with your answer everything is clear. (comment, Jul 01 '22 at 13:21)
