Can I calculate Speedup for OpenCL kernels with templates and std::index_sequence?

Question

tldr; How do I implement a for loop that runs a timed function with std::index_sequence?

Okay, I'll admit that title is a little cryptic but I was looking at this question: is that possible to have a for loop in compile time with runtime or even compile?

And I may have gotten too excited with what I could possibly do with std::index_sequence. I'll explain what my goal is. I want something like the following code:

for(int i = 1; i < 100000; ++i) 
{
    auto start = time();
    runOpenCL<i>();
    std::cout << time() - start << std::endl;
}

to compile to this (with the timers for each one):

runOpenCL<1>();
runOpenCL<2>();
runOpenCL<3>();
...
runOpenCL<100000>();

I thought this should just work, since compilers often unroll for loops at compile time (if that's the right phrase). However, I understand that templates reject this kind of code because i is not a compile-time constant. I saw that std::index_sequence could get around that, but I don't understand template code well enough to figure out what's going on.

Before anyone says I could just make it a normal function parameter: yes, I could. Here is the function itself:

    template<int threadcount>
    INLINE void runOpenCL()
    {
        constexpr int itemsPerThread = (MATRIX_HEIGHT + threadcount - 1) / threadcount;
    
        // executing the kernel
        clObjs.physicsKernel.setArg(2, threadcount);
        clObjs.physicsKernel.setArg(3, itemsPerThread);
    
        clObjs.queue.enqueueNDRangeKernel(clObjs.physicsKernel, cl::NullRange, cl::NDRange(threadcount), cl::NullRange);
        clObjs.queue.finish();
        
        // making sure OpenGL is finished with its vertex buffer
        glFinish();
        
        // acquiring the OpenGL object (vertex buffer) for OpenCL use
        const std::vector<cl::Memory> glObjs = { clObjs.glBuffer };
        clObjs.queue.enqueueAcquireGLObjects(&glObjs);
        
        // copying the OpenCL buffer to the BufferGL
        clObjs.queue.enqueueCopyBuffer(clObjs.outBuffer, clObjs.glBuffer, 0, 0, planets_size_points);
    
        // releasing the OpenGL object
        clObjs.queue.enqueueReleaseGLObjects(&glObjs);
    }

But I don't want to. Do I need a better reason? I think it would be really cool to implement this, provided it is still readable in the end.

The answers show possible solutions that will work for "not too large" values, but I'm not sure you have much to gain by using a template here instead of passing the threadcount as a runtime parameter. – Holt, Oct 26 '21 at 12:41
The technique you are asking for is potentially useful, but the sample code strikes me as a bad use case for it. Bloating the executable size used by that function by a factor of `10000` doesn't sound like it's worth saving one integer addition and one integer division per iteration, especially considering how much work the function does. On top of that, compilers already do this kind of optimisation in a more fine-grained manner, and it prevents other potential optimisations (like auto-vectorization). – , Oct 26 '21 at 12:43
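For comparison, the runtime-parameter alternative mentioned in these comments could look roughly like the sketch below. It only illustrates the ceiling division from the question; `kMatrixHeight`, `itemsPerThread` and `runOpenCLRuntime` are hypothetical names, and the OpenCL calls are left out.

```cpp
// Hypothetical stand-in for the MATRIX_HEIGHT constant used in the question.
constexpr int kMatrixHeight = 1000;

// Ceiling division: the smallest item count per thread such that
// threadcount * itemsPerThread(height, threadcount) >= height.
constexpr int itemsPerThread(int height, int threadcount) {
    return (height + threadcount - 1) / threadcount;
}

// Runtime-parameter variant: threadcount is an ordinary argument,
// so a plain for loop can drive it without any templates.
int runOpenCLRuntime(int threadcount) {
    const int items = itemsPerThread(kMatrixHeight, threadcount);
    // The setArg / enqueue calls from the question would go here (omitted).
    return items;
}

// The same formula can still be checked at compile time when needed.
static_assert(itemsPerThread(1000, 3) == 334, "1000 items over 3 threads");
```

With this shape, the per-iteration cost the template version saves is the single addition and division inside `itemsPerThread`.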

Holt · Answer 1 · 2021-10-26T12:57:37.317

1

Here is a possible version that unrolls the loop using a C++17 fold expression:

#include <cstddef>
#include <utility>

template <std::size_t I>
void runOpenCL();

template <std::size_t... Is>
void runAllImpl(std::index_sequence<Is... >) {
    // thanks @Franck for the better fold expression
    (runOpenCL<Is>(), ...);
}

void runAll() {
    runAllImpl(std::make_index_sequence<10000>{});
}

Without C++17 you can do something like this, but in a non-optimized build it will use a huge amount of stack:

#include <cstddef>
#include <utility>

template <std::size_t I>
void runOpenCL();

template <std::size_t... Is>
void runAllImpl(std::index_sequence<Is... >) {
    int arr[]{ (runOpenCL<Is>(), 0)... };
    (void)arr;
}

void runAll() {
    runAllImpl(std::make_index_sequence<10000>{});
}

This seems to work with larger values than @康桓瑋's proposal, but GCC (at least) does not manage to compile it for 1000000 (10000 is "ok").
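Neither snippet above includes the per-call timers the question asks for. A minimal sketch of how they might be added with std::chrono follows. The `runOpenCL` body here is only a placeholder doing busy work, and `timeOne`, `timeAllImpl` and `timeAll` are hypothetical helper names. The `Is + 1` offset is there because std::make_index_sequence<N> starts at 0, while the question starts at 1.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Placeholder for the question's kernel launcher; it only does busy work.
template <std::size_t ThreadCount>
void runOpenCL() {
    volatile std::size_t sink = 0;
    for (std::size_t i = 0; i < 1000 * ThreadCount; ++i) sink = sink + i;
}

// Times one instantiation and returns the elapsed time in microseconds.
template <std::size_t ThreadCount>
double timeOne() {
    const auto start = std::chrono::steady_clock::now();
    runOpenCL<ThreadCount>();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}

// The comma fold expands, in order, to
// timeOne<1>(), timeOne<2>(), ..., timeOne<sizeof...(Is)>().
template <std::size_t... Is>
std::vector<double> timeAllImpl(std::index_sequence<Is...>) {
    std::vector<double> timings;
    timings.reserve(sizeof...(Is));
    (timings.push_back(timeOne<Is + 1>()), ...);
    return timings;
}

// Returns one timing per thread count from 1 to N.
template <std::size_t N>
std::vector<double> timeAll() {
    return timeAllImpl(std::make_index_sequence<N>{});
}
```

Calling `timeAll<8>()` would return eight timings, one per thread count from 1 to 8. The compile-time limits on large packs mentioned above apply here as well.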

edited Oct 26 '21 at 12:57

answered Oct 26 '21 at 12:40

Holt


@Frank That's true, although using an empty struct with a `std::initializer_list` does not seem to fix the problem (at least according to the generated assembly with gcc). – Holt Oct 26 '21 at 12:53
@Frank [Re-commenting after your edit] Your godbolt link seems to have similar problems regarding the stack, at least from what I can understand. – Holt Oct 26 '21 at 12:54
@Frank Thanks, I did something similar with a `+` fold expression (a bit less clever than yours), but then GCC does not manage to compile for 10000 (even if I don't think it's a good idea anyway... ). – Holt Oct 26 '21 at 12:58

score 0 · Answer 2 · answered Oct 26 '21 at 12:29

0

You can generate a fixed-size function table at compile time and invoke the corresponding function in the table through a runtime index. For example:

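#include <array>
#include <cstddef>
#include <utility>

// Defined elsewhere, as in the question.
template<std::size_t I>
void runOpenCL();

// Requires C++20 (explicit template parameter list on a lambda).
template<std::size_t N>
constexpr auto gen_func_table = []<std::size_t... Is>
  (std::index_sequence<Is...>) {
  // Entry I is a function pointer that calls runOpenCL<I>().
  return std::array{+[] { runOpenCL<Is>(); }...};
}(std::make_index_sequence<N>{});

int main() {
  constexpr std::size_t max_count = 100;
  constexpr auto& func_table = gen_func_table<max_count>;
  // Index 0 is skipped, so the calls are runOpenCL<1>() .. runOpenCL<99>().
  for(std::size_t i = 1; i < max_count; ++i)
    func_table[i]();
}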

Demo.
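Since the table is indexed at runtime, the timing from the question fits into an ordinary loop. The sketch below is one way this might look, written against C++17 with plain function pointers instead of the C++20 lambda above. The `runOpenCL` body is a placeholder, `makeTable`, `funcTable` and `measureSpeedup` are hypothetical names, and "speedup" is assumed to mean the time for one thread divided by the time for n threads.

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Placeholder for the question's kernel launcher: the busy work shrinks
// as ThreadCount grows, to mimic a workload that parallelises.
template <std::size_t ThreadCount>
void runOpenCL() {
    volatile std::size_t sink = 0;
    for (std::size_t i = 0; i < 1000000 / ThreadCount; ++i) sink = sink + i;
}

using Fn = void (*)();

// Builds the table at compile time; entry i points at runOpenCL<i + 1>.
template <std::size_t... Is>
constexpr std::array<Fn, sizeof...(Is)> makeTable(std::index_sequence<Is...>) {
    return {{ &runOpenCL<Is + 1>... }};
}

template <std::size_t N>
constexpr auto funcTable = makeTable(std::make_index_sequence<N>{});

// Returns speedup[i] = (time with 1 thread) / (time with i + 1 threads).
template <std::size_t N>
std::vector<double> measureSpeedup() {
    std::vector<double> speedup;
    speedup.reserve(N);
    double baseline = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        const auto start = std::chrono::steady_clock::now();
        funcTable<N>[i]();
        const auto stop = std::chrono::steady_clock::now();
        const double secs = std::chrono::duration<double>(stop - start).count();
        if (i == 0) baseline = secs;
        // Guard against a zero reading from a coarse clock.
        speedup.push_back(secs > 0.0 ? baseline / secs : 0.0);
    }
    return speedup;
}
```

Calling `measureSpeedup<4>()` would give four ratios, the first being the baseline compared with itself. A single run per thread count is noisy, so real measurements would likely need repetitions and averaging.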

answered Oct 26 '21 at 12:29

康桓瑋


1

GCC has a hard time compiling this for large values of `max_count`. But this seems to be the case with any pack expansion when `sizeof...(Is)` is large. – Holt Oct 26 '21 at 12:37
@Holt. Yes, this only applies if the size of the pack expansion is not too large. – 康桓瑋 Oct 26 '21 at 12:39
