using templates for efficient pixel operations

Question

I would like to build a set of pixel conversion routines that I can add together to efficiently transform pixels in an inner loop.

So, for example

template<typename op1, typename op2, .... typename opN> void trans(uint8_t* data, uint32_t w, uint32_t h){
   uint64_t index = 0;
   for(uint32_t j = 0; j < h; ++j)
     for(uint32_t i = 0; i < w; ++i) {
         data[index] = opN(opN-1 .....op1(data[index]));
         index++;
     }
}

and I want the compiler to inline all ops so that this is efficient as if I explicitly defined the operations by hand in the inner loop.

Is this possible? If I use the inline keyword and the ops are simple, can I guarantee that the compiler will remove the function calls and inline everything?

Somewhat related: `inline` isn't as useful a performance tool as you'd like. The only times an optimizing compiler doesn't ignore it outright are the times that it would have inlined the function anyway. Longer discussion: [When should I write the keyword 'inline' for a function/method?](https://stackoverflow.com/questions/1759300/when-should-i-write-the-keyword-inline-for-a-function-method) - user4581301, Aug 10 '21 at 21:34

Fatih BAKIR · Accepted Answer · 2021-08-10T22:40:24.653

Guaranteed inlining is not possible in standard C++. However, if you are willing to use a compiler extension, it is doable. GCC and Clang provide an attribute called flatten that is supposed to do this*:

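#include <cstddef>
#include <cstdint>

template <class... Ops>
[[gnu::flatten]] // Everything in the body will be inlined if possible
void process(std::uint8_t* data, int w, int h, Ops&&... ops) {
    std::size_t index = 0; // wide enough for large images
    for (int i = 0; i < h; ++i) {
        for (int j = 0; j < w; ++j) {
            // Apply each op in turn, left to right, to the same pixel
            ((data[index] = ops(data[index])), ...);
            ++index;
        }
    }
}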

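The fold expression should be okay: the built-in comma operator sequences its left operand before its right one, so the ops are applied to each pixel in the order they were passed.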

Use and disassembly here: https://godbolt.org/z/x945nM8v8
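
For readers who cannot open the link, here is a minimal usage sketch. It repeats the process template so that it compiles on its own (C++17, GCC or Clang assumed). The ops invert and halve and the helper demo are hypothetical names made up for illustration; they are not taken from the linked example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same shape as the process() template in the answer, repeated here so the
// sketch is self-contained. [[gnu::flatten]] is a GCC/Clang extension.
template <class... Ops>
[[gnu::flatten]]
void process(std::uint8_t* data, int w, int h, Ops&&... ops) {
    std::size_t index = 0;
    for (int i = 0; i < h; ++i) {
        for (int j = 0; j < w; ++j) {
            // Comma fold: each op runs in turn, left to right, on one pixel.
            ((data[index] = ops(data[index])), ...);
            ++index;
        }
    }
}

// Hypothetical ops, invented for this example.
inline constexpr auto invert = [](std::uint8_t p) -> std::uint8_t {
    return static_cast<std::uint8_t>(255 - p);
};
inline constexpr auto halve = [](std::uint8_t p) -> std::uint8_t {
    return static_cast<std::uint8_t>(p / 2);
};

// Hypothetical helper: runs invert, then halve, over a 2x2 image in one pass.
inline std::vector<std::uint8_t> demo() {
    const int w = 2, h = 2;
    std::vector<std::uint8_t> img{0, 55, 155, 255};
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    process(img.data(), w, h, invert, halve);
    return img;
}
```

Calling demo() yields the pixels 127, 100, 50, 0: each value is inverted first and halved second. Passing the ops in the opposite order would give different results, which is why the left-to-right evaluation of the fold matters.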

However, I'd recommend profiling with and without flatten to make sure it actually improves performance in real applications. Flattening may increase I-cache misses, actually hurting your performance.
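
A rough sketch of such a comparison, assuming C++17 and GCC or Clang. The names process_flat, process_plain, time_us, Timings, and compare are all hypothetical, and the loop takes a flat pixel count n instead of w and h for brevity. A single wall-clock run like this is noisy; a real measurement would build with optimizations enabled, repeat the runs, and preferably use a proper benchmarking tool or profiler.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Variant with the GCC/Clang flatten attribute.
template <class... Ops>
[[gnu::flatten]]
void process_flat(std::uint8_t* data, std::size_t n, Ops&&... ops) {
    for (std::size_t i = 0; i < n; ++i)
        ((data[i] = ops(data[i])), ...);
}

// Identical loop without the attribute; inlining is left to the optimizer.
template <class... Ops>
void process_plain(std::uint8_t* data, std::size_t n, Ops&&... ops) {
    for (std::size_t i = 0; i < n; ++i)
        ((data[i] = ops(data[i])), ...);
}

// Wall-clock time of a single call to f, in microseconds.
template <class F>
double time_us(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

struct Timings {
    double flat_us;
    double plain_us;
};

// Runs both variants on identical input of n pixels and times each once.
inline Timings compare(std::size_t n) {
    auto invert = [](std::uint8_t p) -> std::uint8_t {
        return static_cast<std::uint8_t>(255 - p);
    };
    auto halve = [](std::uint8_t p) -> std::uint8_t {
        return static_cast<std::uint8_t>(p / 2);
    };

    std::vector<std::uint8_t> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] = static_cast<std::uint8_t>(i & 0xFF);

    Timings t{};
    t.flat_us = time_us([&] { process_flat(a.data(), n, invert, halve); });
    t.plain_us = time_us([&] { process_plain(b.data(), n, invert, halve); });

    // Both variants must compute the same pixels; only speed may differ.
    assert(a == b);
    return t;
}
```

With simple ops like these, an optimizing compiler may well generate the same code for both variants, in which case the two timings differ only by noise.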

*: The documentation on this is sparse, and I am not 100% sure whether it actually applies to the template-based use we have here.

Thanks. Since my main target OS is Linux, I would be happy with flatten if it works as expected. I am creating this framework to make better use of the memory cache on the system; currently I have to go through the data multiple times to perform various operations. So I would perhaps sacrifice I-cache for better L3 cache performance. - Jacko, Aug 10 '21 at 23:05
