1

I have a big function, which for the MWE below is called modify_vec. In it, there are for loops where an operation is performed many times (e.g., for each element of a matrix).

I would like the function to be capable of applying different operations, depending on different given inputs, without using if-else statements within the loop.

From my (limited) understanding of compiler optimizations using SIMD, I would have to: either manually write all possible combinations of the operations for the function, so that the compiler knows at compile-time which operation to optimize for; or using templates somehow.

But I do not know how to perform this using templates. Instead, I've tried with std::map<const int, const std::function>.

Below are some code snippets and a MWE, which can also be seen in Compiler Explorer: https://godbolt.org/z/z777aq5ab.

Bonus question: since none of my actual vectors are defined at compile-time, does "constexpr" even help?


Possible solution based on jwezorek's answer below



A version of modify_vec that only works for a specific order of operations, where p is an argument common to all possible operations.

void modify_vec(std::vector<double> &inp_vec,
                const std::vector<double> &p) {
    const auto n = inp_vec.size();
    for (auto i=0; i<n; ++i) {
        inp_vec[i] = inp_vec[i]*p[0]+3.;  // this is operation 1 in the MWE
    }
    for (auto i=0; i<n; ++i) {
        inp_vec[i] = inp_vec[i]/p[1]-4.;  // this is operation 2 in the MWE
    }
}

My attempt at generalizing it, with only two distinct operations, although more may be used in my actual problem:

// Are these functions inlined with GCC+11?
constexpr double first_op(const double input, const double p) {
    return input*p+3.;
}
constexpr double second_op(const double input, const double p) {
    return input/p-4.;
}
std::map<const int,
         const std::function<double(const double, const double)>> op {
    {0, first_op},
    {1, second_op}
};
void modify_vec_gen(std::vector<double> &inp_vec,
                    const std::vector<double> &p,
                    const int op1, const int op2) {
    const auto n = inp_vec.size();
    for (auto i=0; i<n; ++i) {
        inp_vec[i] = op[op1](inp_vec[i], p[0]);  // Can SIMD be used here?
    }
    for (auto i=0; i<n; ++i) {
        inp_vec[i] = op[op2](inp_vec[i], p[1]);
    }
}

My main file, main.cpp, contains the above code snippets and the following, compiled with g++ -std=c++11 -O3 -ffast-math -mavx2 main.cpp -o main.

#include <iostream>
#include <vector>
#include <functional>
#include <map>

/* Copy and paste the above snippets here */

template <typename T>
constexpr void print_vec(const std::vector<T> &vec) {
    for (const auto v : vec) {
        std::cout << v << " ";
    }
    std::cout<<std::endl;
}

int main(int argc, char** argv) {
    int op1 = std::stoi(argv[1]);
    int op2 = std::stoi(argv[2]);
    
    auto vec1 = std::vector<double>{2., 5., -3., 10.};
    const auto p = std::vector<double>{3., 2.};
    auto vec2{vec1};
    auto vec3{vec1};

    std::cout << "Initial vector:" << std::endl;
    print_vec(vec1);

    std::cout << "After operations 0 and 1:" << std::endl;
    modify_vec(vec1, p);
    print_vec(vec1);

    std::cout << "After operations " << op1 << " and " << op2 << ":" << std::endl;
    modify_vec_gen(vec2, p, op1, op2);
    print_vec(vec2);

    std::cout << "After operations " << op1 << " and " << op1 << ":" << std::endl;
    modify_vec_gen(vec3, p, op1, op1);
    print_vec(vec3);

    return 0;
}

Running with ./main 0 1 yields:

Initial vector:
2 5 -3 10 
After operations 0 and 1:
0.5 5 -7 12.5 
After operations 0 and 1:
0.5 5 -7 12.5 
After operations 0 and 0:
21 39 -9 69

Possible olution

A possible, based on jwezorek's answer is shown below. In Compiler Explorer: https://godbolt.org/z/K3KefzTYW.

Compiled using -O3 -ffast-math -mavx2 -std=c++11 -fopt-info-vec-all -Wall -pedantic -Werror in gcc, -O3 -ffast-math -mavx2 -std=c++11 -Rpass-analysis=loop-vectorize -Wall -pedantic -Werror in clang, and /O2 /FA /arch:AVX2 /std:c11 /Qvec-report:2 in MSVC.

But MSVC is not able to vectorize the code, only GCC is.

The helper function call_modify_vec is pretty ugly, but, since I will only use two distinct operations, the first having 2 possibilities and the other 4, I guess it's good enough.

#include <iostream>
#include <vector>

#ifdef _WIN32
#define MYRESTRICT __restrict
#else
#define MYRESTRICT __restrict__
#endif

constexpr double first_op(const double input, const double p) {
    return input*p+3.;
}
constexpr double second_op(const double input, const double p) {
    return input/p-4.;
}
using op_fn = double(*)(const double input, const double p);
const op_fn op_vec[] = {first_op, second_op};

template<int OP1, int OP2>
void modify_vec_template(double * MYRESTRICT inp_vec,
                         const std::size_t vec_size,
                         const std::vector<double> &p) {
    std::size_t i;
    for (i=0; i<vec_size; ++i) {
        inp_vec[i] = op_vec[OP1](inp_vec[i], p[0]);
    }
    for (i=0; i<vec_size; ++i) {
        inp_vec[i] = op_vec[OP2](inp_vec[i], p[1]);
    }
}

void call_modify_vec(std::vector<double> &in_vec,
                     const std::vector<double> &p,
                     const int op1, const int op2) {
    if ( (op1 == 0) && (op2 == 0)) {
        modify_vec_template<0, 0>(in_vec.data(), in_vec.size(), p);
    } else if ( (op1 == 0) && (op2 == 1)) {
        modify_vec_template<0, 1>(in_vec.data(), in_vec.size(), p);
    } else if ( (op1 == 1) && (op2 == 0)) {
        modify_vec_template<1, 0>(in_vec.data(), in_vec.size(), p);
    } else if ( (op1 == 1) && (op2 == 1)) {
        modify_vec_template<1, 1>(in_vec.data(), in_vec.size(), p);
    }
}
Breno
  • 748
  • 8
  • 18

3 Answers3

2

First of all, I'd note that although you're currently using a separate loop for each operation, at least with the operations you're currently using, you don't really need a separate loop for each. You could turn them into a single loop, something like this:

    const auto n = inp_vec.size();
    for (auto i=0; i<n; ++i) {
        inp_vec[i] = inp_vec[i]*p[0]+3.;  // this is operation 1 in the MWE
        inp_vec[i] = inp_vec[i]/p[1]-4;
    }

If we want to make those generic, we might rewrite it to something like:

    const auto n = inp_vec.size();
    for (auto i=0; i<n; ++i) {
        inp_vec[i] = op1(inp_vec[i], p[0]);
        inp_vec[i] = op2(inp_vec[i], p[1]);
    }

Since he second is really just operating on the result of the first, we could go a (small) step further:

    const auto n = inp_vec.size();
    for (auto i=0; i<n; ++i) {
        inp_vec[i] = op2(op1(inp_vec[i], p[0]), p[1]);
    }

People who do functional programming call that "composing" the two operations. So we can start with a fairly simple template to compose a couple of operations:

template<class F1, class F2>
class compose {
    F1 one;
    F2 two;
  public:
    compose(F1 one,F2 two) : one(one), two(two) {}

    auto operator()(auto const &input) {
        return two(one(input));
    }
};

Nothing really special here. We just create an object that stores a couple of operations, and when we want to, we can get the result of applying one after another.

The next step that seems obvious (at least to me) would be to use something from the standard library that's already intended to do the sort of thing you seem to want: std::transform. We can use it to handle the looping part of things, so we and up with something like this:

    std::transform(input.begin(), input.end(), input.begin(),
        compose([a = p[0]](double x) { return x * a + 3; }, // op1
            [b = p[1]](double x) { return x / b - 4; }));   // op2

It's not clear to me whether we really need modify_vec as such at all any more. We could write one, that would just be a really thin wrapper around the code above if we wanted to though:

template <class Op1, class Op2>
void modify_vec(std::vector<double> &input, Op1 op1, Op2 op2)
{

    std::transform(input.begin(), input.end(), input.begin(),
        compose(op1, op2));
}

...which we'd call something like this:

modify_vec(input,
    [a = p[0]](double x) { return x * a + 3; },
    [b = p[1]](double x) { return x / b - 4; });

If we wanted to make it even more generic, we could change compose into a variadic template so it could take an arbitrary number of functions to apply, then use a fold expression to compose them. This would let us apply an arbitrary number of functions to each element of the collection. Not clear there's any interest in that though, so I'll leave it at just mentioning the possibility for now.

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • The sequence of operations currently shown in the MWE are only illustrative. In the actual problem there are many other statements within each and between the loops. Thus, I cannot use the `compose` and `std::transform` ideas, but it's cool to learn. – Breno Jul 17 '23 at 09:34
  • Regarding the last part, the "operations" that should be used are specified at runtime, although all of the possible operations are written at compile time. If I use lambdas, like you wrote, they'll never be inlined, right? – Breno Jul 17 '23 at 09:37
  • 1
    @Breno: I'm not aware of anything about lambdas that would prevent inline code generation. If anything, rather the opposite: they tend to be easier to generate inline than if (for one obvious example) you use pointers to functions. – Jerry Coffin Jul 17 '23 at 15:50
1

You could make an array of function pointers that get selected via a non-type template parameter:

constexpr double first_op(const double input, const double p) {
    return input * p + 3.;
}
constexpr double second_op(const double input, const double p) {
    return input / p - 4.;
}

using op_fn = double(*)(const double input, const double p);

template<int N>
void modify_vec_gen(std::vector<double>& inp_vec, const std::vector<double>& p) {
    const static op_fn funcs[] = { first_op, second_op };

    const auto& op = funcs[N];
    const auto n = inp_vec.size();
    for (auto i = 0; i < n; ++i) {
        inp_vec[i] = op(inp_vec[i], p[0]);  
    }
    for (auto i = 0; i < n; ++i) {
        inp_vec[i] = op(inp_vec[i], p[1]);
    }
}

int main() {
    // ...
    modify_vec_gen<0>(inp, p);
}
jwezorek
  • 8,592
  • 1
  • 29
  • 46
  • Cool. I didn't know about non-type template parameters. I put the `funcs[]` array outside the function and did `template` instead of `template`. So now I can get any possible combinations of operations, as in the MWE. And this is inlined by the compiler! (I think, looking at the generated assembly code (which I cannot read very well)) – Breno Jul 17 '23 at 09:45
0

I have found a way to code this that works both for GCC and MSVC. It requires OpenMP for MSVC, though. GCC can do without OpenMP.

The idea is to make a constexpr function containing all possible smaller operations with if-else statements. The conditionals are evaluated using a non-type template parameter (could also be an enum, I think).

Then template the do_sth function with all distinct possible operations and use #pragma omp simd on the for loops. Each possible combination of operations has to be called in a helper/caller function call_do_sth.

This obviously doesn't scale to a large amount of different combinations, but it works if you keep it small (see the discussions in this Stack Overflow post about runtime specification of template parameters).

This can also be seen in Compiler Explorer.

The flags for GCC are: -O3 -ffast-math -mavx2 -fopenmp fopt-info-vec-all And the flags for MSVC are: /O2 /fp:fast /arch:AVX2 /openmp:experimental /Qvec-report:2

#ifdef _WIN32
#define MYRESTRICT __restrict
#else
#define MYRESTRICT __restrict__
#endif

template<int op>
constexpr double funs(const double val1, const double val2) noexcept {
    if (op == 0) { return val1*val2-10.; }
    else if (op == 1) { return val1/val2+4.; }
}

template<int op1, int op2>
void do_sth(double * MYRESTRICT arr, const std::size_t size,
            const double p) {
    std::size_t i;
    #pragma omp simd
    for (i = 0; i < size; ++i) {
            arr[i] = funs<op1>(arr[i], p);
    }
    #pragma omp simd
    for (i = 0; i < size; ++i) {
            arr[i] = funs<op2>(arr[i], p);
    }
}

void call_do_sth(double *arr, const std::size_t size,
                 const double p, const int op1, const int op2) {
    if ((op1 == 0) && (op2 == 0)) {
        do_sth<0, 0>(arr, size, p);
    } else if ((op1 == 0) && (op2 == 1)) {
        do_sth<0, 1>(arr, size, p);
    } else if ((op1 == 1) && (op2 == 0)) {
        do_sth<1, 0>(arr, size, p);
    } else if ((op1 == 1) && (op2 == 1)) {
        do_sth<1, 1>(arr, size, p);
    }
}
Breno
  • 748
  • 8
  • 18