I want to apply a polynomial of small degree (2-5) to a vector of whose length can be between 50 and 3000, and do this as efficiently as possible. Example: For example, we can take the function: (1+x^2)^3, when x>3 and 0 when x<=3. Such a function would be executed 100k times for vectors of double elements. The size of each vector can be anything between 50 and 3000.
One idea would be to use Eigen: Eigen::ArrayXd v; then simply apply a functor: v.unaryExpr([&](double x) {return x>3 ? std::pow((1+x*x), 3.00) : 0.00;});
Trying with both GCC 9 and GCC 10, I saw that this loop is not being vectorized. I did vectorize it manually, only to see that the gain is much smaller than I expected (1.5x). I also replaced the conditioning with logical AND instructions, basically executing both branches and zeroing out the result when x<=3. I presume that the gain came mostly from the lack of branch misprediction.
Some considerations There are multiple factors at play. First of all, there are RAW dependencies in my code (using intrinsics). I am not sure how this affects the computation. I wrote my code with AVX2 so I was expecting a 4x gain. I presume that this plays a role, but I cannot be sure, as the CPU has out-of-order-processing. Another problem is that I am unsure if the performance of the loop I am trying to write is bound by the memory bandwidth.
Question How can I determine if either the memory bandwidth or pipeline hazards are affecting the implementation of this loop? Where can I learn techniques to better vectorize this loop? Are there good tools for this in Eigenr MSVC or Linux? I am using an AMD CPU as opposed to Intel.