Performance bottlenecks in fast evaluation of trig functions using Eigen and MEX

Question

In a project using Matlab's C++ MEX API, I have to compute the value exp(j * 2pi * x) for over 100,000 values of x where x is always a positive double. I've written some helper functions that breakdown the computation into sin/cos using euler's formula. I then apply the method of range reduction to reduce my values to their corresponding points in the domain [0,T/4] where T is the period of the exponential I'm computing. I keep track of which quadrant in [0, T] the original value would have fallen into for later. I can then compute the trig function using a taylor series polynomial in horner form and apply the appropriate shift depending on which quadrant the original value was in. For further information on some of the concepts in this technique, check out this answer. Here is the code for this function:

Eigen::VectorXcd calcRot2(const Eigen::Ref<const Eigen::VectorXd>& idxt) {

    Eigen::VectorXd vidxt = idxt.array() - idxt.array().floor();
    Eigen::VectorXd quadrant = (vidxt.array()*2+0.5).floor();
    vidxt.array() -= (quadrant.array()*0.5);
    vidxt.array() *= 2*3.14159265358979;
    const Eigen::VectorXd sq = vidxt.array()*vidxt.array();

    Eigen::VectorXcd M(vidxt.size());
    M.real() = fastCos2(sq);
    M.imag() = fastSin2(vidxt,sq);
    M = (quadrant.array() == 1).select(-M,M);
    return M;
}

I profiled the code segment in which this function is called using std::chrono and averaged over 500 calls to the function (where each call to the mex function processes all 100,000+ values by calling calcRot2 in a loop. Each iteration passes about 200 values to calcRot2). I find the following average runtimes:

runtime with calcRot2:                   75.4694 ms
runtime with fastSin/Cos commented out:  50.2409 ms
runtime with calcRot2 commented out:     30.2547 ms

Looking at the difference between the two extreme cases, it seems like calcRot has a large contribution to the runtime. However, only a portion of that comes from the sin/cos calculation. I would assume Eigen's implicit vectorization and the compiler would make the runtime of the other operations in the function effectively negligible. (floor shouldn't be a problem!) Where exactly is the performance bottleneck here?

This is the compilation command I'm performing (It uses MinGW64 which I think is the same as gcc):

mex(ipath,'CFLAGS="$CFLAGS -O3 -fno-math-errno -ffast-math -fopenmp -mavx2"','LDFLAGS="$LDFLAGS -fopenmp"','DAS.cpp','DAShelper.cpp')

Reference Code

For reference, here is the code segment in the main mex function where the timer is called, and the helper function that calls calcRot2():

MEX function call:

chk1 = std::chrono::steady_clock::now();
// Calculate beamformed signal at each point
Eigen::MatrixXcd bfVec(p.nPoints,1);
#pragma omp parallel for
for (int i = 0; i < p.nPoints; i++) {
    calcPoint(idxt.col(i),SIG,p,bfVec(i));
}
chk2 = std::chrono::steady_clock::now();
auto diff3 = chk2 - chk1;

calcPoint:

void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
               const Eigen::Ref<const Eigen::MatrixXcd>& SIG,
               Parameters& p, std::complex<double>& bfVal) {
    
Eigen::VectorXcd pRot = calcRot2(idxt*p.fc/p.fs);

int j = 0;
for (auto x : idxt) {
    if(x >= 0) {
        int vIDX = static_cast<int>(x);
        bfVal += (SIG(vIDX,j)*(vIDX + 1 - x) + SIG(vIDX+1,j)*(x - vIDX))*pRot(j);
    }
    j++;
}
}

Clarification

To clarify, the line

(vidxt.array()*2+0.5).floor()

is meant to yield:

0 if vidxt is between [0,0.25] 
1 if vidxt is between [0.25,0.75]
2 if vidxt is between [0.75,1]

The idea here is that when vidxt is in the second interval (quadrants 2 and 3 on the unit circle for functions with period 2pi), then the value needs to map to its negative value. Otherwise, the range reduction maps the values to the correct values.

Calling `floor` is not the same as casting to int with flooring. The function also does a bunch of array operations. We don’t know what size array this is, but those are obviously not for free. — Cris Luengo, Dec 27 '21 at 17:58
Actually you might be better off writing a loop over your array and doing the full computation for each array element. That way you don’t iterate over the array over and over again, which is bad for cache usage. — Cris Luengo, Dec 27 '21 at 18:02
In a previous implementation, I iterated over each array element and performed the full computation. The performance was a little worse, about 90 ms in total. I switched to this approach hoping that using Eigen would allow the compiler to vectorize. Also, in that previous implmentation, I was using static cast to int instead of flooring. There was little difference between static cast and floor. When using casting and flooring with Eigen however, casting takes much longer for some reason. — drakon101, Dec 27 '21 at 18:27
For fast sine+cosine, see there https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/DirectXMathVector.inl#L4374 Exponent there: https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/DirectXMathVector.inl#L3393-L3413 That code is for FP32 vectors, but should be relatively easily portable to FP64. — Soonts, Dec 29 '21 at 11:29
In your clarification, `0.25` and `0.75` are each part of two separate inclusive ranges. (And they are exactly representable as binary floats, so it does matter what happens to them). Did you maybe mean `[0, 0.25)` or `(0.25, 0.75]` or `(0.25, 0.75)`? Or you don't care which of the two adjacent indexes you get for those bordeline cases, take whichever allows the most optimization? — Peter Cordes, Jan 05 '22 at 10:44
That's correct, I just take whichever allows more optimization. I figure for float or double input values of idxt, the number of instances where quadrant is exactly on the borderline will be low, so an incorrect assumption here or there will be negligible. — drakon101, Jan 06 '22 at 19:14

Homer512 · Accepted Answer · 2021-12-29T13:59:28.900

The benefits of Eigen's vectorization are outweighed because you evaluate your expressions into temporary vectors. Allocating, deallocating, filling and reading these vectors has cost that seems significant. This is especially so because the expressions themselves are relatively simple (just a few scalar operations).

Expression objects

What usually helps here is aggregating into fewer expressions. For example line 3 and 4 can be collapsed into one: vidxt.array() = 2*3.14159265358979 * (vidxt.array() - quadrant.array()*0.5); (BTW: Note that that math.h contains a constant M_PI with pi in double precision).

Beyond that, Eigen expressions can be combined and reused. Something like this:

    auto vidxt0 = idxt.array() - idxt.array().floor();
    auto quadrant = (vidxt0*2+0.5).floor();
    auto vidxt = 2*3.14159265358979 * (vidxt0 - quadrant.array()*0.5);
    auto sq = vidxt.array().square();

    Eigen::VectorXcd M(vidxt.size());
    M.real() = fastCos2(sq);
    M.imag() = fastSin2(vidxt,sq);
    M = (quadrant.array() == 1).select(-M,M);

Note that none of the auto values are vectors. They are expression objects that behave like arrays and can be evaluated into vectors or arrays.

You can pass these on to your fastCos2 and fastSin2 function by declaring them as templates. The typical Eigen pattern would be something like

template<Derived>
void fastCos2(const Eigen::ArrayBase<Derived>& sq);

The idea here is that ultimately, everything compiles into one huge loop that gets executed when you evaluate the expression into a vector or array. If you reference the same sub-expression multiple times, the compiler may be able to eliminate the redundant computations.

Unfortunately, I could not get any better performance out of this particular code, so it is no real help here but it is still something worth exploring in these kind of cases.

fastSin/Cos return value

Speaking of temporary vectors: You didn't include the code for your fastSin/Cos functions but it looks a lot like you return a temporary vector which is then copied into the real and imaginary parts or the actual return value. This is another temporary that you may want to avoid. Something like this:

template<class Derived1, class Derived2>
void fastCos2(const Eigen::MatrixBase<Derived1>& M, const Eigen::MatrixBase<Derived2>& sq)
{
    Eigen::MatrixBase<Derived1>& M_mut = const_cast<Eigen::MatrixBase<Derived1>&>(M);
    M_mut = sq...;
}

fastCos2(M.real(), sq);

Please refer to Eigen's documentation on the topic of function arguments.

The downside of this approach in this particular case is that now the output is not consecutive (real and imaginary parts are interleaved). This may affect vectorization negatively. You may be able to work around this by combining the sin and cos functions into one expression for both. Benchmarking is required.

Using a plain loop

As others have pointed out, using a loop may be easier in this particular case. You noted that this was slower. I have a theory why: You did not specify -DNDEBUG in your compile options. If you don't, all array indices in Eigen vectors are range-checked with an assertion. These cost time and prevent vectorization. If you include this compile flag, I find my code significantly faster than using Eigen expressions.

Alternatively, you can use raw C pointers to the input and output vector. Something like this:

std::ptrdiff_t n = idxt.size();
Eigen::VectorXcd M(n);
const double* iidxt = idxt.data();
std::complex<double>* iM = M.data();
for(std::ptrdiff_t j = 0; j < n; ++j) {
    double ival = iidxt[j];
    double vidxt = ival - std::floor(ival);
    double quadrant = std::floor(vidxt * 2. + 0.5);
    vidxt = (vidxt - quadrant * 0.5) * (2. * 3.14159265358979);
    double sq = vidxt * vidxt;
    // stand-in for sincos
    std::complex<double> jval(sq, vidxt + sq);
    iM[j] = quadrant == 1. ? -jval : jval;
}

Fixed sized arrays

To avoid the cost of memory allocation and make it easier for the compiler to avoid memory operations in the first place, it can help to run the computation on blocks of fixed size. Something like this:


std::ptrdiff_t n = idxt.size();
Eigen::VectorXcd M(n);
std::ptrdiff_t i;
for(i = 0; i + 4 <= n; i += 4) {
    Eigen::Array4d idxt_i = idxt.segment<4>(i);
    ...
    M.segment<4>(i) = ...;
}
if(i + 2 <= n) {
    Eigen::Array2D idxt_i = idxt.segment<2>(i);
    ...
    M.segment<2>(i) = ...;
    i += 2;
}
if(i < n) {
    // last index scalar
}

This kind of stuff needs careful tuning to ensure that vectorized code is generated and there are no unnecessary temporary values on the stack. If you can read assembler, Godbolt is very helpful.

Other remarks

Eigen includes vectorized versions of sin and cos. Have you compared your code to these instead of e.g. Eigen's complex exp function?
Depending on your math library, there is also an explicit sincos function to compute sine and cosine in one function. It is not vectorized but still saves time on range reduction. You can (usually) access it through std::polar. Try this:

Eigen::VectorXd scale = ...;
Eigen::VectorXd phase = ...;
// M = scale * exp(-2 pi j phase)
Eigen::VectorXd M = scale.binaryExpr(-2. * M_PI * phase,
      [](double s, double p) noexcept -> std::complex<double> {
          return std::polar(s, p);
});

If your goal is an approximation instead of a precise result, shouldn't your first step be to cast to single precision? Maybe after the range reduction to avoid losing too many decimal places. At the very least it will double the work done per clock cycle. Also, regular sine and cosine implementations take less time in float.

Edit

I had to correct myself on the cast to int64 instead of int. There is no vectorized conversion to int64_t until AVX512
The line (vidxt.array()*2+0.5).floor() bugs me slightly. This is meant to round down to negative infinity for [0, 0.5) and up to positive infinity for [0.5, 1), correct? vidxt is never negative. Therefore this line should be equivalent to (vidxt.array()*2).round(). With AVX2 and -ffast-math that saves one instruction. With SSE2 none of these actually vectorize, as can be seen on Godbolt

Thanks for the in depth response! I'm still working through some of these suggestions, I'll come back with an update once I'm done. — drakon101, Dec 28 '21 at 21:26
For double -> int64_t, see Mysticial's hand-rolled emulation of that: [How to efficiently perform double/int64 conversions with SSE/AVX?](https://stackoverflow.com/q/41144668) — Peter Cordes, Dec 29 '21 at 18:37
Since I'm using MEX, would godbolt accurately reflect the assembly code produced for any code fragments I apply it to? I assume it should since the MEX command isn't a compiler itself, and in my case it uses GCC as the compiler. — drakon101, Jan 04 '22 at 23:58
Yep, should work. The ABI for function calls may be different but the loop body should be basically the same — Homer512, Jan 05 '22 at 00:33
I'm still picking up godbolt, but what makes you say that none of those operations vectorize according to your linked example? I see SSE instructions (albeit not highlighted) when using -msse2. I assume they are not highlighted because this code snippet does not actually call the functions, therefore the return line isn't evaluated? — drakon101, Jan 08 '22 at 00:21
@drakon101 You have to look at the kind of instruction. For example in the last link in my answer, in the middle tab (SSE2), the main loop of the first function is at ```.L7```, Most instructions have the suffix "sd" meaning single double. Compare this to the AVX2 case where you find the suffix "pd" for packed double. There are a few packed double instructions in SSE2 (like ```andpd```) but those seem to run without initializing the upper half of the vector register first. I guess there are performance benefits to using them over their single double variants. — Homer512, Jan 08 '22 at 09:48
@drakon101 And yeah, the color highlighting just doesn't seem two work in this particular case. Godbolt usually has a hard time associating lines across complex codes with high optimizations. — Homer512, Jan 08 '22 at 09:53