
I'm trying to write some SSE code using Eigen, and some behavior eludes me.

Given code:

#ifndef EIGEN_DONT_VECTORIZE // Not needed with Intel C++ Compiler XE 15.0
#define EIGEN_VECTORIZE_SSE4_2
#define EIGEN_VECTORIZE_SSE4_1
#define EIGEN_VECTORIZE_SSSE3
#define EIGEN_VECTORIZE_SSE3
#endif

#include "stdafx.h"
#include <iostream>
#include <unsupported/Eigen/AlignedVector3>
#include <Eigen/StdVector>
#include <chrono>

int _tmain(int argc, _TCHAR* argv[]) {
    static const int SIZE = 4000000;
    EIGEN_ALIGNED_VECTOR3 Eigen::AlignedVector3<float> A_SSE(1, 1, 1);
    //EIGEN_ALIGNED_VECTOR3 Eigen::AlignedVector3<float> B_SSE(2, 2, 2);
    //std::vector<Eigen::AlignedVector3<float>> C_SSE(SIZE, Eigen::AlignedVector3<float>(0,0,0));


    EIGEN_ALIGNED_VECTOR3 Eigen::AlignedVector3<float> A_SSE1(1, 1, 1);
    EIGEN_ALIGNED_VECTOR3 Eigen::AlignedVector3<float> A_SSE2(1, 1, 1);
    EIGEN_ALIGNED_VECTOR3 Eigen::AlignedVector3<float> A_SSE3(1, 1, 1);
    EIGEN_ALIGNED_VECTOR3 Eigen::AlignedVector3<float> A_SSE4(1, 1, 1);

    EIGEN_ALIGNED_VECTOR3 Eigen::AlignedVector3<float> B_SSE(2, 2, 2);
    EIGEN_ALIGNED_VECTOR3 Eigen::AlignedVector3<float> B_SSE_increment_unroll(16, 16, 16);

    A_SSE2 += B_SSE;
    A_SSE3 = A_SSE2 + B_SSE;
    A_SSE4 = A_SSE3 + B_SSE;


    std::vector<Eigen::AlignedVector3<float>> C_SSE(SIZE, Eigen::AlignedVector3<float>(0, 0, 0));

    auto start2 = std::chrono::system_clock::now();

    // no unroll
    for (int iteration = 0; iteration < SIZE; ++iteration) {
        A_SSE += B_SSE;
        C_SSE[iteration] = A_SSE;
    }

    //// own unroll
    //for (int iteration = 0; iteration < SIZE / 8; ++iteration){
    //  A_SSE1 += B_SSE_increment_unroll;
    //  A_SSE2 += B_SSE_increment_unroll;
    //  A_SSE3 += B_SSE_increment_unroll;
    //  A_SSE4 += B_SSE_increment_unroll;

    //  C_SSE[iteration * 2] = A_SSE1;
    //  C_SSE[iteration * 2 + 1] = A_SSE2;
    //  C_SSE[iteration * 2 + 2] = A_SSE3;
    //  C_SSE[iteration * 2 + 3] = A_SSE4;

    //}

    auto end2 = std::chrono::system_clock::now();
    auto elapsed2 = end2 - start2;
    std::cout << "Eigen aligned vector " << elapsed2.count() << '\n';

    Eigen::Matrix3Xf A = Eigen::Matrix3Xf::Zero(3, SIZE);
    Eigen::Vector3f B(3, 3, 3);
    Eigen::Vector3f C(2, 2, 2);

    auto start1 = std::chrono::system_clock::now();

    for (int iteration = 0; iteration < SIZE; ++iteration) {
        B += C;
        A.col(iteration) = B;
    }
    auto end1 = std::chrono::system_clock::now();
    auto elapsed1 = end1 - start1;
    std::cout << "Eigen matrix " << elapsed1.count() << '\n';


    float *pResult = (float*)_aligned_malloc(SIZE * sizeof(float) * 4, 16); // align to 16-byte for SSE
    auto start3 = std::chrono::system_clock::now();

    __m128 x;
    __m128 xDelta = _mm_set1_ps(2.0f);      // Set xDelta to (2, 2, 2, 2)
    __m128 *pResultSSE = (__m128*) pResult;

    x = _mm_set_ps(1.0f, 1.0f, 1.0f, 1.0f); // Set the initial values of x to (1, 1, 1, 1)

    for (int iteration = 0; iteration < SIZE; ++iteration)
    {
        x = _mm_add_ps(x, xDelta);
        pResultSSE[iteration] = x;
    }

    auto end3 = std::chrono::system_clock::now();
    auto elapsed3 = end3 - start3;
    std::cout << "Own sse " << elapsed3.count() << '\n';

    _aligned_free(pResult); // release the 16-byte-aligned buffer
    return 0;
}

Timing seems odd; on my PC:

  • Eigen aligned vector, unrolled: 20057
  • Eigen aligned vector, no unroll: ~120320
  • Eigen matrix: ~120207 (same as aligned, no unroll)
  • Own SSE: 160784

When I examine the assembly, both the aligned versions and my own SSE use addps and movaps, but until I manually unroll the loop I gain no additional performance, and even when I do, only about half of the runs show any boost. The Eigen Matrix version does not use SSE yet achieves the same performance; its assembly shows the loop unrolled by 16 iterations. Is manual unrolling really that impactful? Should we do it by hand for SSE, and if so, on which CPU properties does it depend?

Edit: To sum up: the SSE version does not perform better because the compiler cannot prove that an unrolled loop would produce the same result as the non-unrolled one, so it cannot hide the latency of the memory stores. In the generated assembly the "single" version uses only one register and keeps incrementing it inside the loop. But if the SSE addition is performed vertically (each float in the aligned vector accumulates the same number of additions), the compiler should be able to prove the equivalence needed for unrolling. Are SSE operations simply left unoptimized by the compiler by default? If unrolling a loop preserves the order of execution, and therefore respects non-associative floating-point math, automatic unrolling should be possible, so why does it not happen, and how can I force the compiler to do it?
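To make the dependency-chain point concrete, here is a minimal sketch (my own illustration, not code from the project above) of the same accumulate-and-store loop written directly with SSE intrinsics, once as a single serial addps chain and once split across four independent accumulators. For these exactly representable values the stored sequence is identical, but in general folding four additions of delta into one addition of 4*delta changes the rounding, which is precisely the equivalence the compiler refuses to assume on its own:

#include <xmmintrin.h> // __m128, _mm_set1_ps, _mm_add_ps

// My own illustration, not code from the post above.
// One accumulator: every addps must wait for the previous one (latency-bound).
// out must be 16-byte aligned (e.g. from _aligned_malloc).
void serial_chain(__m128* out, int n)
{
    __m128 acc   = _mm_set1_ps(1.0f);
    __m128 delta = _mm_set1_ps(2.0f);
    for (int i = 0; i < n; ++i)
    {
        acc = _mm_add_ps(acc, delta);
        out[i] = acc;
    }
}

// Four accumulators: four independent addps chains run in parallel and can
// hide the addps latency. Assumes n is a multiple of 4.
void four_chains(__m128* out, int n)
{
    __m128 acc0 = _mm_set1_ps(3.0f);   // start values chosen so the stored
    __m128 acc1 = _mm_set1_ps(5.0f);   // sequence matches serial_chain
    __m128 acc2 = _mm_set1_ps(7.0f);
    __m128 acc3 = _mm_set1_ps(9.0f);
    __m128 delta4 = _mm_set1_ps(8.0f); // 4 * 2.0f: each chain advances every 4th element
    for (int i = 0; i < n; i += 4)
    {
        out[i + 0] = acc0; acc0 = _mm_add_ps(acc0, delta4);
        out[i + 1] = acc1; acc1 = _mm_add_ps(acc1, delta4);
        out[i + 2] = acc2; acc2 = _mm_add_ps(acc2, delta4);
        out[i + 3] = acc3; acc3 = _mm_add_ps(acc3, delta4);
    }
}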

EDIT: As suggested I ran the test, but the bench utilities from Eigen do not work under Visual Studio 2017, so they were replaced by:

#include <iostream>
#include <vector>
#include <unsupported/Eigen/AlignedVector3>
#include <chrono>
#include <numeric>
#include <algorithm> // std::min_element

EIGEN_DONT_INLINE
void vector_no_unroll(std::vector<Eigen::AlignedVector3<float>>& out)
{
    Eigen::AlignedVector3<float> A_SSE(1, 1, 1);
    Eigen::AlignedVector3<float> B_SSE(2, 2, 2);
    for (auto &x : out)
    {
        A_SSE += B_SSE;
        x = A_SSE;
    }
}

EIGEN_DONT_INLINE
void vector_unrolled(std::vector<Eigen::AlignedVector3<float>>& out)
{
    Eigen::AlignedVector3<float> A_SSE1(1, 1, 1);
    Eigen::AlignedVector3<float> A_SSE2(1, 1, 1);
    Eigen::AlignedVector3<float> A_SSE3(1, 1, 1);
    Eigen::AlignedVector3<float> A_SSE4(1, 1, 1);

    Eigen::AlignedVector3<float> B_SSE(2, 2, 2);
    Eigen::AlignedVector3<float> B_SSE_increment_unroll(16, 16, 16);

    A_SSE2 += B_SSE;
    A_SSE3 = A_SSE2 + B_SSE;
    A_SSE4 = A_SSE3 + B_SSE;
    for (size_t i = 0; i<out.size(); i += 4)
    {
        A_SSE1 += B_SSE_increment_unroll;
        A_SSE2 += B_SSE_increment_unroll;
        A_SSE3 += B_SSE_increment_unroll;
        A_SSE4 += B_SSE_increment_unroll;
        out[i + 0] = A_SSE1;
        out[i + 1] = A_SSE2;
        out[i + 2] = A_SSE3;
        out[i + 3] = A_SSE4;
    }
}

EIGEN_DONT_INLINE
void eigen_matrix(Eigen::Matrix3Xf& out)
{
    Eigen::Vector3f B(1, 1, 1);
    Eigen::Vector3f C(2, 2, 2);

    for (int i = 0; i < out.cols(); ++i) {
        B += C;
        out.col(i) = B;
    }
}

template<int unrolling> EIGEN_DONT_INLINE
void eigen_matrix_unrolled(Eigen::Matrix3Xf& out)
{
    Eigen::Matrix<float, 3, unrolling> B = Eigen::Matrix<float, 1, unrolling>::LinSpaced(3.f, 1 + 2 * unrolling).template replicate<3, 1>();

    for (int i = 0; i < out.cols(); i += unrolling) {
        out.middleCols<unrolling>(i) = B;
        B.array() += float(2 * unrolling);
    }
}

int main() {
    static const int SIZE = 4000000;

    int tries = 30;
    int rep = 10;


    std::vector<int> Timings(tries, 0);
    {
        Eigen::Matrix3Xf A(3, SIZE);
#pragma loop( 1 )
        for (int iter = 0; iter < tries; ++iter)
        {
            auto start1 = std::chrono::system_clock::now();
            eigen_matrix(A);
            Timings[iter] = (std::chrono::system_clock::now() - start1).count();
        }
    }
    std::cout << "eigen matrix Min: " << *std::min_element(Timings.begin(), Timings.end()) << " ms\n";
    std::cout << "eigen matrix Mean: " << std::accumulate(Timings.begin(), Timings.end(), 0) / tries << " ms\n";

    {
        Eigen::Matrix3Xf A(3, SIZE);
#pragma loop( 1 )
        for (int iter = 0; iter < tries; ++iter)
        {
            auto start1 = std::chrono::system_clock::now();
            eigen_matrix_unrolled<4>(A);
            Timings[iter] = (std::chrono::system_clock::now() - start1).count();
        }
    }
    std::cout << "eigen matrix unrolled 4 min: " << *std::min_element(Timings.begin(), Timings.end()) << " ms\n";
    std::cout << "eigen matrix unrolled 4 Mean: " << std::accumulate(Timings.begin(), Timings.end(), 0) / tries << " ms\n";

    {
        Eigen::Matrix3Xf A(3, SIZE);
#pragma loop( 1 )
        for (int iter = 0; iter < tries; ++iter)
        {
            auto start1 = std::chrono::system_clock::now();
            eigen_matrix_unrolled<8>(A);
            Timings[iter] = (std::chrono::system_clock::now() - start1).count();
        }
    }
    std::cout << "eigen matrix unrolled 8 min: " << *std::min_element(Timings.begin(), Timings.end()) << " ms\n";
    std::cout << "eigen matrix unrolled 8 Mean: " << std::accumulate(Timings.begin(), Timings.end(), 0) / tries << " ms\n";

    {
        std::vector<Eigen::AlignedVector3<float>> A(SIZE, Eigen::AlignedVector3<float>(0, 0, 0));
#pragma loop( 1 )
        for (int iter = 0; iter < tries; ++iter)
        {
            auto start1 = std::chrono::system_clock::now();
            vector_no_unroll(A);
            Timings[iter] = (std::chrono::system_clock::now() - start1).count();
        }
    }
    std::cout << "eigen vector min: " << *std::min_element(Timings.begin(), Timings.end()) << " ms\n";
    std::cout << "eigen vector Mean: " << std::accumulate(Timings.begin(), Timings.end(), 0) / tries << " ms\n";

    {
        std::vector<Eigen::AlignedVector3<float>> A(SIZE, Eigen::AlignedVector3<float>(0, 0, 0));
#pragma loop( 1 )
        for (int iter = 0; iter < tries; ++iter)
        {
            auto start1 = std::chrono::system_clock::now();
            vector_unrolled(A);
            Timings[iter] = (std::chrono::system_clock::now() - start1).count();
        }
    }
    std::cout << "eigen vector unrolled min: " << *std::min_element(Timings.begin(), Timings.end()) << " ms\n";
    std::cout << "eigen vector unrolled Mean: " << std::accumulate(Timings.begin(), Timings.end(), 0) / tries << " ms\n";

}

I checked the results on 8 different machines (all Windows) and got the following results:

eigen matrix Min: 110477 ms
eigen matrix Mean: 131691 ms
eigen matrix unrolled 4 min: 40099 ms
eigen matrix unrolled 4 Mean: 54812 ms
eigen matrix unrolled 8 min: 40001 ms
eigen matrix unrolled 8 Mean: 51482 ms
eigen vector min: 100270 ms
eigen vector Mean: 117316 ms
eigen vector unrolled min: 59966 ms
eigen vector unrolled Mean: 65847 ms

The unrolled versions won on every machine I tested except one, which was the oldest. It looks like on newer machines a small amount of unrolling can be quite beneficial (the results range from a 1.5x to a 3.5x speedup with 4x unrolling, and do not improve further even when unrolling 8, 16, 32, or 256 times).
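Regarding the "how can I force the compiler to do it" part above: GCC and Clang expose per-loop unroll hints, sketched below (these pragmas are compiler-specific and, as far as I know, MSVC has no direct unroll pragma). Note that such hints only replicate the loop body; they do not break a loop-carried dependency like A_SSE += B_SSE, so the multiple-accumulator rewrite is still what actually overlaps the additions.

// Hedged sketch of compiler-specific unroll hints (not used in the benchmarks above).
void fill_sequence(float* out, int n)
{
#if defined(__clang__)
#pragma clang loop unroll_count(4)       // Clang: unroll this loop 4 times
#elif defined(__GNUC__)
#pragma GCC unroll 4                     // recent GCC; -funroll-loops works globally
#endif
    for (int i = 0; i < n; ++i)
        out[i] = static_cast<float>(i);  // no loop-carried dependency, so unrolling is safe here
}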

CzakCzan
  • *All* optimisation should be *per-CPU*. It all comes back to the root question: Is it too slow? If the answer is *yes*, then we've avoided premature optimisation. Do you think this might be a premature optimisation? Suppose, is it Intel SSE or C++ SSE? – autistic Sep 25 '17 at 14:22
  • It's not about optimising this routine, but about using SSE in general. By optimisation per CPU, do you mean, for example, counting xmm registers and unrolling to use them all? – CzakCzan Sep 25 '17 at 14:31
  • You unroll to hide ADDPS or FMA latency, as well as to avoid front-end bottlenecks on loop overhead. e.g. for a dot-product: https://stackoverflow.com/questions/45113527/why-does-mulss-take-only-3-cycles-on-haswell-different-from-agners-instruction. – Peter Cordes Sep 25 '17 at 14:47
  • And yes, tuning loops is somewhat specific to the microarchitecture. Unrolling enough to use all the vector registers is usually overkill, and the code size isn't always worth it. (Also, big unrolls make the startup/cleanup more work, since there can be more iterations left over after the unrolled loop.) – Peter Cordes Sep 25 '17 at 14:50
  • Your manually unrolled code is not equivalent to the original code, since floating point math is (generally) non-associative. In your specific case unrolling actually increases the accuracy. – chtz Sep 25 '17 at 15:23
  • @Sebivor Your comment can actually be applied to pretty much every single question in the SSE/AVX/AVX512 tags. So I'd say it goes without saying that anybody who knowingly includes those tags is probably already past the "is it premature optimization?" phase. – Mysticial Sep 25 '17 at 17:26
  • @PeterCordes Unrolling to use all registers has a side-effect of clobbering all the callee-save registers (in Windows). That alone is enough for me to not recommend it unless it's absolutely necessary. – Mysticial Sep 25 '17 at 17:32
  • @Mysticial: oh yeah, hahaha Windows. /facepalm Good point about that ABI. I think 2 to 4 call-preserved xmm registers would have been a good design (e.g. xmm6-9, two non-REX and two REX), not all-but-5. It seems to me that functions which use a couple `double` variables and make function calls often only have a couple of FP variables that would benefit from staying live. Not having *any* call-preserved XMM regs is a downside for the System V ABI, though, but Windows goes *way* too far in the other direction. – Peter Cordes Sep 25 '17 at 17:38
  • @PeterCordes Intel made it impossible to callee-save vector registers now. (There are no instructions to load/store the entire register of unknown length other than XSAVE for context switches.) TBH, I favor this decision since in most code, all the SIMD usage will be concentrated in the leaf functions of a call tree. The callers generally have nothing to save anyway so making them save all the registers is a no-op. – Mysticial Sep 25 '17 at 18:16
  • @Mysticial: I said XMM intentionally, for use with scalar `double` / `float`. Maybe even just the low half of a couple XMM regs. As you say, SIMD code (including 128bit) is almost always in leaf functions. A couple call-preserved regs would probably save code-size overall, especially when code calls non-inline math library functions. AVX code just needs a couple `vmovsd` save/restore instructions to free up the last couple regs if it needs all of them. – Peter Cordes Sep 25 '17 at 18:37
  • @Mysticial I suggest reading the elements of this question which you most identify as "questions". Let us not focus on my questions, as they are directed at the OP. Let us first focus on the OPs "questions". He's asking 1/ Does manual unrolling have an impact? and 2/ Should we manually unroll when the CPU supports SSE? He's asking this in the context of a... *theoretical exercise*, I guess, involving code which just times some SSE operations, and no realistic context. Are *you* going to answer those two questions and tell me your answer has *nothing* to do with premature optimisation? – autistic Sep 25 '17 at 19:30
  • @Sebivor Based on your complaints about this question, I think you should just vote to close instead of complaining to me. – Mysticial Sep 25 '17 at 19:38
  • @Mysticial I was just responding to your ... response directed at me. There's no need to get snappy! It's good to see you're not insane, though... wait... you're not insane, right? – autistic Sep 25 '17 at 21:03
  • @Mysticial Yeh, you're not insane. I'm sure of it. – autistic Sep 25 '17 at 21:04
  • @Sebivor If you have a problem with me or any other users, please take it to [Meta.SO](https://meta.stackoverflow.com/). This isn't the place for this. – Mysticial Sep 25 '17 at 21:22
  • @CzakCzan Please edit your code to provide an actual [mcve] (including all necessary `#include` and a runnable `main` function). Also, please check the timings you stated (it seems you either forgot a 0 or added an extra 0). – chtz Sep 25 '17 at 22:21

1 Answer


Your timings are very inaccurate (when running your code multiple times, I'm getting a lot of variation). For better reproducibility you should run each variant multiple times and take the minimum time. I put together a benchmark using the BenchUtil header that is part of Eigen:

#include <iostream>
#include <unsupported/Eigen/AlignedVector3>
#include <bench/BenchUtil.h>

EIGEN_DONT_INLINE
void vector_no_unroll(std::vector<Eigen::AlignedVector3<float>>& out)
{
    Eigen::AlignedVector3<float> A_SSE(1, 1, 1);
    Eigen::AlignedVector3<float> B_SSE(2, 2, 2);
    for(auto &x : out)
    {
        A_SSE += B_SSE;
        x = A_SSE;
    }
}

EIGEN_DONT_INLINE
void vector_unrolled(std::vector<Eigen::AlignedVector3<float>>& out)
{
    Eigen::AlignedVector3<float> A_SSE1(1, 1, 1);
    Eigen::AlignedVector3<float> A_SSE2(1, 1, 1);
    Eigen::AlignedVector3<float> A_SSE3(1, 1, 1);
    Eigen::AlignedVector3<float> A_SSE4(1, 1, 1);

    Eigen::AlignedVector3<float> B_SSE(2, 2, 2);
    Eigen::AlignedVector3<float> B_SSE_increment_unroll(16, 16, 16);

    A_SSE2 += B_SSE;
    A_SSE3 = A_SSE2 + B_SSE;
    A_SSE4 = A_SSE3 + B_SSE;
    for(size_t i=0; i<out.size(); i+=4)
    {
        A_SSE1 += B_SSE_increment_unroll;
        A_SSE2 += B_SSE_increment_unroll;
        A_SSE3 += B_SSE_increment_unroll;
        A_SSE4 += B_SSE_increment_unroll;
        out[i + 0] = A_SSE1;
        out[i + 1] = A_SSE2;
        out[i + 2] = A_SSE3;
        out[i + 3] = A_SSE4;
    }
}

EIGEN_DONT_INLINE
void eigen_matrix(Eigen::Matrix3Xf& out)
{
    Eigen::Vector3f B(1, 1, 1);
    Eigen::Vector3f C(2, 2, 2);

    for (int i = 0; i < out.cols(); ++i) {
        B += C;
        out.col(i) = B;
    }
}

template<int unrolling> EIGEN_DONT_INLINE
void eigen_matrix_unrolled(Eigen::Matrix3Xf& out)
{
    Eigen::Matrix<float,3,unrolling> B = Eigen::Matrix<float, 1, unrolling>::LinSpaced(3.f, 1+2*unrolling).template replicate<3,1>();

    for (int i = 0; i < out.cols(); i+=unrolling) {
        out.middleCols<unrolling>(i) = B;
        B.array() += float(2*unrolling);
    }
}

int main() {
    static const int SIZE = 4000000;

    int tries = 10;
    int rep = 10;
    BenchTimer t;

    std::cout.precision(4);
    {
        std::vector<Eigen::AlignedVector3<float>> A(SIZE, Eigen::AlignedVector3<float>(0, 0, 0));
        BENCH(t, tries, rep, vector_no_unroll(A));
        std::cout << "no unroll:    " << 1e3*t.best(CPU_TIMER) << "ms\n";
    }
    {
        std::vector<Eigen::AlignedVector3<float>> A(SIZE, Eigen::AlignedVector3<float>(0, 0, 0));
        BENCH(t, tries, rep, vector_unrolled(A));
        std::cout << "unrolled:     " << 1e3*t.best(CPU_TIMER) << "ms\n";
    }
    {
        Eigen::Matrix3Xf A(3, SIZE);
        BENCH(t, tries, rep, eigen_matrix(A));
        std::cout << "eigen matrix: " << 1e3*t.best(CPU_TIMER) << "ms\n";
    }
    {
        Eigen::Matrix3Xf A(3, SIZE);
        BENCH(t, tries, rep, eigen_matrix_unrolled<4>(A));
        std::cout << "eigen unrd<4>: " << 1e3*t.best(CPU_TIMER) << "ms\n";
    }
    {
        Eigen::Matrix3Xf A(3, SIZE);
        BENCH(t, tries, rep, eigen_matrix_unrolled<8>(A));
        std::cout << "eigen unrd<8>: " << 1e3*t.best(CPU_TIMER) << "ms\n";
    }
}

I'm getting quite similar times nearly independent of compiling with -msse2, -msse4.2 or -mavx2:

no unroll:    66.72ms
unrolled:     66.83ms
eigen matrix: 57.56ms
eigen unrd<4>: 50.39ms
eigen unrd<8>: 51.19ms

Noticeably, the AlignedVector3 variants are consistently the slowest, with no significant difference between unrolling or not. The matrix variant takes about 7/8 of that time; manually unrolling the matrix variant (to work on 4 or 8 columns per iteration) reduces the time to roughly 3/4 of the original.

This indicates that memory bandwidth is likely the bottleneck for all vectorized variants. The unrolled matrix variant might be limited by the actual operations (or the manual copying of individual scalars).
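To put a rough number on the bandwidth claim, divide the bytes written per call by the measured time and compare against the machine's known peak memory bandwidth. Below is a minimal sketch (my own back-of-the-envelope helper, not part of the benchmark; the time value is a placeholder whose exact meaning depends on how BENCH counts repetitions):

#include <cstdio>

// My own estimate helper, not part of the answer's benchmark.
// Effective write bandwidth of a streaming-store loop. On most CPUs a plain
// store also triggers a read-for-ownership, so actual bus traffic is roughly
// double the value returned here.
double store_bandwidth_gb_per_s(double bytes_written, double seconds)
{
    return bytes_written / seconds / 1e9;
}

int main()
{
    const double bytes = 4000000.0 * 16.0; // 4M AlignedVector3<float>, 16 bytes each (~64 MB)
    const double t = 6.7e-3;               // placeholder: time of one pass in seconds
    std::printf("~%.1f GB/s written\n", store_bandwidth_gb_per_s(bytes, t));
    return 0;
}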

Benchmarks were made on an Intel Core i5-4210U CPU @ 1.70GHz using g++ 5.4.1 on Ubuntu 16.04 with a relatively recent checkout of the Eigen development branch.

chtz
  • Thanks for the full answer, but is there any way to check for a memory bandwidth limit? Is there any rule of thumb to decide whether to use SSE, or is it so machine-specific that it's impossible to rule out some cases from the start? Honestly, even if I replace the simple addition with sqrt, I do not get any speedup on an Intel Core i7-7700HQ @ 2.8GHz on Windows 10. – CzakCzan Sep 29 '17 at 10:25
  • Regarding measuring memory bandwidth, see e.g.: https://stackoverflow.com/questions/3386042/how-to-measure-memory-bandwidth-utilization-on-windows If you don't have any data dependency between loop iterations, I don't think manual unrolling gives any significant benefit. – chtz Sep 29 '17 at 11:39
  • But of course, there may be some (old or low-level) CPUs where branch prediction is so bad that partial loop-unrolling will add performance (it will also increase instruction cache usage, of course ...) – chtz Sep 29 '17 at 11:41
  • Loop unrolling helps (by reducing loop overhead: pointer increments and branch instructions) if front-end throughput is a bottleneck (e.g. with data hot in L1 or L2 cache). This is easily possible with load + ALU + store loops, especially with gcc which defaults to not unrolling at all (except with profile-guided optimization or (not recommended) `-funroll-all-loops`.) clang does unroll small loops by default. – Peter Cordes Oct 03 '17 at 19:14
  • @PeterCordes Indeed. I played around with some smaller vector sizes and suddenly compiling with AVX also makes a significant difference (for the `eigen unrd<8>` variant). – chtz Oct 03 '17 at 20:58