Simple C++ expression templates wrapping intrinsics produces different instructions

Question

I am testing out a very simple program that is using C++ expression templates to simplify writing SSE2 and AVX code that operates on arrays of values.

I have a class svec which represents an array of values.

I have a class sreg that represents an SSE2 double register.

I have expr and add_expr representing the addition of svec arrays.

The compiler produces three extra instructions per loop for my expression template test case compared to hand rolled code. I was wondering if there is a reason for this, or any changes I can make to get he compiler to produce the same output?

The full test harness is:

#include <iostream>
#include <emmintrin.h>

struct sreg
{
    __m128d reg_;

    sreg() {}

    sreg(const __m128d& r) :
        reg_(r)
    {
    }

    sreg operator+(const sreg& b) const
    {
        return _mm_add_pd(reg_, b.reg_);
    }
};

template <typename T>
struct expr
{
    sreg operator[](std::size_t i) const
    {
        return static_cast<const T&>(*this).operator[](i);
    }

    operator const T&() const
    {
        return static_cast<const T&>(*this);
    }
};

template <typename A, typename B>
struct add_expr : public expr<add_expr<A, B>>
{
    const A& a_;
    const B& b_;

    add_expr(const A& a, const B& b) :
        a_{ a }, b_{ b }
    {
    }

    sreg operator[](std::size_t i) const
    {
        return a_[i] + b_[i];
    }
};

template <typename A, typename B>
inline auto operator+(const expr<A>& a, const expr<B>& b)
{
    return add_expr<A, B>(a, b);
}

struct svec : public expr<svec>
{
    sreg* regs_;
    std::size_t size_;

    svec(std::size_t size) :
        size_{ size }
    {
        regs_ = static_cast<sreg*>(_aligned_malloc(size * 32, 32));
    }

    ~svec()
    {
        _aligned_free(regs_);
    }

    template <typename T>
    svec& operator=(const T& expression)
    {
        for (std::size_t i = 0; i < size(); i++)
        {
            regs_[i] = expression[i];
        }

        return *this;
    }

    const sreg& operator[](std::size_t index) const
    {
        return regs_[index];
    }

    sreg& operator[](std::size_t index)
    {
        return regs_[index];
    }

    std::size_t size() const
    {
        return size_;
    }
};

static constexpr std::size_t size = 64;

int main()
{
    svec a(size);
    svec b(size);
    svec c(size);
    svec d(size);
    svec vec(size);

    //hand rolled loop
    for (std::size_t j = 0; j < size; j++)
    {
        vec[j] = a[j] + b[j] + c[j] + d[j];
    }

    //expression templates version of hand rolled loop
    vec = a + b + c + d;

    std::cout << "Done...";

    std::getchar();

    return EXIT_SUCCESS;
}

For the hand rolled loop the instructions are:

00007FF621CD1B70  mov         r8,qword ptr [c]  
00007FF621CD1B75  mov         rdx,qword ptr [b]  
00007FF621CD1B7A  mov         rax,qword ptr [a]  
00007FF621CD1B7F  vmovupd     xmm0,xmmword ptr [rcx+rax]  
00007FF621CD1B84  vaddpd      xmm1,xmm0,xmmword ptr [rdx+rcx]  
00007FF621CD1B89  vaddpd      xmm3,xmm1,xmmword ptr [r8+rcx]  
00007FF621CD1B8F  lea         rax,[rcx+rbx]  
00007FF621CD1B93  vaddpd      xmm1,xmm3,xmmword ptr [r10+rax]  
00007FF621CD1B99  vmovupd     xmmword ptr [rax],xmm1  
00007FF621CD1B9D  add         rcx,10h  
00007FF621CD1BA1  cmp         rcx,400h  
00007FF621CD1BA8  jb          main+0C0h (07FF621CD1B70h)

For the expression templates version:

00007FF621CD1BC0  mov         rdx,qword ptr [c]  
00007FF621CD1BC5  mov         rcx,qword ptr [rcx]  
00007FF621CD1BC8  mov         rax,qword ptr [r8]  
00007FF621CD1BCB  vmovupd     xmm0,xmmword ptr [r9+rax]  
00007FF621CD1BD1  vaddpd      xmm1,xmm0,xmmword ptr [rcx+r9]  
00007FF621CD1BD7  vaddpd      xmm0,xmm1,xmmword ptr [rdx+r9]  
00007FF621CD1BDD  lea         rax,[r9+rbx]  
00007FF621CD1BE1  vaddpd      xmm0,xmm0,xmmword ptr [rax+r10]  
00007FF621CD1BE7  vmovupd     xmmword ptr [rax],xmm0  
00007FF621CD1BEB  add         r9,10h  
00007FF621CD1BEF  cmp         r9,400h  
00007FF621CD1BF6  jae         main+154h (07FF621CD1C04h)  # extra instruction 1
00007FF621CD1BF8  mov         rcx,qword ptr [rsp+60h]     # extra instruction 2
00007FF621CD1BFD  mov         r8,qword ptr [rsp+58h]      # extra instruction 3
00007FF621CD1C02  jmp         main+110h (07FF621CD1BC0h)

Please note this is minimum verifiable code to specifically demonstrate a problem. The code was compiled using the default Release build settings in Visual Studio 2015 Update 3.

Ideas I have discounted:

the order of the loops (I have already switched the hand rolled loop and the expression templates loop to check if the compiler still inserts the extra instructions and it does)
the compiler is optimising the hand rolled loop based on the constexpr size (I have already tried test code that prevents the compiler deducing that size is constant to better optimise the hand rolled loop and it makes no difference to the hand rolled loop's instructions).

The first extra instruction is not really extra. Others are probably from printing "Done". — Ivan Aksamentov - Drop, Dec 01 '16 at 10:42
@Drop, it's not a bad guess. I already compiled the code switching round the hand rolled loop and the expression templates version of the hand rolled loop to avoid any compiler issues with the ordering of the loops and it made no difference unfortunately. — keith, Dec 01 '16 at 10:47
As a side note: in order to make disassembly meaningful, you better provide the input data dynamically (i.e. input should not be not known at compile time). Otherwise, you will see a lot of random pink elephants: smart compiler can cheat and optimize your code away fully or partially if it is capable to evaluate the result as a constant expression — Ivan Aksamentov - Drop, Dec 01 '16 at 10:53
[g++7 seems to produce optimal assembly](https://godbolt.org/g/P63XsN) for the expression template version. clang++4 has some trouble optimizing it — Vittorio Romeo, Dec 01 '16 at 10:54
@Drop, thanks for the heads up about pink elephants, I am seeing the compiler match the code when I strip everything out except the two loops (i.e. get rid of the std::cout/std::getchar), but as soon as I add any other code in, the mysterious extra instructions appear - very odd. — keith, Dec 01 '16 at 11:33
@VittorioRomeo: that gcc7 has optimized the loop away completely. clang3.9 still computes the result, and auto-vectorizes with no reloads in the loop. (It does increment every pointer separately, instead of using an indexed addressing mode, which is weird for the default `-mtune=generic`. It's [at best break-even on Sandybridge-family pre-Skylake](http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695), and worse on everything else where micro-fusion works (including Nehalem, and AMD CPUs). With `-march=haswell`, it unrolls by 2 at least.) — Peter Cordes, Dec 01 '16 at 12:01

score 3 · Accepted Answer · answered Dec 01 '16 at 11:55

Both loops seem to be reloading the array pointers every iteration. (e.g. mov r8, [c] in the first loop). The second version is just doing it even more inefficiently, with two levels on indirection. One of them coming at the end of the loop, after a conditional branch to break out of the loop.

Note that one of the changed instructions which you didn't identify as "new" is mov rcx, [rcx]. Register allocation is different between the loops, but those are the array start pointers. It (and the rcx,[rsp+60h] after the store) are replacing mov rax,qword ptr [a]. I assume a is also an offset from RSP, and not actually a label for static storage.

Presumably this is happening because MSVC++ didn't succeed at alias analysis to prove that the stores into vec[j] can't modify any of the pointers. I didn't look carefully at your templates, but if you're introducing an extra level of indirection that you'd expect to optimize away, the problem is that it isn't.

The obvious solution is to use a compiler that optimizes better. clang3.9 does well (auto-vectorizing with no reloads of pointers), and gcc optimizes it away completely because the result is not used.

But if you're stuck with MSVC, see if there are any strict-aliasing options, or no-aliasing keywords or declarations, that would be helpful. e.g. GNU C++ extensions include __restrict__ to get the same "this doesn't alias" behaviour as C99's restrict keyword. IDK if MSVC supports anything like that.

Nit-pick:

It's not quite right to call jae an "extra" instruction. It's just the opposite predicate from jb, so now it's a while(true){ ... if() break; reload; } loop instead of a more-efficient do{...}while() loop. (I'm using C syntax to show the asm loop structure. Obviously if you actually compiled those C loops, the compiler could optimize them.) So if anything, the "extra instruction" is the unconditional branch, JMP.

This is spot on, I reworked the code to get rid of the aliasing problem and it's now optimising perfectly. I would not have been able to make this step on my own, many many thanks :-) — keith, Dec 01 '16 at 14:00
@keith: cool, glad to hear this led to a practical improvement :) — Peter Cordes, Dec 01 '16 at 14:12

keith · Answer 2 · 2016-12-01T15:58:11.300

For anyone else who stumbles across this, here's a non-aliasing version that MSVC can optimise without the issue I was seeing. I've had to use some template meta programming to stop the operator overload being too greedy. Wonder if there is a simpler way...

#include <iostream>
#include <utility>
#include <type_traits>
#include <emmintrin.h>

class sreg
{
    using reg_type = __m128d;

public:
    reg_type reg_;

    sreg() {}

    sreg(const reg_type& r) :
        reg_(r)
    {
    }

    sreg operator+(const sreg& b) const
    {
        return _mm_add_pd(reg_, b.reg_);
    }
};

struct expr
{
};

template <typename... Ts>
struct meta_or : std::false_type
{
};

template <typename T, typename... Ts>
struct meta_or<T, Ts...> : std::integral_constant<bool, T::value || meta_or<Ts...>::value>
{
};

template <class... T>
using meta_is_expr = meta_or<std::is_base_of<expr, std::decay_t<T>>..., std::is_base_of<expr, T>...>;

template <class... T>
using meta_enable_if_expr = std::enable_if_t<meta_is_expr<T...>::value>;

template <typename A, typename B>
struct add_expr : public expr
{
    A a_;
    B b_;

    add_expr(A&& a, B&& b) :
        a_{ std::forward<A>(a) }, b_{ std::forward<B>(b) }
    {
    }

    sreg operator[](std::size_t i) const
    {
        return a_[i] + b_[i];
    }
};

template <typename A, typename B, typename = meta_enable_if_expr<A, B>>
inline auto operator+(A&& a, B&& b)
{
    return add_expr<A, B>{ std::forward<A>(a), std::forward<B>(b) };
}

struct svec : public expr
{
    sreg* regs_;;
    std::size_t size_;

    svec(std::size_t size) :
        size_{ size }
    {
        regs_ = static_cast<sreg*>(_aligned_malloc(size * 32, 32));
    }

    ~svec()
    {
        _aligned_free(regs_);
    }

    template <typename T>
    svec& operator=(const T& expression)
    {
        for (std::size_t i = 0; i < size(); i++)
        {
            regs_[i] = expression[i];
        }

        return *this;
    }

    const sreg& operator[](std::size_t index) const
    {
        return regs_[index];
    }

    sreg& operator[](std::size_t index)
    {
        return regs_[index];
    }

    std::size_t size() const
    {
        return size_;
    }
};

static constexpr std::size_t size = 64;

int main()
{
    svec a(size);
    svec b(size);
    svec c(size);
    svec d(size);
    svec vec(size);

    //hand rolled loop
    for (std::size_t j = 0; j < size; j++)
    {
        vec[j] = a[j] + b[j] + c[j] + d[j];
    }

    //expression templates version of hand rolled loop
    vec = a + b + c + d;

    std::cout << "Done...";

    std::getchar();

    return EXIT_SUCCESS;
}

Many thanks to @Peter Cordes for the right clue who has requested a bit of information about how the "expression" works.

For our svec the single loop happens in the assignment operator:

template <typename T>
svec& operator=(const T& expression)
{
    for (std::size_t i = 0; i < size(); i++)
    {
        regs_[i] = expression[i];
    }

    return *this;
}

The operator overload:

template <typename A, typename B, typename = meta_enable_if_expr<A>>
inline auto operator+(A&& a, B&& b)
{
    return add_expr<A, B>{ std::forward<A>(a), std::forward<B>(b) };
}

is responsible for forcing the compiler to build up an expression tree for us. By overloading the + operator on sreg and looping through our data as if it were an array of sreg the compiler will inline our expression as operators on our sreg intrinsic wrapper representing __m128d.

Each specialisation of the expression expr is a kind of functor over an sreg. I've just implemented expr_add for testing purposes.

I've seen wrappers for `__m128` / `__m128i` / `__m256...` before (e.g. Agner Fog's GPL-licensed [vectorclass C++ templates](http://www.agner.org/optimize/#vectorclass)) with operator+ and so on, but I haven't see wrappers for arbitrary-length containers before. Probably because it's hard to ensure that you don't end up with separate loops for each operator, e.g. storing `a+b`, then reading it back while adding vectors from `c`. — Peter Cordes, Dec 01 '16 at 14:17
I'm not really a template wizard, so does this actually take steps to make sure all the operations happen in one loop? If so, could you maybe add something about how it does so? If it's reliable, that's a potentially-useful thing to have. — Peter Cordes, Dec 01 '16 at 14:20
@Peter Cordes, I've added some further information, you only ever get one loop when operating on multiple "arrays" or expressions. The `sreg` would be similar to what Agner Fog has produced. — keith, Dec 01 '16 at 14:32

Simple C++ expression templates wrapping intrinsics produces different instructions

2 Answers2