Expression Template implementation not being optimized

Question

I'm trying to understand the concept of expression templates in C++, as such I've cobbled together pieces of example code etc to produce a simple vector and associated expression template infrastructure to support only binary operators (+,-,*).

Everything compiles, however I've noticed the performance difference between the standard hand written loop versus the expression template variant is quite large. ET is nearly twice as slow as the hand written. I expected a difference but not that much.

A complete code listing can be found here:

https://gist.github.com/BernieWt/769a4a3ceb90bb0cae9e

(apologies for the messy code.)

.

In short I'm essentially comparing the following two loops:

ET:

for (std::size_t i = 0 ; i < rounds; ++i)
{
   v4 = ((v0 - v1) + (v2 * v3)) + v4;
   total += v4[0];
}

HW:

for (std::size_t i = 0 ; i < rounds; ++i)
{
   for (std::size_t x = 0; x < N; ++x)
   {
      v4[x] = (v0[x] - v1[x]) + (v2[x] * v3[x]) + v4[x];
   }
   total += v4[0];
}

When I disassemble the output, the following is produced, the difference is clearly the extra memcpy and several 64-bit loads that occurs during the return of the ET variant:

Standard Loop                           | Expression Template
----------------------------------------+--------------------------------
L26:                                    | L12:
xor   edx, edx                          | xor   edx, edx
jmp   .L27                              | jmp   .L13
L28:                                    | L14:
movsd xmm3, QWORD PTR [rsp+2064+rdx*8]  | movsd xmm3, QWORD PTR [rsp+2064+rdx*8]
L27:                                    | L13:
movsd xmm2, QWORD PTR [rsp+1040+rdx*8]  | movsd xmm1, QWORD PTR [rsp+1552+rdx*8]
movsd xmm1, QWORD PTR [rsp+16+rdx*8]    | movsd xmm2, QWORD PTR [rsp+16+rdx*8]
mulsd xmm2, QWORD PTR [rsp+1552+rdx*8]  | mulsd xmm1, QWORD PTR [rsp+1040+rdx*8]
subsd xmm1, QWORD PTR [rsp+528+rdx*8]   | subsd xmm2, QWORD PTR [rsp+528+rdx*8]
addsd xmm1, xmm2                        | addsd xmm1, xmm2
addsd xmm1, xmm3                        | addsd xmm1, xmm3
movsd QWORD PTR [rsp+2064+rdx*8], xmm1  | movsd QWORD PTR [rsp+2576+rdx*8], xmm1
add   rdx, 1                            | add   rdx, 1
cmp   rdx, 64                           | cmp   rdx, 64
jne   .L28                              | jne   .L14
                                        | mov   dx, 512
                                        | movsd QWORD PTR [rsp+8], xmm0
                                        | lea   rsi, [rsp+2576]
                                        | lea   rdi, [rsp+2064]
                                        | call  memcpy
movsd xmm3, QWORD PTR [rsp+2064]        | movsd xmm0, QWORD PTR [rsp+8]
sub   rcx, 1                            | sub   rbx, 1
                                        | movsd xmm3, QWORD PTR [rsp+2064]
addsd xmm0, xmm3                        | addsd xmm0, xmm3
jne   .L26                              | jne   .L12

My question is: At this point I'm stuck on how to go about removing the copy, I essentially want to update v4 in place without the copy. Any ideas on how to go about doing this?

Note1: I've tried GCC 4.7/9, Clang 3.3, VS2010/2013 - I get roughly the same performance profile on all the compilers mentioned.

Note2: I've also tried forward declaring bin_exp for vec and then adding the following assignment operator and removing the conversion operator from bin_exp,~~but to no avail~~:

template<typename LHS, typename RHS, typename Op>
inline vec<N>& operator=(const bin_exp<LHS,RHS,Op,N>& o)
{
   for (std::size_t i = 0; i < N; ++i)  { d[i] = o[i]; }
   return *this;
}

UPDATE The solution presented in NOTE 2 is actually correct. and does cause the compiler to generate code near identical the the hand written loop.

.

On another note, if I rewrite the use-case for the ET variant to be as follows:

auto expr = ((v0 - v1) + (v2 * v3)) + v4;

//auto& expr = ((v0 - v1) + (v2 * v3)) + v4;   same problem
//auto&& expr = ((v0 - v1) + (v2 * v3)) + v4;   same problem

for (std::size_t i = 0 ; i < rounds; ++i)
{
   v4 = expr
   total += v4[0];
}

A crash occurs because the temporaries (rvalues) that are produced during the instantiation of the ET are destroyed prior to the assignment. I was wondering if there's any way using C++11 to cause a compiler error.

@Adam: Do you mean the disassembly of the initialization of the vectors? — , Nov 21 '13 at 03:12
No, just what those vectors are. To save having to wade through that rather long file. — Adam, Nov 21 '13 at 03:13
@Adam: https://gist.github.com/BernieWt/769a4a3ceb90bb0cae9e#file-gistfile1-txt-L192 and https://gist.github.com/BernieWt/769a4a3ceb90bb0cae9e#file-gistfile1-txt-L214 Please note the disassembly for the initialisation for both variants are identical. — , Nov 21 '13 at 03:15
Are those two equivalent? The expression template one is producing a brand new `std::vector` then assigning it to `v4`, while the hand loop is modifying `v4` in place, I would guess. Change the hand loop to do the same (ie, make a new `v5` each outer loop via `push_back`) to compare apples to apples. To improve the expression template, it has to be aware that it is being assigned to a pre exiting `vector`, which means `vector = expression` has have support to do inplace editing, not just `expression::operator vector`. — Yakk - Adam Nevraumont, Nov 21 '13 at 03:20
@Yakk: push_back? This is not a std::vector. It's intended to be a very simple ET implementation for a bare-bones statically sized vector type. — , Nov 21 '13 at 03:23
Oh and expr templates have to detect rvalues and store copies cia move to avoid that crash problem. Take `T&& t` and store `T` via `std::forward(t)` for that effect. — Yakk - Adam Nevraumont, Nov 21 '13 at 03:23
@Yakk: If it's not too much trouble can you please provide a more detailed answer in the answers section. — , Nov 21 '13 at 03:25
Ah I see the source now. Ya, makes it easier. To test apples to apples, create v5 and then assign to v4. Do that for sanity. Then do the above perfect forwarding. And add `operator=` support to your vec with an expression template on the right (maybe sfinae test enabled) to fill in-place. The expression template needs a `[]` equivalent too. (on phone, inlaws in computer room, what ya gonna do?) I see your `[]` now-- so need `T&&` and `operator=(bin_op)` in `vec`. — Yakk - Adam Nevraumont, Nov 21 '13 at 03:27

score 0 · Answer 1 · edited May 23 '17 at 12:31

0

C++11 introduced move semantics to reduce the number of needless copies.

Your code is fairly obfuscated, but I think this should do the trick

In your struct vec replace

value_type d[N];

with

std::vector<value_type> d;

and add d(N) to the constructor initialization list. std::array is the obvious choice, but that would mean moving each element (i.e. the copy you're trying to avoid).

then add a move constructor:

vec(vec&& from): d(std::move(from.d))
{
}

The move constructor lets the new object "steal" the contents of the old one. In other words, instead of copying the entire vector (array) only the pointer to the array is copied.

edited May 23 '17 at 12:31

Community

1
1

answered Nov 21 '13 at 03:24

Adam

16,808
7
52
98

Already tried adding a move constructor, had no effect, Please note the vector's size is statically typed. Also I'm lead to believe that ETs worked quiet well long before c++11 changes came through... ;) – Nov 21 '13 at 03:26

David Rodríguez - dribeas · Accepted Answer · 2013-11-21T04:16:58.787

The point of expression templates is that the evaluation of the subexpressions can lead to temporaries that would incur a cost and provide no benefit. In your code you are not really comparing apples to apples. The two alternative to compare would be:

// Traditional
vector operator+(vector const& lhs, vector const& rhs);
vector operator-(vector const& lhs, vector const& rhs);
vector operator*(vector const& lhs, vector const& rhs);

With those definitions for the operations, the expression that you want solved:

v4 = ((v0 - v1) + (v2 * v3)) + v4;

Becomes (providing names to all temporaries):

auto __tmp1 = v0 - v1;
auto __tmp2 = v2 * v3;
auto __tmp3 = __tmp1 + __tmp2;
auto __tmp4 = __tmp3 + v4;
// assignment is not really part of the expression
v4 = __tmp4;

As you see there are 4 temporary objects, which if you use expression templates get reduced to the bare minimum: a single temporary since any of those operations generates an out-of-place value.

In your hand rolled version of the code you are not performing the same operations, you are rather unrolling the whole loop and taking advantage of knowledge of the complete operation, and not really the same operation, since by knowing that you would assign at the end of the expression to one of the elements, you transformed the expression into:

v4 += ((v0 - v1) + (v2 * v3));

Now consider what would happen if instead of assigning to one of the vectors that takes part of the expression, you were creating a new vector v5. Try the expression:

auto v5 = ((v0 - v1) + (v2 * v3)) + v4;

The magic of expression templates is that you can provide an implementation for the operators that work on the template that is just as efficient as the manual implementation, and user code is much simpler and less error prone (no need to iterate over all of the elements of the vectors with the potential for errors, or cost of maintenance as the internal representation of the vectors need to be known at each place where an arithmetic operation is performed)

I essentially want to update v4 in place without the copy

With expression templates and your current interface for the vector, you are going to pay for the temporary and the copy. The reason is that during the (conceptual) evaluation of the expression a new vector is created, while it might seem obvious for you that v4 = ... + v4; is equivalent to v4 += ..., that transformation cannot be done by the compiler or the expression template. You could, on the other hand, provide an overload of vector::operator+= (maybe even operator=) that takes an expression template, and does the operation in place.

Providing the assignment operator that assigns from the expression template and building with g++4.7 -O2 this is the generated assembly for both loops:

    call    __ZNSt6chrono12system_clock3nowEv   |    call    __ZNSt6chrono12system_clock3nowEv  
    movl    $5000000, %ecx                      |    movl    $5000000, %ecx                     
    xorpd   %xmm0, %xmm0                        |    xorpd   %xmm0, %xmm0                       
    movsd   2064(%rsp), %xmm3                   |    movsd   2064(%rsp), %xmm3                  
    movq    %rax, %rbx                          |    movq    %rax, %rbx                         
    .align 4                                    |    .align 4                                   
L9:                                             |L15:                                           
    xorl    %edx, %edx                          |    xorl    %edx, %edx                         
    jmp L8                                      |    jmp L18                                    
    .align 4                                    |    .align 4                                   
L32:                                            |L16:                                           
    movsd   2064(%rsp,%rdx,8), %xmm3            |    movsd   2064(%rsp,%rdx,8), %xmm3           
L8:                                             |L18:                                           
    movsd   1552(%rsp,%rdx,8), %xmm1            |    movsd   1040(%rsp,%rdx,8), %xmm2           
    movsd   16(%rsp,%rdx,8), %xmm2              |    movsd   16(%rsp,%rdx,8), %xmm1             
    mulsd   1040(%rsp,%rdx,8), %xmm1            |    mulsd   1552(%rsp,%rdx,8), %xmm2           
    subsd   528(%rsp,%rdx,8), %xmm2             |    subsd   528(%rsp,%rdx,8), %xmm1            
    addsd   %xmm2, %xmm1                        |    addsd   %xmm2, %xmm1                       
    addsd   %xmm3, %xmm1                        |    addsd   %xmm3, %xmm1                       
    movsd   %xmm1, 2064(%rsp,%rdx,8)            |    movsd   %xmm1, 2064(%rsp,%rdx,8)           
    addq    $1, %rdx                            |    addq    $1, %rdx                           
    cmpq    $64, %rdx                           |    cmpq    $64, %rdx                          
    jne L32                                     |    jne L16                                    
    movsd   2064(%rsp), %xmm3                   |    movsd   2064(%rsp), %xmm3                  
    subq    $1, %rcx                            |    subq    $1, %rcx                           
    addsd   %xmm3, %xmm0                        |    addsd   %xmm3, %xmm0                       
    jne L9                                      |    jne L15                                    
    movsd   %xmm0, (%rsp)                       |    movsd   %xmm0, (%rsp)                      
    call    __ZNSt6chrono12system_clock3nowEv   |    call    __ZNSt6chrono12system_clock3nowEv

Interesting comment, though I think comparing the two version is somewhat valid. That said, wrt: "provide an overload of vector::operator+= (maybe even operator=) " Please review Note2. — , Nov 21 '13 at 03:49
@BernieWhitfold: You can also compare the operation performed with vectorized instructions in assembly. That is also a valid comparison. It will be faster than the expression template if you do it appropriately. Does that mean that expression templates are not useful? No, it only means that when you compare it with manual assembly it can be less efficient, or when you compare with manual unrolling of the loop it can be about the same, but when you compare it with the naive approach where user code does not want to iterate over the container but just call the operations it will be better. — David Rodríguez - dribeas, Nov 21 '13 at 03:55
@BernieWhitfold: What do you mean when you say in the second note *to no avail*? — David Rodríguez - dribeas, Nov 21 '13 at 04:05
I was hoping that I there was something I was missing with regards to the final assignment person an overload or some such. I guess one could "retard" the hand-written loop to match the ET variant - but was really wanting to go the other way and optimize the ET — , Nov 21 '13 at 04:11
I tried that and the assembly is the same, it seems like the copy constructor is being invoked - perhaps I should make it private. — , Nov 21 '13 at 04:12
@BernieWhitfold: Providing the assignment expression, I get the same cost for both approaches (hand rolled vs. expression template). Looking at the generated assembly I don't see any notable difference (in particular, no memcpy). — David Rodríguez - dribeas, Nov 21 '13 at 04:13
Which compiler are you using? could you provide details like optimisation level and any special inlining params you may have used. — , Nov 21 '13 at 04:48
That is g++ 4.7 with -O2 running on a mac. I would have tried clang++, but the version that comes in the system does not have the `chrono` header. — David Rodríguez - dribeas, Nov 21 '13 at 04:50
You're absolutely right! I did a clean and recompile with gcc 4.9 using the assignment operator in vec and it actually came to be nearly the same as the hand-written loop. No idea what went wrong. That said thanks for the explanation it was very informative! — , Nov 21 '13 at 06:54

Expression Template implementation not being optimized

2 Answers2

Linked