
Assume I have a std::vector of double, namely

std::vector<double> MyVec(N);

where N is so big that performance matters. Now assume that MyVec is a nontrivial vector (i.e., it is not a vector of zeros, but has been modified by some routine). Now, I need the negated version of the vector: I need -MyVec.

So far, I have been implementing it via

std::transform(MyVec.cbegin(),MyVec.cend(),MyVec.begin(),std::negate<double>());

But, really, I do not know whether this is sensible or just super naïve on my part.

Am I doing it correctly? Or is std::transform just a super slow routine in this case?

PS: I am using the BLAS and LAPACK libraries all the time, but I have not found anything that matches this particular need. However, if there exists such a function in BLAS/LAPACK that is faster than std::transform, I would be glad to know.

Andrew T.
enanone
  • The most efficient way would be to not do it at all. What I mean is the following: just use the negated values directly wherever you would use the negated vector, e.g. `for (auto e : MyVec) { std::cout << - e << " ";}` instead of `for (auto e : MyVec) { std::cout << e << " ";}` – 463035818_is_not_an_ai Nov 15 '17 at 14:47
  • @tobi303 or negate it as you insert it. – Steve Nov 15 '17 at 14:48
  • As with most optimization questions, it depends on your use case. Are you writing a general purpose linear algebra library with negation? Then this is a good question. Is this for a specific algorithm you're writing? _Profile it_ and see if your current code is actually a bottleneck. – Max Nov 15 '17 at 14:50
  • For vector algebra, you probably want a different abstraction than std::vector. Take a look at the Eigen (http://eigen.tuxfamily.org) library, where you could just say `-MyVec`. – Peter Nov 15 '17 at 14:51
  • Since a `std::vector` stores its data in contiguous memory no different than an array, instead of thinking of `std::vector`, how would you optimize this if you have an array of `double` values? update: ok, the answer given basically answers that question. – PaulMcKenzie Nov 15 '17 at 14:56
  • If your vector has any significant size (= cannot be fully cached), the algorithm that you use for inversion is pretty much irrelevant, your bottleneck will be memory itself. Just a few numbers: A `double` is 8 bytes, and let's assume that your CPU can do an inversion per nanosecond. That's already 8 GB per second that first need to be read, and then need to be stored, so 16 GB per second on the memory bus. Take into account that current CPUs are faster than that when it comes to pure arithmetic instructions, and you know that the memory bus will determine your speed. – cmaster - reinstate monica Nov 15 '17 at 15:16
  • @cmaster: what kind of crappy CPU can only do one inversion per ns? 16GB per second doesn't saturate the memory bus in recent desktop CPUs; DDR3-1600 is about 25GB/s (in one direction). But even with scalar instructions only, all modern x86 CPUs can do one inversion per clock cycle, so that's more like 1/3rd of a ns (on a 3GHz CPU). Compilers easily auto-vectorize with SSE2 or AVX, so that's 2 or 4 `double`s per clock. (Well actually gcc doesn't unroll loops at all by default, so it's more like one per 1.5 clocks front-end bottleneck https://godbolt.org/g/BqqDCS). – Peter Cordes Nov 16 '17 at 06:18
  • But yes, you do fairly easily saturate memory bandwidth, even rewriting in-place (which is somewhat faster than writing to new memory, because that involves separate read-for-ownership traffic unless you use NT stores). Still, tobi303's suggestion is by far the best: don't negate in place, fold it into the next step so it can use `FNMADD` instead of `FMADD` or `sub` instead of `add` for no extra cost. – Peter Cordes Nov 16 '17 at 06:20
  • Is it bad if I want to answer the title question with `Go see your doctor and get some pills`? :) – flith Nov 16 '17 at 07:25
  • @PeterCordes I agree with everything you said. I just didn't feel in the mood to look up any solid numbers, so I went for a ridiculously low *lower bound* to illustrate my point. Thanks for providing better numbers :-) – cmaster - reinstate monica Nov 16 '17 at 08:58

5 Answers

#include <vector>
#include <algorithm>
#include <functional> 
void check()
{
    std::vector<double> MyVec(255);
    std::transform(MyVec.cbegin(),MyVec.cend(),MyVec.begin(),std::negate<double>());
}

This code on https://godbolt.org/ with compile option -O3 generates nice assembly:

.L3:
[...]
  cmp r8, 254
  je .L4
  movsd xmm0, QWORD PTR [rdi+2032]
  xorpd xmm0, XMMWORD PTR .LC0[rip]
  movsd QWORD PTR [rdi+2032], xmm0
.L4:

It's difficult to imagine anything faster. Your code is already fine: don't try to outsmart the compiler. Use clean C++ code; it works almost every time.

ColdCat
  • There will be a processor stall on the branch and you will lose cycles based on the latency of `xorpd`. A more optimised version would typically repeat the operation several times before branching. – keith Nov 15 '17 at 17:15
  • @keith I doubt that is true on modern x86 CPUs. They are smart enough to predict several branches ahead and also to rename registers. – jpa Nov 15 '17 at 18:24
  • I would recommend watching this https://www.youtube.com/watch?v=2EWejmkKlxs to see how the processor is not going to stall on that branch by speculatively taking the path that continues the loop instead of exiting the loop. The video is long and in this case it's the second half of the video that is relevant here but it's worth watching anyway. – navyblue Nov 15 '17 at 18:36
  • @jpa The processor may not stall, but loop unrolling is still beneficial if the processor ends up micro-op or decode limited (which it likely will be unless it is memory bound). – EOF Nov 15 '17 at 19:06
  • This generated code is actually quite amazing given the high level nature of this C++ code. – usr Nov 15 '17 at 19:24
  • If you switch to clang-5.0.0 on the Compiler Explorer, you'll [see](https://godbolt.org/g/SAhvUj) it unroll the loop a bit. (Be sure to change `cbegin` and `cend` to `begin` and `end`, otherwise it'll not compile). – Ruslan Nov 15 '17 at 20:21
  • @keith If you turn on loop unrolling the compiler will decide that as the case may be. – wizzwizz4 Nov 15 '17 at 22:40
  • The code you show isn't a loop (no branch back to the top), and it's loading a **scalar** double (`movSd`). gcc and clang do in fact make nice-ish asm if you use `-march=haswell` (to enable AVX + AVX2 and tune for Haswell), but gcc's does kinda suck (6 fused-domain uops on Haswell, in a non-unrolled loop with two loop counters and indexed addressing modes.) On Godbolt (https://godbolt.org/g/BqqDCS) it's from `.L6` to the `jb .L6`. – Peter Cordes Nov 16 '17 at 06:13
  • @Ruslan: Or use `-std=gnu++11` or `14` for clang, because unlike recent gcc that's not the default. – Peter Cordes Nov 16 '17 at 07:34
  • Unrolling the loop does not always help. A large number of instructions can pollute the instruction cache leading to a performance hit. It is best to leave it to the compiler to decide if the loop must be unrolled. AFAIK clang is pretty good at unrolling. – lakshayg Nov 16 '17 at 14:51
  • @LakshayGarg: gcc without profile-guided optimization never unrolls, even for *tiny* loops, which is often sub-optimal when they actually are hot. You're right that with clang the compiler usually chooses well, though. – Peter Cordes Nov 16 '17 at 18:56

Fortunately the data in std::vector is contiguous, so you can multiply by -1 using vector intrinsics (using unaligned load/stores and special handling of the possible overflow). Or use ippsMulC_64f/ippsMulC_64f_I from Intel's IPP library (you'll struggle to write something faster), which will use the largest vector registers available to your platform: https://software.intel.com/en-us/ipp-dev-reference-mulc

Update: to clear up some confusion in the comments, the full version of Intel IPP is free (although you can pay for support) and comes on Linux, Windows and macOS.

keith
  • If you combine this suggestion (intrinsics) with OpenMP, you would get the fastest possible solution; note that this is entirely memory-bound for large arrays since it has very little arithmetic; non-temporal stores can also help depending on the case. Speaking of BLAS, a DAXPY may be the fastest routine anyway - again, you are limited by memory bandwidth, not arithmetic. Also, recent enough compilers may be able to vectorize/parallelize std::transform anyway... You have to measure / look at disassembly to know... – Stefan Atev Nov 15 '17 at 16:00
  • IMHO IPP is the best option, but I think you should say that it's not open-source or free, and its cost starts from $249. Yes, they have free products, but only for Linux and with limited functionality – borisbn Nov 15 '17 at 16:43
  • @borisbn, there has been a free option for a while now (for all platforms and full functionality); the only difference between the paid option and the free option is that the free option comes with no support except the intel forums for help and raising bugs – keith Nov 15 '17 at 16:50
  • Negating by flipping the sign bit (with `xorpd`) is more power-efficient than calling a general-purpose function that actually multiplies. But if you don't have access to a well-tuned loop to do the bit-flip and can't write one yourself, using a scalar * vector function is not bad for performance. – Peter Cordes Nov 16 '17 at 06:04

As others have mentioned, it completely depends on your use case. Probably the simplest way would be something like this:

 struct MyNegatingVect {
     MyVect data;
     bool negated = false;
     void negate() { negated = !negated; }
     // ... setter and getter need indirection ...
     // ... for example:
     MyVect::data_type at(size_t index) { return negated ? -data.at(index) : data.at(index); }
 };

Whether this additional indirection for each single access is worth transforming the negation into setting a single bool depends, as already mentioned, on your use case (actually I doubt that there is a use case where this would bring any measurable benefit).

porges
463035818_is_not_an_ai
  • @PeteBecker I actually meant "most simplistic". Anyhow I take your edit as a compliment ;) – 463035818_is_not_an_ai Nov 15 '17 at 15:17
  • Feel free to roll it back if it distorts your meaning. – Pete Becker Nov 15 '17 at 16:50
  • This is by far the best answer, especially if the vector is more than a couple SIMD vectors long. Fold the negation into a later (or earlier step). If you do this right so the compiler can still optimize, the next step can use `subpd` instead of `addpd`, or `fnmadd` instead of `fmadd` to do the negation for free. This only works if it makes two versions of the whole loop, one for negated and one for not negated. (It may need your help to do that, by using an `if()` around your whole loop if your compiler does a poor job on its own.) – Peter Cordes Nov 16 '17 at 06:28
  • But even if it doesn't clone a loop, on x86 you negate using a SIMD `xorpd` to flip the sign bit. So hopefully outside the loop, the compiler could set up a constant that's either `0` or `0x8000...`, and `xor` with it inside the loop (to either negate or not depending on how the constant was set up outside the loop). – Peter Cordes Nov 16 '17 at 06:30
  • I had thought about this but I really don't like the performance hit of the extra branch on each access. – Surt Nov 16 '17 at 06:38
  • It might be more efficient to remove the branching and instead multiply by 1 or -1 (but it probably depends how often you're switching the sign). – Bernhard Barker Nov 16 '17 at 08:29
  • @Dukeling the answer was actually only meant as an academic exercise by taking the question too literally. What I would really do is what I mentioned in a comment: Not negate the vector at all. And this answer was just an attempt to back that up with some code, but imho without more information on what the vector would be used for, it is impossible to find "the fastest way" – 463035818_is_not_an_ai Nov 16 '17 at 08:40

First, a generic negate function for arithmetic type vectors as an example:

#include <type_traits>
#include <vector>

...

template <typename arithmetic_type> std::vector<arithmetic_type> &
negate (std::vector<arithmetic_type> & v)
{
    static_assert(std::is_arithmetic<arithmetic_type>::value,
        "negate: not an arithmetic type vector");

    for (auto & vi : v) vi = - vi;

    // note: anticipate that a range-based for may be more amenable
    // to loop-unrolling, vectorization, etc., due to fewer compiler
    // template transforms, and contiguous memory / stride.

    // in theory, std::transform may generate the same code, despite
    // being less concise. very large vectors *may* possibly benefit
    // from C++17's 'std::execution::par_unseq' policy?

    return v;
}

Your wish for a canonical unary operator - function is going to require the creation of a temporary, in the form:

std::vector<double> operator - (const std::vector<double> & v)
{
    auto ret (v); return negate(ret);
}

Or generically:

template <typename arithmetic_type> std::vector<arithmetic_type>
operator - (const std::vector<arithmetic_type> & v)
{
    auto ret (v); return negate(ret);
}

Do not be tempted to implement the operator as:

template <typename arithmetic_type> std::vector<arithmetic_type> &
operator - (std::vector<arithmetic_type> & v)
{
    return negate(v);
}

While (- v) will negate the elements and return the modified vector without the need for a temporary, it breaks mathematical conventions by effectively setting v = - v. If that's your goal, then use the negate function instead. Don't break expected operator evaluation!


clang, with avx512 enabled, generates this loop, negating an impressive 64 doubles per iteration (between pre/post length handling):

        vpbroadcastq    LCPI0_0(%rip), %zmm0
        .p2align        4, 0x90
LBB0_21:
        vpxorq  -448(%rsi), %zmm0, %zmm1
        vpxorq  -384(%rsi), %zmm0, %zmm2
        vpxorq  -320(%rsi), %zmm0, %zmm3
        vpxorq  -256(%rsi), %zmm0, %zmm4
        vmovdqu64       %zmm1, -448(%rsi)
        vmovdqu64       %zmm2, -384(%rsi)
        vmovdqu64       %zmm3, -320(%rsi)
        vmovdqu64       %zmm4, -256(%rsi)
        vpxorq  -192(%rsi), %zmm0, %zmm1
        vpxorq  -128(%rsi), %zmm0, %zmm2
        vpxorq  -64(%rsi), %zmm0, %zmm3
        vpxorq  (%rsi), %zmm0, %zmm4
        vmovdqu64       %zmm1, -192(%rsi)
        vmovdqu64       %zmm2, -128(%rsi)
        vmovdqu64       %zmm3, -64(%rsi)
        vmovdqu64       %zmm4, (%rsi)
        addq    $512, %rsi              ## imm = 0x200
        addq    $-64, %rdx
        jne     LBB0_21

gcc-7.2.0 generates a similar loop, but appears to insist on indexed addressing.

Brett Hale
  • It's too bad gcc picks the worst of all worlds in that loop. It uses indexed addressing modes instead of a pointer increment even though the load and store addresses are the same. And it uses a separate loop counter instead of a `cmp` against an end address. So on Haswell it's 6 fused-domain uops (`vxorpd` doesn't micro-fuse with an indexed addressing mode, but the store does), with a front-end bottleneck of one iter per 1.5 clocks. A simple pointer-increment loop could easily be 4 uops, running one iter per 1 clock (if data is hot in L1D or maybe L2). – Peter Cordes Nov 16 '17 at 06:35
  • @PeterCordes - I neglected to mention the possible loop unrolling for avx / avx512. The first listing turned out to be 'tail' handling... – Brett Hale Nov 16 '17 at 16:49

Use for_each

std::for_each(MyVec.begin(), MyVec.end(), [](double& val) { val = -val; });

or C++17 parallel

std::for_each(std::execution::par_unseq, MyVec.begin(), MyVec.end(), [](double& val) { val = -val; });
Surt
  • Why is that better than `std::transform` with `std::negate`? It also allows an execution policy, and compiles efficiently with gcc and clang. – Peter Cordes Nov 16 '17 at 06:38
  • Transform has an extra iterator, which could lead to potential aliasing problems. – Surt Nov 16 '17 at 06:40
  • Aliasing with what? `double` can't alias with an iterator that uses integer / pointer types. Compiler output for a function with a vector arg by reference: https://godbolt.org/g/mN4fYQ. It still optimizes fully with `-fno-strict-aliasing`, so it wasn't type-based aliasing rules that let the compiler detect it, it's that the iterator is private local variable. It's easy for the compiler to prove that nothing else has a pointer to the internal iterator. – Peter Cordes Nov 16 '17 at 06:48