Is there a faster (from a performance perspective) way than simply doing the following, where x is a std::vector<double>?
std::vector<double> y;
y.reserve(x.size());
for (size_t i = 0; i < x.size(); ++i)
    y.push_back(std::exp(x[i]));
If you require maximum precision to the nearest ULP, this is likely as fast as you're going to get.
If you can accept some approximation errors, there are much faster methods that use SIMD.
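To give an idea of what that can look like (this is my own illustration, not something from the question): one well-known cheap approximation is Schraudolph's trick of constructing the IEEE-754 bit pattern of e^x directly from a linear function of x. It is only accurate to within a few percent and only valid for roughly |x| < 700, but the body is branch-free, so the compiler can auto-vectorize the loop. The names fast_exp_approx and exp_fast are just placeholders:
#include <cstdint>
#include <cstring>
#include <vector>

// Schraudolph-style approximation (illustrative sketch): build the IEEE-754
// bit pattern of e^x from a linear function of x. Error is on the order of a
// few percent; only meaningful for roughly |x| < 700.
inline double fast_exp_approx(double x) {
    constexpr double a = 1048576.0 / 0.6931471805599453;  // 2^20 / ln(2)
    constexpr double b = 1072693248.0 - 60801.0;          // 1023 * 2^20, minus an error-centering bias
    const auto hi = static_cast<std::int64_t>(a * x + b); // becomes the high 32 bits of the double
    const std::int64_t bits = hi << 32;                   // low 32 bits stay zero
    double result;
    std::memcpy(&result, &bits, sizeof result);           // bit-cast without aliasing UB
    return result;
}

auto exp_fast(std::vector<double> xs) -> std::vector<double> {
    for (auto& x : xs) { x = fast_exp_approx(x); }        // branch-free body: vectorizable
    return xs;
}
Production-quality SIMD exp routines (in libmvec, SLEEF, Intel's VML, and similar libraries) instead use range reduction plus a polynomial and stay accurate to within a few ULP, but the underlying idea is the same: a branch-free body that maps onto vector instructions. Even if you keep calling std::exp exactly, though, there is a little to be gained from how the output vector is filled: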
push_back, surprisingly, has a little bit of overhead: it doesn't actually know that you've reserved enough space, so it has to check the capacity (and potentially reallocate) on every call. Because this check can change control flow between loop iterations, push_back precludes automatic vectorization by the compiler.
Consider these two functions, where the first one uses push_back, while the second one modifies a copy (or moved-into value) in place:
#include <cmath>
#include <vector>

auto exp1(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>{};
    ys.reserve(xs.size());
    for (auto x : xs) { ys.push_back(std::exp(x)); }
    return ys;
}
auto exp2(std::vector<double> xs) -> std::vector<double> {
    for (auto& x : xs) { x = std::exp(x); }
    return xs;
}
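As a side note (not part of the original question, and assuming exp2 as defined above plus #include <utility>): because exp2 takes its argument by value, a caller that no longer needs the input can move it in, so no copy of the elements is made and no new allocation happens:
auto xs = std::vector<double>(1'000'000, 0.5);  // some input data
auto ys = exp2(std::move(xs));                  // reuses xs's buffer; xs is left in a valid but unspecified state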
We'll look at the assembly output when these are compiled with GCC 9.1 using
gcc -std=c++17 -O3 -march=skylake-avx512
Here is exp1's inner loop (embedded in quite a bit of additional code which will never be executed, because you've already reserved):
.L45:
add rbx, 8
vmovsd QWORD PTR [r14], xmm0
add r14, 8
cmp r12, rbx
je .L44
.L18:
vmovsd xmm0, QWORD PTR [rbx]
call exp
vmovsd QWORD PTR [rsp], xmm0
cmp rbp, r14
jne .L45
And here's exp2's:
.L53:
vmovsd xmm0, QWORD PTR [rbx]
add rbx, 8
call exp
vmovsd QWORD PTR [rbx-8], xmm0
cmp rbp, rbx
jne .L53
In practice, they are basically the same, because exp is complicated and GCC doesn't know how to automatically vectorize it. However, consider the case where something much simpler happens in the inner loop:
auto sq1(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>{};
    ys.reserve(xs.size());
    for (auto x : xs) { ys.push_back(x * x); }
    return ys;
}
auto sq2(std::vector<double> xs) -> std::vector<double> {
    for (auto& x : xs) { x *= x; }
    return xs;
}
Here's sq1's inner loop:
.L89:
vmovsd QWORD PTR [rsi], xmm0
add rbx, 8
add rsi, 8
mov QWORD PTR [rsp+24], rsi
cmp rbp, rbx
je .L72
.L75:
vmovsd xmm0, QWORD PTR [rbx]
mov rsi, QWORD PTR [rsp+24]
vmulsd xmm0, xmm0, xmm0
vmovsd QWORD PTR [rsp+8], xmm0
cmp rsi, QWORD PTR [rsp+32]
jne .L89
Here's sq2's. Note that it uses vmulpd and ymm registers, and that it advances by 32 bytes (four doubles) per iteration rather than 8.
.L11:
vmovupd ymm0, YMMWORD PTR [rdx]
add rdx, 32
vmulpd ymm0, ymm0, ymm0
vmovupd YMMWORD PTR [rdx-32], ymm0
cmp rdx, rcx
jne .L11
Of course, this inner-loop snippet is a little misleading: it hides an immense amount of code used to deal with the remainder of the std::vector if its size does not divide evenly by 4. Still, my main point is that yes, you actually can do marginally better than reserve + push_back (this surprised me quite a bit when I first found out), and that it would be significantly better if we weren't dealing with exp in particular.
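To make that remainder handling concrete, here is a hand-rolled sketch of sq2 written with AVX intrinsics (my own illustration, assuming a target with AVX, e.g. the -march=skylake-avx512 build above): a main loop that squares four doubles per 256-bit register, followed by a scalar tail for the last size() % 4 elements. It is roughly the shape of what GCC generates, without the extra prologue/epilogue bookkeeping the compiler emits.
#include <immintrin.h>  // AVX intrinsics
#include <cstddef>
#include <vector>

auto sq_avx(std::vector<double> xs) -> std::vector<double> {
    double* p = xs.data();
    const std::size_t n = xs.size();
    std::size_t i = 0;
    // Main loop: square four doubles (32 bytes) per iteration in a ymm register.
    for (; i + 4 <= n; i += 4) {
        const __m256d v = _mm256_loadu_pd(p + i);
        _mm256_storeu_pd(p + i, _mm256_mul_pd(v, v));
    }
    // Scalar tail: the remaining size() % 4 elements.
    for (; i < n; ++i) {
        p[i] *= p[i];
    }
    return xs;
}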