Is there a faster (from a performance perspective) way than simply doing the following, where x is a std::vector<double>?
std::vector<double> y;
y.reserve(x.size());
for (size_t i = 0; i < x.size(); ++i)
    y.push_back(std::exp(x[i]));
If you require maximum precision to the nearest ULP, this is likely as fast as you're going to get.
If you can accept some approximation errors, there are much faster methods that use SIMD.
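To give an idea of what that can look like (this is my own illustration, not something from the question): one well-known cheap approximation is Schraudolph's trick of constructing the IEEE-754 bit pattern of e^x directly from a linear function of x. It is only accurate to within a few percent and only valid for roughly |x| < 700, but the body is branch-free, so the compiler can auto-vectorize the loop. The names fast_exp_approx and exp_fast are just placeholders:
#include <cstdint>
#include <cstring>
#include <vector>

// Schraudolph-style approximation (illustrative sketch): build the IEEE-754
// bit pattern of e^x from a linear function of x. Error is on the order of a
// few percent; only meaningful for roughly |x| < 700.
inline double fast_exp_approx(double x) {
    constexpr double a = 1048576.0 / 0.6931471805599453;  // 2^20 / ln(2)
    constexpr double b = 1072693248.0 - 60801.0;          // 1023 * 2^20, minus an error-centering bias
    const auto hi = static_cast<std::int64_t>(a * x + b); // becomes the high 32 bits of the double
    const std::int64_t bits = hi << 32;                   // low 32 bits stay zero
    double result;
    std::memcpy(&result, &bits, sizeof result);           // bit-cast without aliasing UB
    return result;
}

auto exp_fast(std::vector<double> xs) -> std::vector<double> {
    for (auto& x : xs) { x = fast_exp_approx(x); }        // branch-free body: vectorizable
    return xs;
}
Production-quality SIMD exp routines (in libmvec, SLEEF, Intel's VML, and similar libraries) instead use range reduction plus a polynomial and stay accurate to within a few ULP, but the underlying idea is the same: a branch-free body that maps onto vector instructions. Even if you keep calling std::exp exactly, though, there is a little to be gained from how the output vector is filled: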
push_back, surprisingly, has a little bit of overhead: it doesn't actually know that you've reserved enough space, so it has to check the capacity (and potentially reallocate) on every call. Because this check can change control flow between loop iterations, push_back precludes automatic vectorization by the compiler.
Consider these two functions, where the first one uses push_back, while the second one modifies a copy (or moved-into value) in place:
#include <cmath>
#include <vector>

auto exp1(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>{};
    ys.reserve(xs.size());
    for (auto x : xs) { ys.push_back(std::exp(x)); }
    return ys;
}
auto exp2(std::vector<double> xs) -> std::vector<double> {
    for (auto& x : xs) { x = std::exp(x); }
    return xs;
}
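As a side note (not part of the original question, and assuming exp2 as defined above plus #include <utility>): because exp2 takes its argument by value, a caller that no longer needs the input can move it in, so no copy of the elements is made and no new allocation happens:
auto xs = std::vector<double>(1'000'000, 0.5);  // some input data
auto ys = exp2(std::move(xs));                  // reuses xs's buffer; xs is left in a valid but unspecified state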
We'll look at the assembly output when these are compiled with GCC 9.1 using
gcc -std=c++17 -O3 -march=skylake-avx512
Here is exp1's inner loop (embedded in quite a bit of additional code which will never be executed, because you've already reserved):
.L45:
add rbx, 8
vmovsd QWORD PTR [r14], xmm0
add r14, 8
cmp r12, rbx
je .L44
.L18:
vmovsd xmm0, QWORD PTR [rbx]
call exp
vmovsd QWORD PTR [rsp], xmm0
cmp rbp, r14
jne .L45
And here's exp2's:
.L53:
vmovsd xmm0, QWORD PTR [rbx]
add rbx, 8
call exp
vmovsd QWORD PTR [rbx-8], xmm0
cmp rbp, rbx
jne .L53
In practice, they are basically the same, because exp is complicated and GCC doesn't know how to automatically vectorize it. However, consider the case where something much simpler happens in the inner loop:
auto sq1(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>{};
    ys.reserve(xs.size());
    for (auto x : xs) { ys.push_back(x * x); }
    return ys;
}
auto sq2(std::vector<double> xs) -> std::vector<double> {
    for (auto& x : xs) { x *= x; }
    return xs;
}
Here's sq1's inner loop:
.L89:
vmovsd QWORD PTR [rsi], xmm0
add rbx, 8
add rsi, 8
mov QWORD PTR [rsp+24], rsi
cmp rbp, rbx
je .L72
.L75:
vmovsd xmm0, QWORD PTR [rbx]
mov rsi, QWORD PTR [rsp+24]
vmulsd xmm0, xmm0, xmm0
vmovsd QWORD PTR [rsp+8], xmm0
cmp rsi, QWORD PTR [rsp+32]
jne .L89
Here's sq2's. Note that it uses vmulpd and ymm registers, and that it advances by 32 bytes (four doubles) per iteration rather than 8.
.L11:
vmovupd ymm0, YMMWORD PTR [rdx]
add rdx, 32
vmulpd ymm0, ymm0, ymm0
vmovupd YMMWORD PTR [rdx-32], ymm0
cmp rdx, rcx
jne .L11
Of course, this inner-loop snippet is a little misleading: it hides an immense amount of code used to deal with the remainder of the std::vector if its size does not divide evenly by 4. Still, my main point is that yes, you actually can do marginally better than reserve + push_back (this surprised me quite a bit when I first found out), and that it would be significantly better if we weren't dealing with exp in particular.
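To make that remainder handling concrete, here is a hand-rolled sketch of sq2 written with AVX intrinsics (my own illustration, assuming a target with AVX, e.g. the -march=skylake-avx512 build above): a main loop that squares four doubles per 256-bit register, followed by a scalar tail for the last size() % 4 elements. It is roughly the shape of what GCC generates, without the extra prologue/epilogue bookkeeping the compiler emits.
#include <immintrin.h>  // AVX intrinsics
#include <cstddef>
#include <vector>

auto sq_avx(std::vector<double> xs) -> std::vector<double> {
    double* p = xs.data();
    const std::size_t n = xs.size();
    std::size_t i = 0;
    // Main loop: square four doubles (32 bytes) per iteration in a ymm register.
    for (; i + 4 <= n; i += 4) {
        const __m256d v = _mm256_loadu_pd(p + i);
        _mm256_storeu_pd(p + i, _mm256_mul_pd(v, v));
    }
    // Scalar tail: the remaining size() % 4 elements.
    for (; i < n; ++i) {
        p[i] *= p[i];
    }
    return xs;
}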