First, as an example, a generic negate function for vectors of arithmetic types:
#include <type_traits>
#include <vector>
...
template <typename arithmetic_type> std::vector<arithmetic_type> &
negate (std::vector<arithmetic_type> & v)
{
    static_assert(std::is_arithmetic<arithmetic_type>::value,
                  "negate: not an arithmetic type vector");

    for (auto & vi : v) vi = - vi;

    // note: a range-based for over contiguous, unit-stride memory may be more
    // amenable to loop unrolling, vectorization, etc., since it leaves fewer
    // template transforms for the compiler to see through. in theory,
    // std::transform may generate the same code, despite being less concise.
    // very large vectors *may* benefit from C++17's std::execution::par_unseq
    // policy.

    return v;
}
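If you want to experiment with the std::transform / execution-policy route mentioned in the comment, a minimal sketch might look like this (assuming a C++17 standard library that actually ships the parallel algorithms; negate_transform is a name chosen here purely for illustration):

#include <algorithm>
#include <execution>
#include <type_traits>
#include <vector>
...
template <typename arithmetic_type> std::vector<arithmetic_type> &
negate_transform (std::vector<arithmetic_type> & v)
{
    static_assert(std::is_arithmetic<arithmetic_type>::value,
                  "negate: not an arithmetic type vector");

    // element-wise negation; par_unseq merely *permits* vectorized and
    // parallel execution, so whether it beats the plain loop for a given
    // vector size and implementation has to be measured.
    std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                   [] (arithmetic_type x) { return - x; });
    return v;
}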
Your wish for a canonical unary operator - function is going to require the creation of a temporary, in the form:
std::vector<double> operator - (const std::vector<double> & v)
{
    auto ret (v);
    return negate(ret);
}
Or generically:
template <typename arithmetic_type> std::vector<arithmetic_type>
operator - (const std::vector<arithmetic_type> & v)
{
    auto ret (v);
    return negate(ret);
}
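Used this way, negation reads like the corresponding mathematics and leaves the operand untouched (a small hypothetical usage sketch):

std::vector<double> v { 1.0, -2.0, 3.0 };
auto w = - v;   // w holds { -1.0, 2.0, -3.0 }; v is unchanged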
Do not be tempted to implement the operator as:
template <typename arithmetic_type> std::vector<arithmetic_type> &
operator - (std::vector<arithmetic_type> & v)
{
    return negate(v);
}
While (- v) will negate the elements and return the modified vector without the need for a temporary, it breaks mathematical conventions by effectively setting: v = - v;
If that's your goal, then use the negate function. Don't break expected operator semantics!
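For illustration, here is the kind of surprise the reference-returning version would produce (a hypothetical sketch, assuming that operator were defined):

std::vector<double> v { 1.0, 2.0 };
auto w = - v;   // looks like a read-only use of v, but with the
                // reference-returning operator, v is now { -1.0, -2.0 },
                // and w is just a copy of the already-negated vector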
clang, with AVX-512 enabled, generates this inner loop, negating an impressive 64 doubles per iteration (between the pre/post length-handling code):
vpbroadcastq LCPI0_0(%rip), %zmm0
.p2align 4, 0x90
LBB0_21:
vpxorq -448(%rsi), %zmm0, %zmm1
vpxorq -384(%rsi), %zmm0, %zmm2
vpxorq -320(%rsi), %zmm0, %zmm3
vpxorq -256(%rsi), %zmm0, %zmm4
vmovdqu64 %zmm1, -448(%rsi)
vmovdqu64 %zmm2, -384(%rsi)
vmovdqu64 %zmm3, -320(%rsi)
vmovdqu64 %zmm4, -256(%rsi)
vpxorq -192(%rsi), %zmm0, %zmm1
vpxorq -128(%rsi), %zmm0, %zmm2
vpxorq -64(%rsi), %zmm0, %zmm3
vpxorq (%rsi), %zmm0, %zmm4
vmovdqu64 %zmm1, -192(%rsi)
vmovdqu64 %zmm2, -128(%rsi)
vmovdqu64 %zmm3, -64(%rsi)
vmovdqu64 %zmm4, (%rsi)
addq $512, %rsi ## imm = 0x200
addq $-64, %rdx
jne LBB0_21
gcc-7.2.0 generates a similar loop, but appears to insist on indexed addressing.