In contrast to the manually-optimized approaches presented in wim's and Mike's great answers, let's also have a quick look at what a completely vanilla C++ implementation would give us:
std::transform(addon, addon + count, canvas, canvas, std::plus<void>());
Try it out here. You'll see that even without any real effort on your part, the compiler is already able to produce vectorized code that is quite good, given that it cannot make any assumptions concerning the alignment and size of your buffers, and that there are also potential aliasing issues (due to the use of uint8_t, which, unfortunately, forces the compiler to assume that the pointer may alias any other object). Also note that the code is basically identical to what you'd get from a C-style implementation (depending on the compiler, the C++ version has a few instructions more or a few instructions less):
void f(uint16_t* canvas, const uint8_t* addon, size_t count)
{
for (size_t i = 0; i < count; ++i)
canvas[i] += addon[i];
}
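As a side note on the aliasing issue mentioned above: on common implementations, std::uint8_t is simply an alias for unsigned char, and an unsigned char lvalue is allowed to alias objects of any type. The tiny, purely illustrative example below (the function name is made up) shows what that means for the compiler:

#include <cstdint>

void add_two_elements(std::uint16_t* canvas, const std::uint8_t* addon)
{
    canvas[0] += addon[0];
    // Without further information, the compiler must assume that the store to
    // canvas[0] above may have modified *addon (the two buffers could overlap),
    // so it cannot hoist the load of addon[1] above that store; the same
    // reasoning limits how freely it can reorder and batch loads in a loop.
    canvas[1] += addon[1];
}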
However, the generic C++ solution works on any combination of container and element types, as long as the element types can be added. So, as also pointed out in the other answers, while it is certainly possible to get a slightly more efficient implementation through manual optimization, one can go a long way just by writing plain C++ code (if done right). Before resorting to manually writing SSE intrinsics, consider that a generic C++ solution is more flexible, easier to maintain and, above all, more portable. By simply flipping the target architecture switch, you can have it produce code of similar quality not only for SSE, but also for AVX, or even ARM with NEON and whatever other instruction sets you may happen to want to run on. If you need your code to be perfect down to the last instruction for one particular use case on one particular CPU, then yes, intrinsics or even inline assembly is probably the way to go. But in general, I would suggest instead focusing on writing your C++ code in a way that enables and guides the compiler to generate the assembly you want, rather than generating the assembly yourself.
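As a quick illustration of the genericity mentioned above, the very same one-liner also adds, say, a buffer of float into a buffer of double without any changes to the call itself; the function name and the concrete types in this little sketch are made up purely for illustration:

#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical example: same std::transform call, different containers and element types.
void add_float_into_double(std::vector<double>& canvas, const std::vector<float>& addon)
{
    // assumes addon.size() <= canvas.size()
    std::transform(addon.begin(), addon.end(), canvas.begin(), canvas.begin(), std::plus<void>());
}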
Coming back to the concrete uint8_t/uint16_t case: by using, for example, the (non-standard but generally available) restrict qualifier and borrowing the trick of letting the compiler know that your count is always a multiple of 32
void f(std::uint16_t* __restrict__ canvas, const std::uint8_t* __restrict__ addon, std::size_t count)
{
assert(count % 32 == 0);
count = count & -32;
std::transform(addon, addon + count, canvas, canvas, std::plus<void>());
}
you get (-std=c++17 -DNDEBUG -O3 -mavx):
f(unsigned short*, unsigned char const*, unsigned long):
and rdx, -32
je .LBB0_3
xor eax, eax
.LBB0_2: # =>This Inner Loop Header: Depth=1
vpmovzxbw xmm0, qword ptr [rsi + rax] # xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpmovzxbw xmm1, qword ptr [rsi + rax + 8] # xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpmovzxbw xmm2, qword ptr [rsi + rax + 16] # xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpmovzxbw xmm3, qword ptr [rsi + rax + 24] # xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
vpaddw xmm0, xmm0, xmmword ptr [rdi + 2*rax]
vpaddw xmm1, xmm1, xmmword ptr [rdi + 2*rax + 16]
vpaddw xmm2, xmm2, xmmword ptr [rdi + 2*rax + 32]
vpaddw xmm3, xmm3, xmmword ptr [rdi + 2*rax + 48]
vmovdqu xmmword ptr [rdi + 2*rax], xmm0
vmovdqu xmmword ptr [rdi + 2*rax + 16], xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 32], xmm2
vmovdqu xmmword ptr [rdi + 2*rax + 48], xmm3
add rax, 32
cmp rdx, rax
jne .LBB0_2
.LBB0_3:
ret
which is really not bad…
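For completeness, here is a minimal, self-contained test driver one might use to sanity-check the function above; the buffer size and fill values are made up, and f is repeated verbatim so the snippet compiles on its own:

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

void f(std::uint16_t* __restrict__ canvas, const std::uint8_t* __restrict__ addon, std::size_t count)
{
    assert(count % 32 == 0);
    count = count & -32;
    std::transform(addon, addon + count, canvas, canvas, std::plus<void>());
}

int main()
{
    constexpr std::size_t count = 1024;           // multiple of 32, as f expects
    std::vector<std::uint8_t>  addon(count, 3);   // arbitrary test values
    std::vector<std::uint16_t> canvas(count, 40);

    f(canvas.data(), addon.data(), count);

    // every element should now hold 40 + 3
    assert(std::all_of(canvas.begin(), canvas.end(),
                       [](std::uint16_t v) { return v == 43; }));
}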