2

My raw data is a set of C arrays of (unsigned) char (8 bit), each of length > 1000000. I want to add them together (vector addition), following the rule in the code below. The result is a C array of (unsigned) short (16 bit).

I have read through the SSE and AVX/AVX2 intrinsics, but I only found a similar operation that multiplies two 256-bit registers: the four even-indexed 32-bit elements are multiplied pairwise, and each pair of 32-bit inputs produces a 64-bit result, so the four products fit into a 256-bit register (_mm256_mul_epi32, _mm256_mul_epu32).

Figure:

https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX
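For illustration, a small sketch (assuming AVX2) of what that multiply does: it only uses the even-indexed 32-bit elements and widens them into 64-bit products, which is not the 8-bit to 16-bit widening add I need:

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main()
{
    // Arguments go from the highest element (e7) down to the lowest (e0);
    // _mm256_mul_epu32 only reads the even-indexed 32-bit elements e0, e2, e4, e6.
    __m256i a = _mm256_set_epi32(0, 7, 0, 5, 0, 3, 0, 2);
    __m256i b = _mm256_set_epi32(0, 11, 0, 13, 0, 17, 0, 19);
    __m256i p = _mm256_mul_epu32(a, b);   // four 64-bit products: 2*19, 3*17, 5*13, 7*11

    uint64_t out[4];
    _mm256_storeu_si256((__m256i*)out, p);
    for (int i = 0; i < 4; i++)
        std::printf("%llu\n", static_cast<unsigned long long>(out[i]));   // 38, 51, 65, 77
    return 0;
}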

Sample code:

static inline void adder(uint16_t *canvas, uint8_t *addon, uint64_t count)
{
    for (uint64_t i=0; i<count; i++)
        canvas[i] += static_cast<uint16_t>(addon[i]);
}

Thanks

khanh
  • Just a suggestion: write it out in reasonably efficient-looking C++ code, compile it and have a look at the generated assembly. My money is that you'll find an optimising compiler's code hard to beat. – Bathsheba Jan 17 '19 at 10:29
  • Do it in C, and as short as possible. It is very probable that compiler (and optimizer) will find a good code for you (you may need to specify which is the target and minimum supported processor). – Giacomo Catenazzi Jan 17 '19 at 10:34
  • Do what @Bathsheba suggests. Even if it's non-optimal, it will be your baseline for your future perf tests. – YSC Jan 17 '19 at 10:39
  • Are you sure that `canvas[i]` is large enough to hold consecutive additions? I'd expect to see canvas overflowing pretty soon. – dsp_user Jan 17 '19 at 10:48
  • @dsp_user There is only 1 addition per element. – Gerhardh Jan 17 '19 at 11:53

3 Answers

5

Adding onto @wim's answer (which is a good one) and taking @Bathsheba's comment into account: it is well worthwhile both trusting the compiler and examining what your compiler outputs, both to learn how to do this and to check that it is doing what you'd want. Running a slightly modified version of your code through godbolt (for MSVC, gcc and clang) gives some imperfect results.

This is especially true if you limit yourself to SSE2 and below, which this answer assumes (and which is what I tested).

All three compilers vectorise and unroll the code, use punpcklbw to 'unpack' the uint8_t's into uint16_t's, and then run a SIMD add and store. This is good. However, MSVC tends to spill unnecessarily in the inner loop, and clang only uses punpcklbw and not punpckhbw, which means it loads the source data twice. GCC gets the SIMD part right but has higher overhead in its loop handling.

So, theoretically, if you wanted to improve on these versions you could roll your own using intrinsics, which would look something like this:

static inline void adder2(uint16_t *canvas, uint8_t *addon, uint64_t count)
{
    uint64_t count32 = (count / 32) * 32;
    __m128i zero = _mm_set_epi32(0, 0, 0, 0);
    uint64_t i = 0;
    for (; i < count32; i+= 32)
    {
        uint8_t* addonAddress = (addon + i);

        // Load data 32 bytes at a time and widen the input
        // to uint16_t's into 4 temporary xmm registers.
        __m128i input = _mm_loadu_si128((__m128i*)(addonAddress + 0));
        __m128i temp1 = _mm_unpacklo_epi8(input, zero);
        __m128i temp2 = _mm_unpackhi_epi8(input, zero);
        __m128i input2 = _mm_loadu_si128((__m128i*)(addonAddress + 16));
        __m128i temp3 = _mm_unpacklo_epi8(input2, zero);
        __m128i temp4 = _mm_unpackhi_epi8(input2, zero);

        // Load data we need to update
        uint16_t* canvasAddress = (canvas + i);
        __m128i canvas1 = _mm_loadu_si128((__m128i*)(canvasAddress + 0));
        __m128i canvas2 = _mm_loadu_si128((__m128i*)(canvasAddress + 8));
        __m128i canvas3 = _mm_loadu_si128((__m128i*)(canvasAddress + 16));
        __m128i canvas4 = _mm_loadu_si128((__m128i*)(canvasAddress + 24));

        // Update the values
        __m128i output1 = _mm_add_epi16(canvas1, temp1);
        __m128i output2 = _mm_add_epi16(canvas2, temp2);
        __m128i output3 = _mm_add_epi16(canvas3, temp3);
        __m128i output4 = _mm_add_epi16(canvas4, temp4);

        // Store the values
        _mm_storeu_si128((__m128i*)(canvasAddress + 0), output1);
        _mm_storeu_si128((__m128i*)(canvasAddress + 8), output2);
        _mm_storeu_si128((__m128i*)(canvasAddress + 16), output3);
        _mm_storeu_si128((__m128i*)(canvasAddress + 24), output4);
    }

    // Mop up
    for (; i<count; i++)
        canvas[i] += static_cast<uint16_t>(addon[i]);
}

Examining the output of this, it is strictly better than what any of gcc/clang/MSVC produce. So if you want to squeeze out the absolute last drop of performance (and have a fixed architecture) then something like the above is a possibility. However, it's a really small improvement, as the compilers already handle this almost perfectly, so I'd actually recommend not doing this and just trusting the compiler.

If you do think you can improve on the compiler, remember to always test and profile to make sure you actually are.
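For example, a rough (untested) timing-harness sketch, assuming adder and adder2 from above are in the same translation unit, could look like:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Rough sketch of a timing harness. A real benchmark should also warm the
// caches, repeat the measurement several times and check that both versions
// produce identical results.
static void benchmark(const char* name, void (*fn)(uint16_t*, uint8_t*, uint64_t))
{
    const uint64_t count = 1000000;
    std::vector<uint16_t> canvas(count, 0);
    std::vector<uint8_t>  addon(count, 1);

    const auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; rep++)
        fn(canvas.data(), addon.data(), count);
    const auto stop = std::chrono::steady_clock::now();

    std::printf("%s: %lld us\n", name, static_cast<long long>(
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()));
}

// Usage (somewhere after adder/adder2 are defined):
//   benchmark("scalar",     adder);
//   benchmark("intrinsics", adder2);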

Mike Vine
4

Indeed the comments are right: the compiler can do the vectorization for you. I have modified your code a bit to improve the auto-vectorization. With gcc -O3 -march=haswell -std=c++14 (gcc version 8.2), the following code:

#include <cstdint>
#include <immintrin.h>

void cvt_uint8_int16(uint16_t * __restrict__ canvas, uint8_t * __restrict__ addon, int64_t count) {
    int64_t i;
    /* If you know that count is always a multiple of 32 then insert     */
    /* count = count & 0xFFFFFFFFFFFFFFE0u;                              */
    /* This leads to cleaner code. Now assume count is a multiple of 32: */
    count = count & 0xFFFFFFFFFFFFFFE0u;
    for (i = 0; i < count; i++){
        canvas[i] += static_cast<uint16_t>(addon[i]);
    }
}

compiles to:

cvt_uint8_int16(unsigned short*, unsigned char*, long):
        and     rdx, -32
        jle     .L5
        add     rdx, rsi
.L3:
        vmovdqu ymm2, YMMWORD PTR [rsi]
        add     rsi, 32
        add     rdi, 64
        vextracti128    xmm1, ymm2, 0x1
        vpmovzxbw       ymm0, xmm2
        vpaddw  ymm0, ymm0, YMMWORD PTR [rdi-64]
        vpmovzxbw       ymm1, xmm1
        vpaddw  ymm1, ymm1, YMMWORD PTR [rdi-32]
        vmovdqu YMMWORD PTR [rdi-64], ymm0
        vmovdqu YMMWORD PTR [rdi-32], ymm1
        cmp     rdx, rsi
        jne     .L3
        vzeroupper
.L5:

Clang produces code which is a bit different: it loads 128-bit (char) vectors and converts them with vpmovzxbw. gcc loads 256-bit (char) vectors and converts the upper and the lower 128 bits separately, which is probably slightly less efficient. Nevertheless, your problem is likely bandwidth limited anyway (since length > 1000000).

You can also vectorize the code with intrinsics (not tested):

void cvt_uint8_int16_with_intrinsics(uint16_t * __restrict__ canvas, uint8_t * __restrict__ addon, int64_t count) {
    int64_t i;
    /* Assume count is a multiple of 16 */
    for (i = 0; i < count; i=i+16){
        __m128i x     = _mm_loadu_si128((__m128i*)&addon[i]);
        __m256i y     = _mm256_loadu_si256((__m256i*)&canvas[i]);
        __m256i x_u16 = _mm256_cvtepu8_epi16(x);
        __m256i sum   = _mm256_add_epi16(y, x_u16);
                _mm256_storeu_si256((__m256i*)&canvas[i], sum);
    }
}

This leads to similar results as the auto-vectorized code.
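For comparison, the 256-bit-load strategy that gcc chose looks roughly as follows with intrinsics (again only a sketch, not tested; it assumes AVX2, the same headers as above, and that count is a multiple of 32):

void cvt_uint8_int16_gcc_style(uint16_t * __restrict__ canvas, uint8_t * __restrict__ addon, int64_t count) {
    int64_t i;
    /* Sketch of gcc's approach: one 256 bit load, then widen the lower and  */
    /* upper 128 bit halves separately. Assume count is a multiple of 32.    */
    for (i = 0; i < count; i = i + 32){
        __m256i bytes  = _mm256_loadu_si256((__m256i*)&addon[i]);
        __m128i lo     = _mm256_castsi256_si128(bytes);
        __m128i hi     = _mm256_extracti128_si256(bytes, 1);
        __m256i lo_u16 = _mm256_cvtepu8_epi16(lo);
        __m256i hi_u16 = _mm256_cvtepu8_epi16(hi);
        __m256i sum_lo = _mm256_add_epi16(lo_u16, _mm256_loadu_si256((__m256i*)&canvas[i]));
        __m256i sum_hi = _mm256_add_epi16(hi_u16, _mm256_loadu_si256((__m256i*)&canvas[i+16]));
                _mm256_storeu_si256((__m256i*)&canvas[i],    sum_lo);
                _mm256_storeu_si256((__m256i*)&canvas[i+16], sum_hi);
    }
}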

wim
  • Nice point about being bandwidth limited with such simple calculations anyway. Although the existence of SMT and also power usage mean that doing less even when you are bandwidth limited is still worthwhile. OT: I'd like to find a profiler which can help here - all the ones I use are strictly "how fast does this run" and don't help with being a good citizen of the CPU. – Mike Vine Jan 17 '19 at 15:42
  • GCC's code-gen might be good on Intel's Sunny Cove (successor to Ice Lake), which can do 2x load + 2x store per clock, and will have 2 vector shuffle ports. So using 3 shuffles to unpack a 256-bit vector load might be a win vs. 2x `vpmovzxbw ymm0, [rsi +0 / +16]`. Memory-source `vpmovzx ymm` can't micro-fuse either, only the xmm version can, on Skylake, so using fewer instructions with 2 loads doesn't help front-end throughput. It sucks that there's no version of `vpmovz/sx ymm, ymm` that reads from the high lane, though! That would be a really nice alternative to in-lane `vpunpckh ymm` – Peter Cordes Jan 17 '19 at 22:26
  • Certainly, Sunny Cove is promising, but Cannon Lake was also promising, [back in 2014.](https://wccftech.com/intels-cannonlake-10nm-microarchitecture-due-2016-compatible-union-bay-union-point-pch/) Currently a few Intel NUCs with core i3 8121u already exist. So, unfortunately, it might take a while before the production of Sunny Cove will start, but it will be a huge step forward. – wim Jan 18 '19 at 15:19
4

In contrast to the manually-optimized approaches presented in wim's and Mike's great answers, let's also have a quick look at what a completely vanilla C++ implementation would give us:

std::transform(addon, addon + count, canvas, canvas, std::plus<void>());

Try it out here. You'll see that, even without any real effort on your part, the compiler is already able to produce vectorized code that is quite good given that it cannot make any assumptions concerning the alignment and size of your buffers, and that there are also some potential aliasing issues (the use of uint8_t, unfortunately, forces the compiler to assume that the pointer may alias any other object). Also, note that the code is basically identical to what you'd get from a C-style implementation (depending on the compiler, the C++ version has a few instructions more or a few instructions less):

void f(uint16_t* canvas, const uint8_t* addon, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        canvas[i] += addon[i];
}

However, the generic C++ solution works on any combination of container and element types, as long as the element types can be added (a short illustration follows after the assembly listing below). So, as also pointed out in the other answers, while it is certainly possible to get a slightly more efficient implementation from manual optimization, one can go a long way just by writing plain C++ code (if done right). Before resorting to manually writing SSE intrinsics, consider that a generic C++ solution is more flexible, easier to maintain and, especially, more portable. By the simple flip of the target architecture switch, you can have it produce code of similar quality not only for SSE, but for AVX, or even for ARM with NEON and whatever other instruction sets you may happen to want to run on. If you need your code to be perfect down to the last instruction for one particular use case on one particular CPU, then yes, intrinsics or even inline assembly is probably the way to go.

But in general, I would also suggest focusing on writing your C++ code in a way that enables and guides the compiler to generate the assembly you want, rather than generating the assembly yourself. For example, by using the (non-standard but generally available) restrict qualifier and borrowing the trick of letting the compiler know that your count is always a multiple of 32

void f(std::uint16_t* __restrict__ canvas, const std::uint8_t* __restrict__ addon, std::size_t count)
{
    assert(count % 32 == 0);
    count = count & -32;
    std::transform(addon, addon + count, canvas, canvas, std::plus<void>());
}

you get (-std=c++17 -DNDEBUG -O3 -mavx)

f(unsigned short*, unsigned char const*, unsigned long):    
        and     rdx, -32
        je      .LBB0_3
        xor     eax, eax
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpmovzxbw       xmm0, qword ptr [rsi + rax] # xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
        vpmovzxbw       xmm1, qword ptr [rsi + rax + 8] # xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
        vpmovzxbw       xmm2, qword ptr [rsi + rax + 16] # xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
        vpmovzxbw       xmm3, qword ptr [rsi + rax + 24] # xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
        vpaddw  xmm0, xmm0, xmmword ptr [rdi + 2*rax]
        vpaddw  xmm1, xmm1, xmmword ptr [rdi + 2*rax + 16]
        vpaddw  xmm2, xmm2, xmmword ptr [rdi + 2*rax + 32]
        vpaddw  xmm3, xmm3, xmmword ptr [rdi + 2*rax + 48]
        vmovdqu xmmword ptr [rdi + 2*rax], xmm0
        vmovdqu xmmword ptr [rdi + 2*rax + 16], xmm1
        vmovdqu xmmword ptr [rdi + 2*rax + 32], xmm2
        vmovdqu xmmword ptr [rdi + 2*rax + 48], xmm3
        add     rax, 32
        cmp     rdx, rax
        jne     .LBB0_2
.LBB0_3:
        ret

which is really not bad…
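Just to illustrate the flexibility point from above: the very same std::transform line also works, unchanged, for other element and container combinations, for example (a sketch only; the function name accumulate and the chosen types are purely illustrative):

#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Sketch: the same generic formulation, reused to accumulate uint16_t deltas
// into a uint32_t canvas held in a std::vector (assumes addon.size() <= canvas.size()).
void accumulate(std::vector<std::uint32_t>& canvas, const std::vector<std::uint16_t>& addon)
{
    std::transform(addon.begin(), addon.end(), canvas.begin(), canvas.begin(),
                   std::plus<>());
}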

Michael Kenzel
  • @wim's answer isn't "manually optimized", it's just using the OP's loop directly. Using the same masking as you did to tell the compiler it's a whole number of aligned vectors. It showed gcc's AVX2 code-gen while you're showing clang's AVX1 code-gen (where it uses indexed addressing modes for memory operands that defeat micro-fusion for AVX on Intel Sandybridge/Haswell/Skylake. Partly defeating the purpose of unrolling. [Micro fusion and addressing modes](https://stackoverflow.com/q/26046634). But nothing you can do about that code-gen choice in the source code.) – Peter Cordes Jan 17 '19 at 22:30
  • And BTW, no the asm you linked from `gcc -O3 -mavx2` is not good. It uses `-mavx256-split-unaligned-load` from tune=generic to create a YMM with `vmovdqu xmm` + `vinserti128 ymm, [mem]`, and then unpacks that YMM with `vextracti128` so it can `vpmovzx`. This makes the shuffle bottleneck way worse. (This is [gcc bug 82136](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136).) You should compile with `-march=haswell`, not just `-mavx2` to avoid this kind of mess. – Peter Cordes Jan 17 '19 at 22:36