Here’s a modification of Cássio Renan’s excellent answer. It replaces all compiler-specific extensions with standard C++ and is, in theory, portable to any conforming compiler. In addition, it checks that the arguments are properly aligned rather than assuming so. It optimizes to the same code.
#include <assert.h>
#include <cmath>
#include <stddef.h>
#include <stdint.h>

#define ALIGNMENT alignof(max_align_t)

using std::floor;

// Compiled with: -std=c++17 -Wall -Wextra -Wpedantic -Wconversion -fno-trapping-math -O -march=cannonlake -mprefer-vector-width=512

void testFunction(const float in[], int32_t out[], const ptrdiff_t length)
{
    static_assert(sizeof(float) == sizeof(int32_t), "");
    assert((uintptr_t)(void*)in % ALIGNMENT == 0);
    assert((uintptr_t)(void*)out % ALIGNMENT == 0);
    assert((size_t)length % (ALIGNMENT/sizeof(int32_t)) == 0);

    alignas(ALIGNMENT) const float* const input = in;
    alignas(ALIGNMENT) int32_t* const output = out;

    // Do the conversion
    for (int i = 0; i < length; ++i) {
        output[i] = static_cast<int32_t>(floor(input[i]));
    }
}
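Here is a minimal, hypothetical usage sketch (not part of the program above): the buffers are aligned with alignas and the length is a multiple of ALIGNMENT/sizeof(int32_t), so the assertions hold.

// Hypothetical caller, for illustration only.
alignas(ALIGNMENT) static float testIn[64];
alignas(ALIGNMENT) static int32_t testOut[64];

int main()
{
    for (int i = 0; i < 64; ++i)
        testIn[i] = 0.75f * static_cast<float>(i) - 16.0f;  // arbitrary test data
    testFunction(testIn, testOut, 64);  // 64 is a multiple of ALIGNMENT/sizeof(int32_t)
    return testOut[0] == -16 ? 0 : 1;   // floor(-16.0f) == -16
}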
This doesn’t optimize quite as nicely on GCC as the original, which used non-portable extensions. The C++ standard does support an alignas specifier, references to aligned arrays, and a std::align function that returns an aligned range within a buffer. None of these, however, make any compiler I tested generate aligned instead of unaligned vector loads and stores.
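For illustration, here is a minimal sketch of the std::align facility mentioned above; the helper name aligned_within is made up and is not part of the tested program.

#include <cstddef>
#include <memory>

// Returns a pointer inside [buffer, buffer + space) that is aligned to
// `alignment` with room for `size` bytes, or nullptr if it will not fit.
// std::align adjusts its pointer and space arguments in place.
void* aligned_within(void* buffer, std::size_t space,
                     std::size_t alignment, std::size_t size)
{
    void* p = buffer;
    return std::align(alignment, size, p, space);
}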
Although alignof(max_align_t) is only 16 on x86_64, and it is possible to define ALIGNMENT as the constant 64, this doesn’t help any compiler generate better code, so I went for portability. The closest thing to a portable way to force the compiler to assume a pointer is aligned would be to use the types from <immintrin.h>, which most compilers for x86 support, or to define a struct with an alignas specifier. By checking predefined macros, you could also expand a macro to __attribute__ ((aligned (ALIGNMENT))) on Linux compilers, or __declspec (align (ALIGNMENT)) on Windows compilers, and to something safe on a compiler we don’t know about, but GCC needs the attribute on a type to actually generate aligned loads and stores.
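A sketch of that kind of macro dispatch might look like the following; FORCE_ALIGNED and alignedBlock are made-up names for illustration, not part of the program above.

// Hypothetical portability shim: pick an alignment attribute based on
// predefined macros, falling back to standard alignas otherwise.
#if defined(__GNUC__) || defined(__clang__)
#  define FORCE_ALIGNED(N) __attribute__ ((aligned (N)))
#elif defined(_MSC_VER)
#  define FORCE_ALIGNED(N) __declspec (align (N))
#else
#  define FORCE_ALIGNED(N) alignas(N)
#endif

// GCC wants the attribute on a type before it generates aligned vector
// loads and stores, so wrap a block of elements in an aligned struct:
struct FORCE_ALIGNED(64) alignedBlock {
    float v[16];   // 64 bytes, one ZMM register's worth of floats
};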
Additionally, the original example called a built-in to tell GCC that it was impossible for length not to be a multiple of 32. If you assert() this or call a standard function such as abort(), neither GCC, Clang nor ICC will make the same deduction. Therefore, most of the code they generate will handle the case where length is not a nice round multiple of the vector width.
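For reference, one common way to express that hint with a GCC/Clang built-in is the following sketch; it is a reconstruction, not necessarily the exact built-in the original answer called.

// Non-portable hint: promise the optimizer that length is a multiple of 32,
// so it can drop the scalar tail loop. __builtin_unreachable() is a
// GCC/Clang extension, not standard C++.
if (length % 32 != 0)
    __builtin_unreachable();

Unlike assert(), which compiles to a runtime check (or to nothing with NDEBUG), this form gives the optimizer information it may act on even in release builds.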
A likely reason for this is that neither optimization gains you much speed: unaligned memory instructions with aligned addresses are fast on Intel CPUs, and the code to handle the case where length is not a nice round number is a few bytes long and runs in constant time.
As a footnote, GCC is able to optimize inline functions from <cmath> better than the macros implemented in <math.h>.
GCC 9.1 needs a particular set of options to generate AVX512 code. By default, even with -march=cannonlake, it will prefer 256-bit vectors. It needs the -mprefer-vector-width=512 option to generate 512-bit code. (Thanks to Peter Cordes for pointing this out.) It follows up the vectorized loop with unrolled code to convert any leftover elements of the array.
Here’s the vectorized main loop, minus some constant-time initialization, error-checking and clean-up code that will only run once:
.L7:
vrndscaleps zmm0, ZMMWORD PTR [rdi+rax], 1
vcvttps2dq zmm0, zmm0
vmovdqu32 ZMMWORD PTR [rsi+rax], zmm0
add rax, 64
cmp rax, rcx
jne .L7
The eagle-eyed will notice two differences from the code generated by Cássio Renan’s program: it uses %zmm instead of %ymm registers, and it stores the results with an unaligned vmovdqu32 rather than an aligned vmovdqa64.
Clang 8.0.0 with the same flags makes different choices about unrolling loops. Each iteration operates on eight 512-bit vectors (that is, 128 single-precision floats), but the code to pick up leftovers is not unrolled. If there are at least 64 floats left over after that, it uses another four AVX512 instructions for those, and then cleans up any extras with an unvectorized loop.
If you compile the original program in Clang++, it will accept it without complaint, but won’t make the same optimizations: it will still not assume that the length is a multiple of the vector width, nor that the pointers are aligned. It prefers AVX512 code to AVX256, even without -mprefer-vector-width=512.
test rdx, rdx
jle .LBB0_14
cmp rdx, 63
ja .LBB0_6
xor eax, eax
jmp .LBB0_13
.LBB0_6:
mov rax, rdx
and rax, -64
lea r9, [rax - 64]
mov r10, r9
shr r10, 6
add r10, 1
mov r8d, r10d
and r8d, 1
test r9, r9
je .LBB0_7
mov ecx, 1
sub rcx, r10
lea r9, [r8 + rcx]
add r9, -1
xor ecx, ecx
.LBB0_9: # =>This Inner Loop Header: Depth=1
vrndscaleps zmm0, zmmword ptr [rdi + 4*rcx], 9
vrndscaleps zmm1, zmmword ptr [rdi + 4*rcx + 64], 9
vrndscaleps zmm2, zmmword ptr [rdi + 4*rcx + 128], 9
vrndscaleps zmm3, zmmword ptr [rdi + 4*rcx + 192], 9
vcvttps2dq zmm0, zmm0
vcvttps2dq zmm1, zmm1
vcvttps2dq zmm2, zmm2
vmovups zmmword ptr [rsi + 4*rcx], zmm0
vmovups zmmword ptr [rsi + 4*rcx + 64], zmm1
vmovups zmmword ptr [rsi + 4*rcx + 128], zmm2
vcvttps2dq zmm0, zmm3
vmovups zmmword ptr [rsi + 4*rcx + 192], zmm0
vrndscaleps zmm0, zmmword ptr [rdi + 4*rcx + 256], 9
vrndscaleps zmm1, zmmword ptr [rdi + 4*rcx + 320], 9
vrndscaleps zmm2, zmmword ptr [rdi + 4*rcx + 384], 9
vrndscaleps zmm3, zmmword ptr [rdi + 4*rcx + 448], 9
vcvttps2dq zmm0, zmm0
vcvttps2dq zmm1, zmm1
vcvttps2dq zmm2, zmm2
vcvttps2dq zmm3, zmm3
vmovups zmmword ptr [rsi + 4*rcx + 256], zmm0
vmovups zmmword ptr [rsi + 4*rcx + 320], zmm1
vmovups zmmword ptr [rsi + 4*rcx + 384], zmm2
vmovups zmmword ptr [rsi + 4*rcx + 448], zmm3
sub rcx, -128
add r9, 2
jne .LBB0_9
test r8, r8
je .LBB0_12
.LBB0_11:
vrndscaleps zmm0, zmmword ptr [rdi + 4*rcx], 9
vrndscaleps zmm1, zmmword ptr [rdi + 4*rcx + 64], 9
vrndscaleps zmm2, zmmword ptr [rdi + 4*rcx + 128], 9
vrndscaleps zmm3, zmmword ptr [rdi + 4*rcx + 192], 9
vcvttps2dq zmm0, zmm0
vcvttps2dq zmm1, zmm1
vcvttps2dq zmm2, zmm2
vcvttps2dq zmm3, zmm3
vmovups zmmword ptr [rsi + 4*rcx], zmm0
vmovups zmmword ptr [rsi + 4*rcx + 64], zmm1
vmovups zmmword ptr [rsi + 4*rcx + 128], zmm2
vmovups zmmword ptr [rsi + 4*rcx + 192], zmm3
.LBB0_12:
cmp rax, rdx
je .LBB0_14
.LBB0_13: # =>This Inner Loop Header: Depth=1
vmovss xmm0, dword ptr [rdi + 4*rax] # xmm0 = mem[0],zero,zero,zero
vroundss xmm0, xmm0, xmm0, 9
vcvttss2si ecx, xmm0
mov dword ptr [rsi + 4*rax], ecx
add rax, 1
cmp rdx, rax
jne .LBB0_13
.LBB0_14:
pop rax
vzeroupper
ret
.LBB0_7:
xor ecx, ecx
test r8, r8
jne .LBB0_11
jmp .LBB0_12
ICC 19 also generates AVX512 instructions, but very different ones from clang’s. It does more set-up with magic constants, but does not unroll any loops, operating instead on 512-bit vectors.
This code also works on other compilers and architectures. (Although MSVC only supports the ISA up to AVX2 and cannot auto-vectorize the loop.) On ARM with -march=armv8-a+simd, for example, it generates a vectorized loop with frintm v0.4s, v0.4s and fcvtzs v0.4s, v0.4s.
Try it for yourself.