Here’s a modification of Cássio Renan’s excellent answer. It replaces all compiler-specific extensions with standard C++ and is, in theory, portable to any conforming compiler. In addition, it checks that the arguments are properly aligned rather than assuming so. It optimizes to the same code.
#include <assert.h>
#include <cmath>
#include <stddef.h>
#include <stdint.h>

#define ALIGNMENT alignof(max_align_t)

using std::floor;

// Compiled with: -std=c++17 -Wall -Wextra -Wpedantic -Wconversion -fno-trapping-math -O -march=cannonlake -mprefer-vector-width=512

void testFunction(const float in[], int32_t out[], const ptrdiff_t length)
{
    static_assert(sizeof(float) == sizeof(int32_t), "");
    assert((uintptr_t)(void*)in % ALIGNMENT == 0);
    assert((uintptr_t)(void*)out % ALIGNMENT == 0);
    assert((size_t)length % (ALIGNMENT/sizeof(int32_t)) == 0);

    alignas(ALIGNMENT) const float* const input = in;
    alignas(ALIGNMENT) int32_t* const output = out;

    // Do the conversion
    for (int i = 0; i < length; ++i) {
        output[i] = static_cast<int32_t>(floor(input[i]));
    }
}
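Here is a minimal, hypothetical usage sketch (not part of the program above): the buffers are aligned with alignas and the length is a multiple of ALIGNMENT/sizeof(int32_t), so the assertions hold.

// Hypothetical caller, for illustration only.
alignas(ALIGNMENT) static float testIn[64];
alignas(ALIGNMENT) static int32_t testOut[64];

int main()
{
    for (int i = 0; i < 64; ++i)
        testIn[i] = 0.75f * static_cast<float>(i) - 16.0f;  // arbitrary test data
    testFunction(testIn, testOut, 64);  // 64 is a multiple of ALIGNMENT/sizeof(int32_t)
    return testOut[0] == -16 ? 0 : 1;   // floor(-16.0f) == -16
}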
This doesn’t optimize quite as nicely on GCC as the original, which used non-portable extensions. The C++ standard does support an alignas specifier, references to aligned arrays, and a std::align function that returns an aligned range within a buffer. None of these, however, make any compiler I tested generate aligned instead of unaligned vector loads and stores.
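For illustration, here is a minimal sketch of the std::align facility mentioned above; the helper name aligned_within is made up and is not part of the tested program.

#include <cstddef>
#include <memory>

// Returns a pointer inside [buffer, buffer + space) that is aligned to
// `alignment` with room for `size` bytes, or nullptr if it will not fit.
// std::align adjusts its pointer and space arguments in place.
void* aligned_within(void* buffer, std::size_t space,
                     std::size_t alignment, std::size_t size)
{
    void* p = buffer;
    return std::align(alignment, size, p, space);
}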
Although alignof(max_align_t) is only 16 on x86_64, and it is possible to define ALIGNMENT as the constant 64, this doesn’t help any compiler generate better code, so I went for portability. The closest thing to a portable way to force the compiler to assume a pointer is aligned would be to use the types from <immintrin.h>, which most compilers for x86 support, or to define a struct with an alignas specifier. By checking predefined macros, you could also expand a macro to __attribute__ ((aligned (ALIGNMENT))) on Linux compilers, or __declspec (align (ALIGNMENT)) on Windows compilers, and to something safe on a compiler we don’t know about, but GCC needs the attribute on a type to actually generate aligned loads and stores.
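A sketch of that kind of macro dispatch might look like the following; FORCE_ALIGNED and alignedBlock are made-up names for illustration, not part of the program above.

// Hypothetical portability shim: pick an alignment attribute based on
// predefined macros, falling back to standard alignas otherwise.
#if defined(__GNUC__) || defined(__clang__)
#  define FORCE_ALIGNED(N) __attribute__ ((aligned (N)))
#elif defined(_MSC_VER)
#  define FORCE_ALIGNED(N) __declspec (align (N))
#else
#  define FORCE_ALIGNED(N) alignas(N)
#endif

// GCC wants the attribute on a type before it generates aligned vector
// loads and stores, so wrap a block of elements in an aligned struct:
struct FORCE_ALIGNED(64) alignedBlock {
    float v[16];   // 64 bytes, one ZMM register's worth of floats
};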
Additionally, the original example called a built-in to tell GCC that it was impossible for length not to be a multiple of 32. If you assert() this or call a standard function such as abort(), neither GCC, Clang nor ICC will make the same deduction. Therefore, most of the code they generate will handle the case where length is not a nice round multiple of the vector width.
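For reference, one common way to express that hint with a GCC/Clang built-in is the following sketch; it is a reconstruction, not necessarily the exact built-in the original answer called.

// Non-portable hint: promise the optimizer that length is a multiple of 32,
// so it can drop the scalar tail loop. __builtin_unreachable() is a
// GCC/Clang extension, not standard C++.
if (length % 32 != 0)
    __builtin_unreachable();

Unlike assert(), which compiles to a runtime check (or to nothing with NDEBUG), this form gives the optimizer information it may act on even in release builds.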
A likely reason for this is that neither optimization gains you much speed: unaligned memory instructions with aligned addresses are fast on Intel CPUs, and the code to handle the case where length is not a nice round number is a few bytes long and runs in constant time.
As a footnote, GCC is able to optimize inline functions from <cmath> better than the macros implemented in <math.h>.
GCC 9.1 needs a particular set of options to generate AVX512 code. By default, even with -march=cannonlake, it will prefer 256-bit vectors. It needs the -mprefer-vector-width=512 option to generate 512-bit code. (Thanks to Peter Cordes for pointing this out.) It follows up the vectorized loop with unrolled code to convert any leftover elements of the array.
Here’s the vectorized main loop, minus some constant-time initialization, error-checking and clean-up code that will only run once:
.L7:
vrndscaleps zmm0, ZMMWORD PTR [rdi+rax], 1
vcvttps2dq zmm0, zmm0
vmovdqu32 ZMMWORD PTR [rsi+rax], zmm0
add rax, 64
cmp rax, rcx
jne .L7
The eagle-eyed will notice two differences from the code generated by Cássio Renan’s program: it uses %zmm instead of %ymm registers, and it stores the results with an unaligned vmovdqu32 rather than an aligned vmovdqa64.
Clang 8.0.0 with the same flags makes different choices about unrolling loops. Each iteration operates on eight 512-bit vectors (that is, 128 single-precision floats), but the code to pick up leftovers is not unrolled. If there are at least 64 floats left over after that, it uses another four AVX512 instructions for those, and then cleans up any extras with an unvectorized loop.
If you compile the original program in Clang++, it will accept it without complaint, but won’t make the same optimizations: it will still not assume that the length is a multiple of the vector width, nor that the pointers are aligned. It prefers AVX512 code to AVX256, even without -mprefer-vector-width=512.
test rdx, rdx
jle .LBB0_14
cmp rdx, 63
ja .LBB0_6
xor eax, eax
jmp .LBB0_13
.LBB0_6:
mov rax, rdx
and rax, -64
lea r9, [rax - 64]
mov r10, r9
shr r10, 6
add r10, 1
mov r8d, r10d
and r8d, 1
test r9, r9
je .LBB0_7
mov ecx, 1
sub rcx, r10
lea r9, [r8 + rcx]
add r9, -1
xor ecx, ecx
.LBB0_9: # =>This Inner Loop Header: Depth=1
vrndscaleps zmm0, zmmword ptr [rdi + 4*rcx], 9
vrndscaleps zmm1, zmmword ptr [rdi + 4*rcx + 64], 9
vrndscaleps zmm2, zmmword ptr [rdi + 4*rcx + 128], 9
vrndscaleps zmm3, zmmword ptr [rdi + 4*rcx + 192], 9
vcvttps2dq zmm0, zmm0
vcvttps2dq zmm1, zmm1
vcvttps2dq zmm2, zmm2
vmovups zmmword ptr [rsi + 4*rcx], zmm0
vmovups zmmword ptr [rsi + 4*rcx + 64], zmm1
vmovups zmmword ptr [rsi + 4*rcx + 128], zmm2
vcvttps2dq zmm0, zmm3
vmovups zmmword ptr [rsi + 4*rcx + 192], zmm0
vrndscaleps zmm0, zmmword ptr [rdi + 4*rcx + 256], 9
vrndscaleps zmm1, zmmword ptr [rdi + 4*rcx + 320], 9
vrndscaleps zmm2, zmmword ptr [rdi + 4*rcx + 384], 9
vrndscaleps zmm3, zmmword ptr [rdi + 4*rcx + 448], 9
vcvttps2dq zmm0, zmm0
vcvttps2dq zmm1, zmm1
vcvttps2dq zmm2, zmm2
vcvttps2dq zmm3, zmm3
vmovups zmmword ptr [rsi + 4*rcx + 256], zmm0
vmovups zmmword ptr [rsi + 4*rcx + 320], zmm1
vmovups zmmword ptr [rsi + 4*rcx + 384], zmm2
vmovups zmmword ptr [rsi + 4*rcx + 448], zmm3
sub rcx, -128
add r9, 2
jne .LBB0_9
test r8, r8
je .LBB0_12
.LBB0_11:
vrndscaleps zmm0, zmmword ptr [rdi + 4*rcx], 9
vrndscaleps zmm1, zmmword ptr [rdi + 4*rcx + 64], 9
vrndscaleps zmm2, zmmword ptr [rdi + 4*rcx + 128], 9
vrndscaleps zmm3, zmmword ptr [rdi + 4*rcx + 192], 9
vcvttps2dq zmm0, zmm0
vcvttps2dq zmm1, zmm1
vcvttps2dq zmm2, zmm2
vcvttps2dq zmm3, zmm3
vmovups zmmword ptr [rsi + 4*rcx], zmm0
vmovups zmmword ptr [rsi + 4*rcx + 64], zmm1
vmovups zmmword ptr [rsi + 4*rcx + 128], zmm2
vmovups zmmword ptr [rsi + 4*rcx + 192], zmm3
.LBB0_12:
cmp rax, rdx
je .LBB0_14
.LBB0_13: # =>This Inner Loop Header: Depth=1
vmovss xmm0, dword ptr [rdi + 4*rax] # xmm0 = mem[0],zero,zero,zero
vroundss xmm0, xmm0, xmm0, 9
vcvttss2si ecx, xmm0
mov dword ptr [rsi + 4*rax], ecx
add rax, 1
cmp rdx, rax
jne .LBB0_13
.LBB0_14:
pop rax
vzeroupper
ret
.LBB0_7:
xor ecx, ecx
test r8, r8
jne .LBB0_11
jmp .LBB0_12
ICC 19 also generates AVX512 instructions, but very different ones from clang’s. It does more set-up with magic constants, but does not unroll any loops, operating instead on 512-bit vectors.
This code also works on other compilers and architectures. (Although MSVC only supports the ISA up to AVX2 and cannot auto-vectorize the loop.) On ARM with -march=armv8-a+simd, for example, it generates a vectorized loop with frintm v0.4s, v0.4s and fcvtzs v0.4s, v0.4s.
Try it for yourself.