
This can be done with the SSE4.1 intrinsics `_mm_floor_pd` and `_mm_ceil_pd`, which translate into `roundpd xmm, xmm, 1` and `roundpd xmm, xmm, 2`.

What is the optimum way to calculate floor and ceil of a pair of doubles using only SSE/SSE2/SSE3?
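
For reference, on SSE4.1 hardware the whole job is the two intrinsics named above. A minimal sketch for comparison (not part of the question; assumes the compiler is allowed to use SSE4.1, e.g. via -msse4.1):

#include <cstdio>
#include <smmintrin.h> // SSE4.1

int main() {
    __m128d v = _mm_set_pd(3.1, 4.6);   // high lane = 3.1, low lane = 4.6
    __m128d f = _mm_floor_pd(v);        // one roundpd (round toward -inf)
    __m128d c = _mm_ceil_pd(v);         // one roundpd (round toward +inf)
    std::printf("floor: %f %f\n", _mm_cvtsd_f64(_mm_unpackhi_pd(f, f)), _mm_cvtsd_f64(f));
    std::printf("ceil:  %f %f\n", _mm_cvtsd_f64(_mm_unpackhi_pd(c, c)), _mm_cvtsd_f64(c));
    return 0;
}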

  • If your values won't overflow int32_t, you might be able to use packed-conversion to int and back. For non-negative values, trunc = floor so you could maybe save/restore the sign bit around `cvttpd2dq` / `cvtdq2pd`. Or it might be faster to add and subtract a huge positive (or negative) number, but I think that gets you nearest, not floor/ceil, without other tricks. – Peter Cordes Apr 15 '20 at 16:06
  • May be this can help you [Fast Rounding of Floating Point Numbers...](http://ldesoras.free.fr/doc/articles/rounding_en.pdf)? Also one can look at assemble output of modern compilers [floor/ceil at godbolt](https://godbolt.org/z/ajkQBk). – DevO Apr 15 '20 at 16:17
  • Related: [Efficient SSE FP \`floor()\` / \`ceil()\` / \`round()\` Rounding Functions Without SSE4.1?](https://stackoverflow.com/q/54022478) - that's single-precision and it converts to integer and back (so it's not even safe for the full range of float). Can you take advantage of any limited-range factors like magnitude < 2^53+1 or something? Related: [How to efficiently perform double/int64 conversions with SSE/AVX?](https://stackoverflow.com/q/41144668) has packed conversion to int64 / uint64 with bithacks. – Peter Cordes Apr 15 '20 at 16:29
  • What I meant to ask was precisely for doubles. Title edited – Iliugi Garner Apr 15 '20 at 17:11
  • [floor/ceil for double vector2 at godbolt](https://godbolt.org/z/RZZn7Z) – DevO Apr 15 '20 at 17:14
  • @DevO: interesting that GCC and MSVC inline them, but note that they unpack to scalar so they can `cvttsd2si rax, xmm0`. Oh, but that's after checking the absolute value. If the magnitude isn't less than 2^52, no work is required because the double is definitely already an integer. That may be the best we can do without SSE4.1 for the general case, so if the OP wants to go faster they'll have to take advantage of restrictions for their use-case. – Peter Cordes Apr 15 '20 at 17:49
  • For `x>=0`, you can round to the nearest integer by adding then subtracting `4503599627370496` (= `1LL<<52`). Directly computing ceil/floor by just adding/subtracting (something close to) 0.5 at any step is not possible, I think, because you may always end up at some split point which rounds in the non-intended direction. But you can compare the rounded result with the input value and add or don't add `+/- 1.0` (see the scalar sketch after these comments). – chtz Apr 16 '20 at 06:20
  • How bad is this floor: `cvttpd2dq xmm1,xmm0` / `cvtdq2pd xmm2,xmm1` / `cmpneqpd xmm2,xmm0` / `pxor xmm3,xmm3` / `cmpltpd xmm0,xmm3` / `andpd xmm2,xmm0` / `paddq xmm2,xmm1` / `cvtdq2ps xmm0,xmm2`? – Iliugi Garner Apr 16 '20 at 06:52
  • @IliugiGarner [`cvttpd2dq`](https://www.felixcloutier.com/x86/cvttpd2dq) will fail for values larger than 2^31. Afaik, before AVX512 there is no direct packed conversion between `double` and `int64`. – chtz Apr 16 '20 at 09:22
  • @chtz of course it will, my bad. The only way I could think of then is scalar dp + shuffles. – Iliugi Garner Apr 16 '20 at 11:41
  • Do you actually need a generic solution, or do you know if your input values will be in some range? Do you need high throughput or low latency? Or small code size? – chtz Apr 16 '20 at 13:07
  • Looking for a generic solution. Preferences: 1) low latency, 2) small code.
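
A scalar sketch of the rounding trick from the comment above (add and subtract 2^52, then compare and correct). This is only an illustration, not part of the discussion: it assumes 0 <= x < 2^52, strict FP semantics (no -ffast-math), and double-precision evaluation (SSE2 math, not x87); the name floor_sketch is made up for the example.

#include <cstdio>

// Adding 2^52 pushes x into the range where the spacing between doubles is 1,
// so the sum is rounded to an integer; subtracting 2^52 brings it back.
// If that round-to-nearest went up, step back down by 1.0 to get floor.
double floor_sketch(double x) {               // valid for 0 <= x < 2^52
    const double two52 = 4503599627370496.0;  // 1LL << 52
    double rounded = (x + two52) - two52;     // round-to-nearest(x); -ffast-math folds this away
    return rounded > x ? rounded - 1.0 : rounded;
}

int main() {
    std::printf("%f %f\n", floor_sketch(2.5), floor_sketch(3.999)); // 2.000000 3.000000
    return 0;
}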

1 Answer


Here is code that does floor/ceil on pre-SSE4.1 CPUs. Please note that compiling with `-ffast-math` will break it!

#include <cmath>
#include <emmintrin.h>
#include <cstdio> // for printf

#ifdef _MSC_VER
#define __attribute__(P)
#endif

struct vec2d {
    double x;
    double y;
};

// Emulates SSE4.1 _mm_blendv_pd: selects b where the mask bits are set, a elsewhere.
static __m128d mm_blendv_pd(__m128d a, __m128d b, __m128d mask) noexcept {
  return _mm_or_pd(_mm_andnot_pd(mask, a), _mm_and_pd(b, mask));
}

__attribute__((optimize("-fno-associative-math")))
vec2d _floor(vec2d v) noexcept {
  __m128d src = _mm_set_pd(v.x, v.y);                    // v.x in the high lane, v.y in the low lane
  __m128d two52 = _mm_set1_pd(4503599627370496.0);       // pow(2, 52): doubles this large are already integers
  __m128d magic = _mm_or_pd(two52, _mm_and_pd(src, _mm_set1_pd(-0.0))); // copysign(2^52, src)
  __m128d absx = _mm_andnot_pd(_mm_set1_pd(-0.0), src);  // |src|
  __m128d msk = _mm_cmpnlt_pd(absx, two52);              // |src| >= 2^52 or NaN: pass the input through
  __m128d rounded = _mm_sub_pd(_mm_add_pd(src, magic), magic); //! round to nearest; -ffast-math will break this!
  __m128d maybeone = _mm_and_pd(_mm_cmplt_pd(src, rounded), _mm_set1_pd(1.0)); // 1.0 where rounding went up
  __m128d res = mm_blendv_pd(_mm_sub_pd(rounded, maybeone), src, msk);
  return {_mm_cvtsd_f64(_mm_unpackhi_pd(res, res)), _mm_cvtsd_f64(res)};
}

__attribute__((optimize("-fno-associative-math")))
vec2d _ceil(vec2d v) noexcept {
  __m128d src = _mm_set_pd(v.x, v.y);                    // v.x in the high lane, v.y in the low lane
  __m128d two52 = _mm_set1_pd(4503599627370496.0);       // pow(2, 52): doubles this large are already integers
  __m128d magic = _mm_or_pd(two52, _mm_and_pd(src, _mm_set1_pd(-0.0))); // copysign(2^52, src)
  __m128d absx = _mm_andnot_pd(_mm_set1_pd(-0.0), src);  // |src|
  __m128d msk = _mm_cmpnlt_pd(absx, two52);              // |src| >= 2^52 or NaN: pass the input through
  __m128d rounded = _mm_sub_pd(_mm_add_pd(src, magic), magic); //! round to nearest; -ffast-math will break this!
  __m128d maybeone = _mm_and_pd(_mm_cmpnle_pd(src, rounded), _mm_set1_pd(1.0)); // 1.0 where rounding went down
  __m128d res = mm_blendv_pd(_mm_add_pd(rounded, maybeone), src, msk);
  return {_mm_cvtsd_f64(_mm_unpackhi_pd(res, res)), _mm_cvtsd_f64(res)};
}


int main() {
    vec2d v{3.1,4.6};

    vec2d v2 = _floor(v);
    vec2d v3 = _ceil(v);

    printf(" v2: %f %f\n",v2.x,v2.y);
    printf(" v3: %f %f\n",v3.x,v3.y);

    return 0;
}

Live Code

It is based on the blog post C++ Compilers and FP Rounding on X86, but the code there has bugs.

– DevO