Efficient SSE FP `floor()` / `ceil()` / `round()` Rounding Functions Without SSE4.1?

Question

How can I round a __m128 vector of floats up/down or to the nearest integer, like these functions?

Round - roundf()
Ceil - ceilf() or SSE4.1 _mm_ceil_ps.
Floor - floorf() or SSE4.1 _mm_floor_ps.

I need to do this without SSE4.1 roundps (_mm_floor_ps / _mm_ceil_ps / _mm_round_ps(x, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC). roundps can also truncate toward zero, but I don't need that for this application.

I can use SSE3 and earlier. (No SSSE3 or SSE4)

So the function declaration would be something like:

__m128 RoundSse( __m128 x ), __m128 CeilSse( __m128 x ) and __m128 FloorSse( __m128 x ).

Starting point would be https://www.reddit.com/r/programming/comments/1p2yys/sse3_optimized_vector_floor_ceil_round_and_mod/. Though it uses By Reference instead of By Value. The code there also yields issues with VS 2017. — Royi, Jan 03 '19 at 12:43
Are you sure you actually *need* `round`, and the IEEE default rounding mode wouldn't work for you? `rintf` or `nearbyintf` give you that, while `round` uses a special rounding mode that x86 doesn't have in hardware. [round() for float in C++](https://stackoverflow.com/a/47347224). (It can be emulated with a few instructions on top of `roundps` if you have SSE4.1, so if you can emulate `roundps` to nearest you can still emulate `round()`, but it's probably going to be slower.) — Peter Cordes, Jan 03 '19 at 12:53
Google for "sse_mathfun" - it's a useful library which includes the above functions and many others. — Paul R, Jan 03 '19 at 13:12
@PaulR ,I'd rather have an implementation here so I will understand how it works. Also will be great reference for other who search for it. — Royi, Jan 03 '19 at 17:02
Right, but why are you asking for a vectorized `roundf` instead of a vectorized `rintf`? If you want to understand how to implement `round()` in terms of floor/ceil, see @njuffa's scalar code on [round() for float in C++](https://stackoverflow.com/a/47329084). It's easy enough to handle those conditionals branchlessly with SIMD, similar to hose they're written with ternary operators. Or [this deleted answer](https://stackoverflow.com/a/485549/224132) might be good, or the accepted answer is very simple but breaks for inputs like 0.49999999999999994. — Peter Cordes, Jan 03 '19 at 17:14
I think asking for `rintf` makes more sense here, but I guess you could ask for `roundf` *as well*. — Peter Cordes, Jan 03 '19 at 17:15
I think the blog you linked actually does implement `round` (away from 0 on tiebreak), because it's based on truncation from converting to integer and back. It calls it `_mm_round_ps2`, which is confusing because `_mm_round_ps(NEAREST)` implements `rintf`, not ISO C `round`. If you can assume your numbers are in a range where converting to int is save, `rint` is trivial to implement with packed-conversion to/from integer with default rounding, not truncation. — Peter Cordes, Jan 03 '19 at 17:45
See [DirectXMath](https://github.com/Microsoft/DirectXMath), specifically the implementation of ``XMVectorRound``, ``XMVectorFloor``, and ``XMVectorCeil`` in [this source file](https://github.com/Microsoft/DirectXMath/blob/master/Inc/DirectXMathVector.inl) — Chuck Walbourn, Jan 03 '19 at 17:45
@PeterCordes, The problem with the blog I linked the code won't work on VS 2017. I think also the code works on By Reference. — Royi, Jan 03 '19 at 18:01
It takes an arg by const-reference to work around MSVC limitations, and returns a value. That's totally normal and totally trivial to change, and goes away when inlining. If it's not portable to MSVC, then that should be easy to fix by cleaning up some of the weird choices (like pointer casting instead of `_mm_cast...`). — Peter Cordes, Jan 03 '19 at 18:06
@PeterCordes, I will post their code as WIKI. Could you assist optimizing it? — Royi, Jan 03 '19 at 18:11
How much do you care about corner cases such as `inf`, `nan`, as well as correct handling of values exactly halfway between integers (i.e. `x.5`)? — Mysticial, Apr 25 '19 at 15:41
I don't care about `nan` or `inf`. Care to elaborate on the `x.5` case? In `ceil()` and `floor()` I need to address it correctly. What do you suggest in the `round()` case? — Royi, Apr 25 '19 at 16:05
@Royi The `x.5` case is all the round-to-even stuff and whether 2.5 should be 2.0 or 3.0. C++ and IEEE specify it differently thus C++ `round()` is not allowed to use SSE4.1 round instruction without ffast-math. For single precision, adding and subtracting `3*2^22` will round the number to the nearest integer. But it may fail for things outside of the +/- `2^22` range. You also have no control over the `x.5` cases. It will also fail with ffast-math since the compiler will simply optimize it to a no-op. — Mysticial, Apr 25 '19 at 17:25

Royi · Answer 1 · 2019-01-03T19:55:35.607

1

I'm posting the code from http://dss.stephanierct.com/DevBlog/?p=8:

It should be adopted into By Value form (I just removed the & from the code, not sure it is OK):

static inline __m128 FloorSse(const __m128 x) {
    __m128i v0 = _mm_setzero_si128();
    __m128i v1 = _mm_cmpeq_epi32(v0, v0);
    __m128i ji = _mm_srli_epi32(v1, 25);
    __m128i tmp = _mm_slli_epi32(ji, 23); // I edited this (Added tmp) not sure about it
    __m128 j = _mm_castsi128_ps(tmp); //create vector 1.0f // I edited this not sure about it
    __m128i i = _mm_cvttps_epi32(x);
    __m128 fi = _mm_cvtepi32_ps(i);
    __m128 igx = _mm_cmpgt_ps(fi, x);
    j = _mm_and_ps(igx, j);
    return _mm_sub_ps(fi, j);
}

static inline __m128 CeilSse(const __m128 x) {
    __m128i v0 = _mm_setzero_si128();
    __m128i v1 = _mm_cmpeq_epi32(v0, v0);
    __m128i ji = _mm_srli_epi32(v1, 25);
    __m128i tmp = _mm_slli_epi32(ji, 23); // I edited this (Added tmp) not sure about it
    __m128 j = _mm_castsi128_ps(tmp); //create vector 1.0f // I edited this not sure about it
    __m128i i = _mm_cvttps_epi32(x);
    __m128 fi = _mm_cvtepi32_ps(i);
    __m128 igx = _mm_cmplt_ps(fi, x);
    j = _mm_and_ps(igx, j);
    return _mm_add_ps(fi, j);
}

static inline __m128 RoundSse(const __m128 a) {
    __m128 v0 = _mm_setzero_ps();             //generate the highest value &lt; 2
    __m128 v1 = _mm_cmpeq_ps(v0, v0);
    __m128i tmp = _mm_castps_si128(v1); // I edited this (Added tmp) not sure about it
    tmp = _mm_srli_epi32(tmp, 2); // I edited this (Added tmp) not sure about it
    __m128 vNearest2 = _mm_castsi128_ps(tmp); // I edited this (Added tmp) not sure about it
    __m128i i = _mm_cvttps_epi32(a);
    __m128 aTrunc = _mm_cvtepi32_ps(i);        // truncate a
    __m128 rmd = _mm_sub_ps(a, aTrunc);        // get remainder
    __m128 rmd2 = _mm_mul_ps(rmd, vNearest2); // mul remainder by near 2 will yield the needed offset
    __m128i rmd2i = _mm_cvttps_epi32(rmd2);    // after being truncated of course
    __m128 rmd2Trunc = _mm_cvtepi32_ps(rmd2i);
    __m128 r = _mm_add_ps(aTrunc, rmd2Trunc);
    return r;
}


inline __m128 ModSee(const __m128 a, const __m128 aDiv) {
    __m128 c = _mm_div_ps(a, aDiv);
    __m128i i = _mm_cvttps_epi32(c);
    __m128 cTrunc = _mm_cvtepi32_ps(i);
    __m128 base = _mm_mul_ps(cTrunc, aDiv);
    __m128 r = _mm_sub_ps(a, base);
    return r;
}

edited Jan 03 '19 at 19:55

answered Jan 03 '19 at 18:17

Royi

4,640
6
46
64

You left out the extra info from the blog that these aren't totally safe outside the range where converting to an integer and back works. The block has safe wrappers for those. According to the reddit discussions (https://www.reddit.com/r/programming/comments/1p2yys/sse3_optimized_vector_floor_ceil_round_and_mod/) this updated version of the code does actually work correctly for the whole range of inputs that Bruce Dawson tested. – Peter Cordes Jan 03 '19 at 18:22
I also tried to switch it into By Value form. I'm not sure I did it correctly. Also VS 2017 has some issues with few things in the code form the blog. I tried fixing it as well but again I'm not sure I did it correctly. This is why I put it as WIKI so people can edit it and improve it. If you want, post as an answer of yours. I will mark it and copy it into the Wiki. – Royi Jan 03 '19 at 18:27
Changing it to by-value is literally as easy as removing the `&`. There's no need to do that, though. What you should change is `*(__m128*)(&tmp)` should be `_mm_castsi128_ps(tmp)`, and similar. Those are a bad idea, even though `__m128` is a may-alias type. – Peter Cordes Jan 03 '19 at 18:52
@PeterCordes, I did removed the `&`. I just wasn't familiar with the effect of it in that context hence didn't know if it works. I also used the cast functions you recommended. What do you think now? – Royi Jan 03 '19 at 19:56
Oh, I forgot this was a C question. Are you not familiar with C++ references? It's no different from any other object, as long as your calling convention supports `__m128` args by value. Anyway, you should probably simplify the constants to replace the silly `cmpeq` and integer stuff with `_mm_set1_ps(1.0)` or `1.999999whatever`. The author of this code went overboard trying to outsmart the compiler and prevent it from loading constants from memory, but used `cmpps` instead of `pcmpeqd` to create an all-ones bit-pattern. It's questionable if it's worth generating these constants on the fly. – Peter Cordes Jan 03 '19 at 20:54
@PeterCordes, I don't see any `cmpps` or `pcmpeqd` in the code. Indeed I'm looking for `C` code and not `C++`. – Royi Jan 04 '19 at 08:00
`_mm_cmpeq_ps` is an intrinsic for `cmpps`. They're using it to do `0.0 == 0.0` to generate a vector of all ones, even less efficiently than with the integer equivalent. – Peter Cordes Jan 04 '19 at 08:40
Is the needed pattern has 4 values of 1 or 1 at the bit level? Could you edit the WIKI where you think it should? I prefer clarity over having better performance. – Royi Jan 04 '19 at 12:01

Efficient SSE FP `floor()` / `ceil()` / `round()` Rounding Functions Without SSE4.1?

1 Answers1

Linked