
Is there any SIMD way to ceil a float (round up) and then cast it to unsigned int without wrap (i.e. with any negative number becoming 0)?

Maybe _mm_cvttps_epi32? I'm not sure about the rounding and wrap behaviour though (I think I've got UB here):

__m128 indices = _mm_mul_ps(phaseIncrement.v, mMinTopFreq.v);
indices = sse_mathfun_log_ps(indices) / std::log(2.0f); // log2(x) = ln(x) / ln(2); GCC/clang broadcast the scalar divisor
__m128i finalIndices = _mm_cvttps_epi32(indices);       // truncates towards zero; out-of-range inputs give INT_MIN

Or is there some fancier way? I'm compiling with -O3 -march=nehalem -funsafe-math-optimizations -fno-omit-frame-pointer (so SSE4.1 is allowed).

EDIT: here's a revised version of the code, following Peter Cordes's suggestions:

__m128 indexesFP = _mm_mul_ps(phaseIncrement.v, mMinTopFreq.v);
indexesFP = sse_mathfun_log_ps(indexesFP) * (1.0f / std::log(2.0f)); // multiply by 1/ln(2) instead of dividing
indexesFP = _mm_ceil_ps(indexesFP);                                  // SSE4.1 roundps, rounding up
__m128i indexes = _mm_cvttps_epi32(indexesFP);                       // values are whole numbers, so truncation is exact
indexes = _mm_max_epi32(indexes, _mm_setzero_si128());               // clamp negatives (and INT_MIN from out-of-range) to 0
for (int i = 0; i < 4; i++) {
    int waveTableIndex = _mm_extract_epi32(indexes, i);              // needs a constant index; OK here since -O3 unrolls the loop
    waveTablesData[i] = &mWaveTables[waveTableIndex];
}

Any improvements that can be made?

  • In a single instruction? No, but arithmetic right shift by 31 (of the FP bit pattern if you do it before conversion) => `andn` will mask any negative values to zero. (Also any `INT_MIN` from out of range conversions). `roundps` will do the `ceil` part. And BTW, you'll want to multiply by `1.0 / log2` instead of actually doing runtime division. – Peter Cordes Feb 01 '22 at 08:30
  • Handling the full 32-bit unsigned range is a problem without introducing any extra rounding (e.g. to a multiple of 16 or something): you'd do `v -= 2^31` before conversion to range-shift 0..2^32-1 so it's centred around zero, then fix it up afterwards with an integer add. – Peter Cordes Feb 01 '22 at 08:30
  • @PeterCordes for "roundps" do you mean _mm_ceil_ps? – markzzz Feb 01 '22 at 08:42
  • Yes, that's the intrinsic for the [`roundps`](https://www.felixcloutier.com/x86/roundps) instruction with the appropriate immediate control operand. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=MMX,SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2&text=roundps – Peter Cordes Feb 01 '22 at 08:45
  • @PeterCordes since the range will be limited (such as [0, 16] in the extreme case), would using a right shift to mask negative values be faster than doing a sort of clamp to (0, 16) using _mm_max_ps and _mm_min_ps? – markzzz Feb 01 '22 at 08:55
  • Oh, this doesn't need to work for numbers greater than INT_MAX, up to UINT_MAX? That's much easier than the problem as stated in the question. Yeah just use a signed conversion to epi32 (`int32_t`) and use `_mm_min_epi32` for the upper limit, and probably `_mm_max_epi32` for the lower limit. (Only one instruction instead of shift/and). Integer min/max are cheaper (lower latency, and better throughput), so prefer clamp after conversion. Out-of-range floats convert to INT_MIN so will clamp to 0. – Peter Cordes Feb 01 '22 at 09:03
  • If you want the integer part of the `log2` of a floating point number, there is no need at all for any expensive `sse_mathfun_log_ps` functions. Just reinterpret the `float` vector as an `int` vector and shift the exponent bits to the right. If you multiply by `2^(-128)` before that, you get the bias-subtraction and capping at zero for free. This truncates though (see the scalar sketch after these comments). – chtz Feb 01 '22 at 09:31
  • @chtz I'm not sure about your suggestion. Any example/code? – markzzz Feb 01 '22 at 10:33
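
For illustration, here is a scalar sketch of the reinterpret-and-shift idea chtz describes, before vectorizing (the function name is mine; this is the plain version without the 2^(-128) multiply, which would fold the bias subtraction and the zero cap into the multiply itself). As chtz notes, this truncates, i.e. it computes floor(log2(x)):

#include <cstdint>
#include <cstring>

int floor_log2(float x)                    // assumes x is a positive, normal float
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // reinterpret the float as its bit pattern
    return (int)(bits >> 23) - 127;        // shift the exponent field down, undo the bias
}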

1 Answer


since the range will be limited (such as [0, 16] in the extreme case

Oh, this doesn't need to work for numbers greater than INT_MAX, up to UINT_MAX? That's much easier than the problem as stated at the top of the question. Yeah, just _mm_ceil_ps, then a signed conversion to epi32 (int32_t), using _mm_min_epi32 for the upper limit and probably _mm_max_epi32 for the lower limit. (Only one instruction instead of shift/and.)

Or possibly _mm_sub_ps to range-shift to -16..0, then _mm_cvttps_epi32 to truncate (upwards, towards zero, since the values are now negative), then an integer add to undo the shift. _mm_ceil_ps costs 2 uops on most CPUs, so that's about break-even, trading an FP operation for an integer one, but it requires more setup.
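
A minimal sketch of that range-shift idea, assuming (as in the comments) the inputs are known to be in [0, 16]; the function name is hypothetical:

#include <emmintrin.h>  // SSE2 is enough for this variant

__m128i ceil_to_int_rangeshift(__m128 v)                 // assumes v in [0.0, 16.0]
{
    __m128 shifted = _mm_sub_ps(v, _mm_set1_ps(16.0f));  // now in [-16, 0]
    __m128i t = _mm_cvttps_epi32(shifted);               // truncation rounds negative values up: ceil(v) - 16
    return _mm_add_epi32(t, _mm_set1_epi32(16));         // undo the shift
}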

Integer min/max are cheaper (lower latency, and better throughput) than FP, so prefer clamp after conversion. Out-of-range floats convert to INT_MIN (high-bit set, others zero, what Intel calls the "integer indefinite" value) so will clamp to 0.
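
To make that comparison concrete, here are the two strategies side by side, as a sketch (`ints` and `ceiled` standing in for whatever vectors the surrounding code produces):

__m128i ints = _mm_cvttps_epi32(ceiled);       // ceiled: some __m128 of already-ceil'd values

// (a) shift/andn (SSE2): broadcast the sign bit, then clear lanes where it was set
__m128i neg = _mm_srai_epi32(ints, 31);        // all-ones in negative lanes, zero elsewhere
__m128i a   = _mm_andnot_si128(neg, ints);     // two instructions

// (b) clamp (SSE4.1): a single, cheaper instruction, as preferred above
__m128i b   = _mm_max_epi32(ints, _mm_setzero_si128());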


If you have a lot of this to do in a loop that doesn't do other FP computation, change the MXCSR rounding mode for this loop to round towards +Inf. Use _mm_cvtps_epi32 (which uses the current FP rounding mode, like lrint / (int)nearbyint) instead of ceil + cvtt (truncation).
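
A sketch of what that looks like, with `src`, `dst`, and `n` as hypothetical loop variables; note that compilers may reorder FP operations across MXCSR changes unless you take care (e.g. `#pragma STDC FENV_ACCESS ON`), so treat this as illustrative:

#include <immintrin.h>

unsigned int old_mode = _MM_GET_ROUNDING_MODE();
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);                 // MXCSR: round towards +Inf

for (size_t i = 0; i < n; i += 4) {
    __m128 v = _mm_loadu_ps(src + i);
    __m128i up = _mm_cvtps_epi32(v);                 // rounds up under the current mode: no separate ceil needed
    _mm_storeu_si128((__m128i *)(dst + i), up);
}

_MM_SET_ROUNDING_MODE(old_mode);                     // restore round-to-nearest for the rest of the code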


This use-case: ceil(log2(float))

You could just pull that out of the FP bit pattern directly, and round up based on a non-zero mantissa. Binary floating point already contains a power-of-2 exponent field, so you just need to extract that with a bit of massaging.

Like _mm_and_si128 / _mm_cmpeq_epi32 / _mm_add_epi32 to add the -1 compare result for FP values with a zero mantissa (mask the bit pattern down to just the mantissa field and compare it against zero), so you treat powers of 2 differently from anything higher.

Should be faster than computing an FP log base e with a fractional part, even if it's only a quick approximation. Values smaller than 1.0, whose exponent field is below the bias (so the true log2 is negative), may need some extra handling.

Also, since you want all four indices, probably faster to just store to an array of 4 uint32_t values and access it, instead of using movd + 3x pextrd.
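
Applied to the loop from the question's edit, the store/reload version looks something like this (Peter suggests `alignas(16) int tmp[4]` in the comments below, so the store is a single aligned movdqa):

alignas(16) int tmp[4];
_mm_store_si128((__m128i *)tmp, indexes);      // one vector store, store-forwarded to...
for (int i = 0; i < 4; i++)
    waveTablesData[i] = &mWaveTables[tmp[i]];  // ...four cheap scalar reloads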

An even better way to round up to the next exponent for floats with a non-zero mantissa is to simply do an integer add of 0x007fffff to the bit-pattern. (23 set bits: https://en.wikipedia.org/wiki/Single-precision_floating-point_format).

// we round up the exponent separately from unbiasing.
// Might be possible to do better
__m128i ceil_log2_not_fully_optimized(__m128 v)
{
    // round up to the next power of 2 (exponent value) by carry-out from mantissa field into exponent
    __m128i floatbits = _mm_add_epi32(_mm_castps_si128(v), _mm_set1_epi32(0x007fffff));    

    __m128i exp = _mm_srai_epi32(floatbits, 23);   // arithmetic shift so negative numbers stay negative
    exp = _mm_sub_epi32(exp, _mm_set1_epi32(127));  // undo the bias
    exp = _mm_max_epi32(exp, _mm_setzero_si128());  // clamp negative numbers to zero.
    return exp;
}

If the exponent field was already all-ones, that means +Inf with an all-zero mantissa, else NaN. So it's only possible for carry propagation from the first add to flip the sign bit if the input was already NaN. +Inf gets treated as one exponent higher than FLT_MAX. 0.0 and 0.01 should both come out to 0, if I got this right.

According to GCC on Godbolt, I think so: https://godbolt.org/z/9G9orWj16 GCC doesn't fully constant-propagate through it, so we can actually see the input to pmaxsd, and see that 0.0 and 0.01 come out to max(0, -127) and max(0,-3) = 0 each. And 3.0 and 4.0 both come out to max(0, 2) = 2.


We can maybe even combine that +0x7ff... idea with adding a negative number to the exponent field to undo the bias.

Or, to get carry-out into the sign bit correct, subtract from it, with a 1 in the mantissa field so an all-zero mantissa will propagate a borrow and subtract one more from the exponent field? Small exponents less than the bias could still carry/borrow out and flip the sign bit, but that might be OK if we're going to clamp such values to zero anyway when they come out as small positive instead of negative.

I haven't worked out the details of this yet; if we need to handle the original input being zero, this might be a problem. If we can assume the original sign bit was cleared, but log(x) might be negative (i.e. exponent field below the bias), this should work just fine; carry out of the exponent field into the sign bit is exactly what we want in that case, so srai keeps it negative and max chooses the 0.

    // round up to the next power of 2 (exponent value) while undoing the bias
    const uint32_t f32_unbias = ((-127)<<23) + 0x007fffffU;
    ???
    profit
  • that was the question :) isn't there a _mm_cvtps_epi32 variant which rounds up without changing MXCSR? (which is expensive to change at runtime) – markzzz Feb 01 '22 at 09:21
  • @markzzz: No, there's only truncation or current rounding mode for conversions, until AVX-512 where you can override MXCSR on a per-instruction basis. Truncation is specially provided for because that's how C `(int)float` works, so it's frequently needed for scalar code, and for vectorizing such code. You might get away with `+0.5` as an approximation before round-to-nearest, but I forget which cases that's "wrong" for. – Peter Cordes Feb 01 '22 at 09:23
  • I tried the +0.5 trick, but it seems it always "truncates" rather than rounding to nearest (maybe it's my MXCSR setting?). don't know... – markzzz Feb 01 '22 at 10:28
  • also: added a "new" version (thanks to your suggestions). let me know if you see any improvements :) – markzzz Feb 01 '22 at 10:33
  • @markzzz: `_mm_cvttps_epi32` is always truncation; `_mm_cvtps_epi32` uses the current rounding mode. I meant `_mm_cvtps_epi32( v + 0.5 )` as a somewhat faster approximation to ceil, of course, not still using the truncating version. – Peter Cordes Feb 01 '22 at 10:56
  • @markzzz: Oh, I just realized you want the log2 of floats. You could just pull that out of the FP bit pattern directly, and round up based on a non-zero mantissa. (Like pcmpeqd / psubd to decrement for a zero mantissa, so you treat powers of 2 differently from anything higher) Should be faster than computing an FP log base e with a fractional part. – Peter Cordes Feb 01 '22 at 10:58
  • what about truncate and then +1 instead of ceil? :) about the second part about log, I really don't get what you are suggesting. any "scalar" math example before going vector? – markzzz Feb 01 '22 at 11:25
  • @markzzz: `truncate(x) + 1 != ceil(x)` when `x` is already a whole number, so ceil and truncate (and nearbyint) would all leave it unchanged. – Peter Cordes Feb 01 '22 at 12:06
  • @markzzz: I'm suggesting doing an integer log2 directly, by taking the exponent field of the FP bit-pattern since binary floating point is already based on powers of 2. Since you don't actually want the fractional part of the log2 at all, there's no point in using a polynomial approximation to it, or in getting natural log and then scaling back to log2. See [Efficient implementation of log2(\_\_m256d) in AVX2](https://stackoverflow.com/q/45770089) for some discussion of this as part of a full SIMD log implementation; you'd want to just take pieces of it. – Peter Cordes Feb 01 '22 at 12:10
  • about truncate, that won't be a huge problem in my case (it will go to the "next index" at the first float after a whole number, which is pretty acceptable). i.e. whether the ceil happens at 2.0 or at 2.0001 is the same to me. about the integer log2, I'll have a read and return soon, thanks :) – markzzz Feb 01 '22 at 12:47
  • @markzzz: Oh, well if you don't care what happens for whole numbers, that's obviously a useful optimization you can do. Again, that wasn't stated in the question, so it's not something answers can assume with no details on the use-case initially. (I assumed that difference in cut-point was the reason for using `ceil` in the first place instead of getting it for free in the scalar addressing mode after `trunc`. Or some trickiness of getting it right for `trunc(0)` with the order of when to clamp, if anything should use the smallest bucket.) – Peter Cordes Feb 01 '22 at 12:56
  • about the log2 optimization: are you suggesting to "round up" `phaseIncrement.v * mMinTopFreq.v` (which will give 1, 2, 4, 8, 16, 32) and then shift to the right instead of using `sse_mathfun_log_ps`? will this give 0, 1, 2, 3, 4, 5 accordingly? :O – markzzz Feb 01 '22 at 13:31
  • @markzzz: Hmm, integer add (`_mm_add_epi32`) of `0x007fffff` should actually work to round up floats with a non-zero mantissa to the next exponent value, so yes, I am suggesting that. Then `_mm_srai_epi32(v, 23)` to get the exponent field as an integer. (Negative if the float was negative). Except the exponent field is biased, so that needs accounting for. https://en.wikipedia.org/wiki/Single-precision_floating-point_format – Peter Cordes Feb 01 '22 at 13:37
  • uhm... `_mm_srai_epi32(_mm_set1_epi32(32), 23)` gives me 0, not 5 (if I got what you mean correctly) – markzzz Feb 01 '22 at 13:56
  • @markzzz: Updated my answer with what I suggested. Seems to work for the few `_mm_setr_ps` inputs I tested. Of course `32 >> 23` is zero, IDK where you got that from; that's using `0x00000020` from an integer `set`. You want to shift the FP bit patterns to extract the exponent field, not an `(int)log2(x)` result! – Peter Cordes Feb 01 '22 at 14:12
  • wow, very awesome! it seems to work like a charm! what a hero you are, thanks!! – markzzz Feb 01 '22 at 15:36
  • "Also, since you want all four indices, probably faster to just store to an array of 4 uint32_t values and access it, instead of using movd + 3x pextrd." not sure about this suggestion. in fact the indexes are there to select a wave table index, which is references by pointer. isn't m128i an "array" in the end? – markzzz Feb 01 '22 at 15:49
  • @markzzz: When it's in an XMM register, no, it's not an array. That's why `_mm_extract_epi32` is an intrinsic for a `pextrd` instruction which runs as two ALU uops (shuffle and movd), instead of 1 uop for a load port. Look at the compiler-generated asm, and see https://uops.info/ and https://agner.org/optimize/. – Peter Cordes Feb 01 '22 at 15:57
  • The fact that you use the vector element later as an index for an array is why you need to get it from a vector register to a GP-integer register in the first place, since you don't have AVX2 for SIMD gathers, but that has no bearing on *how* you choose to get it there. (Or not much, anyway; if there was a back-end bottleneck on load ports in this and the surrounding code, an ALU strategy would be worth considering, but with 2 loads per clock vs. 1 shuffle per clock, store/reload is cheaper for front-end uop throughput. Still `movd` the low element though.) – Peter Cordes Feb 01 '22 at 16:00
  • So basically you suggest, once the ceil/log is computed, to just store it in an array, and then access each position with operator[], right? My understanding (wrong, I guess) is that register access is faster than memory, hence I asked "why an array" (arrays basically living in memory) :P – markzzz Feb 01 '22 at 17:46
  • @markzzz: Just a temporary `alignas(16) int tmp[4]`. Store-forwarding from a vector store to scalar reloads works, so the CPU doesn't even have to wait for the data to commit to cache. Often it is better to shuffle instead of store/reload, but when you want all the scalar elements from a vector it starts to be worth looking at store/reload around 4 elements. For a single element, yes, `_mm_extract_epi32` is good. [What do you do without fast gather and scatter in AVX2 instructions?](https://stackoverflow.com/a/51131155) compares both strategies for the similar case of 4x i64 from `__m256i` – Peter Cordes Feb 01 '22 at 22:52
  • I see. maybe this can be useful for you: http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html (some code pretty similar to what you posted; maybe it can be optimized further?) – markzzz Feb 02 '22 at 12:42
  • @markzzz: If you want an actual log, not just the integer part of a log2, then yeah that's one way. That's one of the links I included in my answer on [Efficient implementation of log2(\_\_m256d) in AVX2](https://stackoverflow.com/a/45787548) - a few years ago I worked on a job that needed fast SIMD FP logs, but we ended up going with a slightly slower more accurate version using a polynomial of degree 4 or so for the mantissa (like in http://jrfonseca.blogspot.com/2008/09/fast-sse2-pow-tables-or-polynomials.html), especially with AVX-512 where `getmantps` was available. – Peter Cordes Feb 02 '22 at 12:51
  • nope, the integer part of log2 is what I need, it seems :) – markzzz Feb 02 '22 at 12:55
  • @markzzz: Ok, then that link you found may be helpful in understanding how binary floating-point works, but is certainly not going to be faster than 4 integer instructions, only 1 of which even touches the mantissa bits of the incoming float. – Peter Cordes Feb 02 '22 at 12:57
  • Question: for which float input values will your ceil_log2_not_fully_optimized fail? I think I've got an edge case, but first I need to be sure it's my fault and not a limit of your function... – markzzz Feb 02 '22 at 18:28
  • @markzzz: I think only `-NaN`, but certainly I could be missing something. And of course `+Inf` produces a finite integer output. What input was problematic? It should be easy to exhaustively test all 1024 combinations of every exponent value, sign bit, and 0 vs. non-zero mantissa against a reference implementation that uses actual scalar `log` (see the test sketch after these comments). (Or if you want, all ~4 billion float bit-patterns: https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/) – Peter Cordes Feb 03 '22 at 01:36
  • sorry for the "late" reply. about "store-forwarding from a vector": I tried _mm_store_si128((__m128i *)tmp, indexes);, but it doesn't gain that much. maybe it's already optimized out by the compiler, or is that the wrong way to "store"? – markzzz Feb 16 '22 at 19:40
  • @markzzz: Maybe the next bottleneck after shuffle ALU throughput wasn't far behind, or that it only helped a small amount for front-end throughput. It's not huge for 4 elements, unlike 8 or 16. Or yes, it's possible that a compiler would compile that store + `tmp[1]` etc. back into `pextrd`. But since you did see a difference, probably not. It can't optimize *out*; it needs the data from the vector in an integer register somehow! Check the asm if you're curious. https://godbolt.org/ is useful to quickly find the asm associated with a source line. – Peter Cordes Feb 16 '22 at 19:46
  • @markzzz: Thanks for confirming that it was at least a bit faster; glad to hear my intuition was right that it would be at least as good, but not huge. – Peter Cordes Feb 16 '22 at 19:47
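
Following up on the exhaustive-testing suggestion above, a hypothetical harness: it walks every exponent value with a zero and two non-zero mantissa patterns, positive sign only (negative inputs and NaN are outside the contract), and checks ceil_log2_not_fully_optimized from the answer (assumed to be in scope) against a scalar reference built from std::ceil(std::log2(x)) clamped at 0:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <immintrin.h>

int main()
{
    const uint32_t mants[] = { 0u, 0x00400000u, 0x007fffffu };  // zero, mid, all-ones mantissa
    for (uint32_t e = 0; e < 255; e++) {                        // exponent field; 255 = Inf/NaN, skipped
        for (uint32_t mant : mants) {
            uint32_t bits = e << 23 | mant;
            if (bits == 0) continue;                            // log2(0) is -Inf, skip
            float f;
            std::memcpy(&f, &bits, sizeof f);
            int ref = std::max(0, (int)std::ceil(std::log2((double)f)));
            int got = _mm_cvtsi128_si32(ceil_log2_not_fully_optimized(_mm_set1_ps(f)));
            if (got != ref)
                std::printf("mismatch: %a -> %d, expected %d\n", f, got, ref);
        }
    }
    return 0;
}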