Actual runtime division (FP and integer) is very slow, so you definitely want to avoid that. The way you wrote that expression probably compiles to include a division, because FP math is not associative (without `-ffast-math`); the compiler can't turn `x / foo * bar` into `x * (bar/foo)` for you, even though that's very good with loop-invariant `bar/foo`. You do need either floating point or 64-bit integers to avoid overflow in a multiply, but only FP lets you reuse a non-integer loop-invariant division result.
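For instance, a minimal scalar sketch of that hoisting (the `a..d` names and the final `+ c` come from the formula discussed below; purely illustrative, not a drop-in):

```c++
#include <stdint.h>
#include <stddef.h>

// One division outside the loop; the hot loop is multiply/add only.
// (Division done in float here; see below for why double is better.)
void scale_all(uint32_t *out, const uint32_t *x, size_t n,
               uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    float scale = (float)(d - c) / (float)(b - a);   // hoisted, loop-invariant
    for (size_t i = 0; i < n; i++)
        out[i] = (uint32_t)((float)(x[i] - a) * scale + (float)c);
}
```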
`_mm256_fmadd_ps` looks like the obvious way to go, with a pre-computed loop-invariant value for the multiplier `(d - c) / (b - a)`. If `float` rounding isn't a problem for doing it strictly in order (multiply then divide), it's probably ok to do this inexact division first, outside the loop. Like `_mm256_set1_ps((d - c) / (double)(b - a))`. Using `double` for this calculation avoids rounding error during conversion to FP of the division operands.
You're reusing the same a,b,c,d for many `x`, presumably coming from contiguous memory. You're using the result as part of a memory address, so you do eventually need the results back from SIMD into integer registers, unfortunately. (Possibly with AVX512 scatter stores you could avoid that.)
Modern x86 CPUs have 2/clock load throughput, so probably your best bet for getting 8x uint32_t back into integer registers is a vector store / integer reload, instead of spending 2 uops per element for ALU shuffle stuff. That has some latency, so I'd suggest converting into a tmp buffer of maybe 16 or 32 ints (64 or 128 bytes), i.e. 2x or 4x `__m256i`, before looping through that scalar.
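A minimal sketch of that store/reload pattern, before any of the pipelining discussed next (`use_index()` is a hypothetical stand-in for whatever you do with each element):

```c++
#include <immintrin.h>
#include <stdint.h>

void use_index(uint32_t idx);   // hypothetical: defined elsewhere

static inline void consume_vector(__m256i outrange)
{
    alignas(32) uint32_t tmpbuf[8];
    _mm256_store_si256((__m256i*)tmpbuf, outrange);  // one vector store
    for (int j = 0; j < 8; j++)
        use_index(tmpbuf[j]);   // integer reloads: cheap at 2 loads/clock
}
```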
Or maybe alternate converting and storing one vector then looping over the 8 elements of another one that you converted earlier. i.e. software pipelining. Out-of-order execution can hide latency but you're already going to be stretching its latency-hiding capability for cache misses for whatever you're doing with memory.
Depending on your CPU (e.g. Haswell or some Skylake), using 256-bit vector instructions might cap your max turbo slightly lower than it would otherwise. You might consider only doing vectors of 4 at once but then you're spending more uops per element.
If not SIMD, then even scalar C++ `fma()` is still good, for `vfmadd213sd`, but using intrinsics is a very convenient way to get rounding (instead of truncation) from float -> int (`vcvtps2dq` rather than `vcvttps2dq`).
Note that `uint32_t` <-> `float` conversion isn't directly available until AVX512. For scalar you can just convert to/from `int64_t`, with truncation / zero-extension for the unsigned low half.
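A scalar sketch tying those two paragraphs together (assuming `scale` was precomputed in `double` as above; `lrintf` rounds to nearest like `vcvtps2dq`, and the `int64_t` round-trips handle the unsigned <-> float conversions):

```c++
#include <cmath>
#include <cstdint>

static inline uint32_t range_scale_scalar(uint32_t x, uint32_t a,
                                          float scale, uint32_t c)
{
    float xf = (float)(int64_t)(x - a);  // zero-extend unsigned, then convert
    return (uint32_t)(int64_t)std::lrintf(std::fmaf(xf, scale, (float)c));
}
```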
It's very convenient that (as discussed in comments) your inputs are range-limited, so if you interpret them as signed integers they have the same value (signed non-negative). Both `x` and `x-a` (and `b-a`) are known to be positive and <= INT32_MAX, i.e. `0x7FFFFFFF`. (Or at least non-negative. Zero is fine.)
## Float Rounding
For SIMD, single-precision `float` is very good for SIMD throughput, with efficient packed conversion to/from signed `int32_t`. But not every `int32_t` can be exactly represented as a `float`: the farther above 2^24 a value is, the coarser the rounding, to the nearest multiple of 2, then 4, then 8, and so on (with ties going to even).
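A standalone demo of that granularity, just to make it concrete:

```c++
#include <cstdio>

int main()
{
    printf("%.1f\n", (double)(float)16777216);  // 16777216.0: 2^24, exact
    printf("%.1f\n", (double)(float)16777217);  // 16777216.0: rounded (tie to even)
    printf("%.1f\n", (double)(float)16777219);  // 16777220.0: nearest multiple of 2
    printf("%.1f\n", (double)(float)33554434);  // 33554432.0: multiples of 4 above 2^25
}
```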
Using SIMD `double` is possible but requires some shuffling.
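A sketch of what that shuffling looks like (assuming the same scale/add factoring as the `float` version in the Example below; roughly twice the uops, plus the lane shuffles):

```c++
#include <immintrin.h>

// Split 8 ints into two 4-wide halves, scale in double, recombine.
static inline __m256i range_scale_pd(__m256i data, __m256d scale, __m256d add)
{
    __m128i lo = _mm256_castsi256_si128(data);
    __m128i hi = _mm256_extractf128_si256(data, 1);
    __m256d lod = _mm256_fmadd_pd(_mm256_cvtepi32_pd(lo), scale, add);
    __m256d hid = _mm256_fmadd_pd(_mm256_cvtepi32_pd(hi), scale, add);
    // cvtpd_epi32 rounds to nearest-even and narrows each half to 128 bits
    return _mm256_set_m128i(_mm256_cvtpd_epi32(hid), _mm256_cvtpd_epi32(lod));
}
```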
I don't think `float` is usually a problem for the formula as-written with `(float)(x-a)`. If the `b-a` input range is large, that means both ranges are large and rounding error isn't going to map all possible `x` values into the same output. Depending on the multiplier, the input rounding error might be worse than the output rounding error, maybe leaving some representable output floats unused for higher `x-a` values.
But if we want to factor out the `-a * (d - c) / (b - a)` part and combine it with the `+c` at the end, then

- We potentially have precision loss from catastrophic cancellation in that value to be added.
- We need to do `(float)x` on the raw input value. If `a` is huge and `b-a` is small, i.e. a small range near the top of the possible input range, rounding error can map all possible `x` values to the same float.
- To make best use of FMA, we want to do the `+c` before converting back to integer, which again risks output rounding error if `d-c` is a small output range but `c` is huge. In your case not a problem; with `d <= 2^20 - 1` we know that `float` can exactly represent every output integer value in that c..d range.
If you didn't have the input range constraint, you could range-shift to/from signed before the scaling, by using integer `(x-a)+0x80000000U` on input and `...+c+0x80000000U` on output (after rounding to nearest `int32_t`). But that would introduce huge `float` rounding error for small `uint32_t` inputs (close to 0) which get range-shifted to close to `INT_MIN`.
We don't need to range-shift for the `b-a` or `d-c` because the + or - or XOR with `0x80000000U` would cancel out in the subtractions.
## Example
The `const` vectors should be hoisted out of a loop by the compiler after this inlines, or you can do that manually.
This requires AVX1 + FMA (e.g. AMD Piledriver or Intel Haswell or later). Untested, sorry I didn't even throw this on Godbolt to see if it compiles.
```c++
#include <immintrin.h>
#include <stdint.h>

// fastest but not safe if b-a is small and a > 2^24
static inline
__m256i range_scale_fast_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    // avoid rounding errors when computing the scale factor, but convert double->float on the final result
    double scale_scalar = (d - c) / (double)(b - a);
    const __m256 scale = _mm256_set1_ps(scale_scalar);
    // negate a in double, not uint32_t, to avoid unsigned wraparound
    const __m256 add = _mm256_set1_ps(-(double)a * scale_scalar + c);
    // (x-a) * scale + c
    // = x * scale + (-a*scale + c)  but with different rounding error from doing -a*scale + c
    __m256 in = _mm256_cvtepi32_ps(data);
    __m256 out = _mm256_fmadd_ps(in, scale, add);
    return _mm256_cvtps_epi32(out);  // convert back with round to nearest-even
    // _mm256_cvttps_epi32 truncates, matching C rounding; maybe good for scalar testing
}
```
Or a safer version, doing the input range-shift with integer math: You could easily avoid FMA here if necessary for portability (just AVX1) and use an integer add for the output, too. But we know the output range is small enough that `float` can exactly represent any integer in it:
```c++
static inline
__m256i range_scale_safe_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    // avoid rounding errors when computing the scale factor, but convert double->float on the final result
    const __m256 scale = _mm256_set1_ps((d - c) / (double)(b - a));
    const __m256 cvec = _mm256_set1_ps(c);
    // add can more easily fold a load of a memory operand than sub because it's
    // commutative.  Only some compilers will do this for you.
    __m256i in_offset = _mm256_add_epi32(data, _mm256_set1_epi32(-(int32_t)a));
    __m256 in_fp = _mm256_cvtepi32_ps(in_offset);
    __m256 out = _mm256_fmadd_ps(in_fp, scale, cvec);  // in*scale + c
    return _mm256_cvtps_epi32(out);
}
```
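Since these are untested, a plain scalar reference (a hypothetical helper, doing everything in `double`) is handy for sanity-checking them; expect occasional off-by-one differences where `float` rounding kicks in:

```c++
#include <cmath>
#include <cstdint>

static inline uint32_t range_scale_ref(uint32_t x, uint32_t a, uint32_t b,
                                       uint32_t c, uint32_t d)
{
    double scale = (d - c) / (double)(b - a);
    return (uint32_t)std::lrint((x - a) * scale + c);  // round to nearest, like vcvtps2dq
}
```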
Without FMA you could still use `vmulps`. You might as well convert back to integer before adding `c` if you're doing that, although `vaddps` would be safe.
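A sketch of that no-FMA variant (hypothetical names; note the 256-bit integer add needs AVX2, so with only AVX1 you'd `_mm256_add_ps` a `cvec` before converting instead):

```c++
#include <immintrin.h>
#include <stdint.h>

static inline __m256i range_scale_mul(__m256i data, uint32_t a,
                                      uint32_t b, uint32_t c, uint32_t d)
{
    const __m256 scale = _mm256_set1_ps((d - c) / (double)(b - a));
    __m256i in_offset = _mm256_add_epi32(data, _mm256_set1_epi32(-(int32_t)a));
    __m256 prod = _mm256_mul_ps(_mm256_cvtepi32_ps(in_offset), scale);  // vmulps
    // convert (round to nearest), then add c as an integer
    return _mm256_add_epi32(_mm256_cvtps_epi32(prod), _mm256_set1_epi32((int32_t)c));
}
```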
You might use this in a loop like
```c++
// a,b,c,d passed through to the range-scale helper
void foo(uint32_t *arr, ptrdiff_t len, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    if (len < 24) { /* special case: scalar or single-vector fallback */ return; }

    alignas(32) uint32_t tmpbuf[16];

    // peel half of first iteration for software pipelining / loop rotation
    __m256i arrdata = _mm256_loadu_si256((const __m256i*)&arr[0]);
    __m256i outrange = range_scale_safe_fma(arrdata, a, b, c, d);
    _mm256_store_si256((__m256i*)tmpbuf, outrange);

    // could have used an unsigned loop counter
    // since we probably just need an if() special case handler anyway for small len which could give len-23 < 0
    ptrdiff_t i;
    for (i = 0 ; i < len-(15+8) ; i+=16 ) {
        // prep next 8 elements
        arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+8]);
        outrange = range_scale_safe_fma(arrdata, a, b, c, d);
        _mm256_store_si256((__m256i*)&tmpbuf[8], outrange);

        // use first 8 elements
        for (int j=0 ; j<8 ; j++) {
            // use tmpbuf[j], which corresponds to arr[i+j]
        }

        // prep 8 more for next iteration
        arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+16]);
        outrange = range_scale_safe_fma(arrdata, a, b, c, d);
        _mm256_store_si256((__m256i*)&tmpbuf[0], outrange);

        // use 2nd 8 elements
        for (int j=8 ; j<16 ; j++) {
            // use tmpbuf[j], which corresponds to arr[i+j]
        }
    }

    // use tmpbuf[0..7], which corresponds to arr[i..i+7]
    // then cleanup: one vector at a time until < 8 or < 4 with 128-bit vectors, then scalar
}
```
These variable-names sound dumb but I couldn't think of anything better.
This software pipelining is an optimization; you can just get it working / try it out with a single vector at a time, used right away. (If you want, optimize the reload of the first element into a `vmovd` extract instead of a memory reload, using `_mm_cvtsi128_si32(_mm256_castsi256_si128(outrange))`.)
## Special cases
If there are cases where you know `(b - a)` is a power of 2, you could bitscan with `tzcnt` or `bsf` to get the shift count, then multiply by `(d - c)` and shift right instead of dividing. (There are intrinsics for those, like GNU C `__builtin_ctz()` to count trailing zeros.)
Or can you ensure that `(b - a)` is always a power of 2?
Or better, if `(b - a) / (d - c)` is an exact power of 2, the whole thing can just be sub / right shift / add.
If you can't always ensure that, you'd still need the general case sometimes, but maybe it's possible to handle the power-of-2 cases efficiently when they apply.
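A scalar sketch of both special cases (hypothetical dispatch; GNU C `__builtin_ctz`; note the shifts truncate instead of rounding to nearest, unlike the FP versions):

```c++
#include <stdint.h>

static inline uint32_t range_scale_pow2(uint32_t x, uint32_t a, uint32_t b,
                                        uint32_t c, uint32_t d)
{
    uint32_t in_range = b - a, out_range = d - c;
    if (out_range && in_range % out_range == 0) {
        uint32_t q = in_range / out_range;
        if (q && (q & (q - 1)) == 0)                // (b-a)/(d-c) is a power of 2:
            return ((x - a) >> __builtin_ctz(q)) + c;   // sub / right shift / add
    }
    if (in_range && (in_range & (in_range - 1)) == 0) {  // b-a is a power of 2:
        unsigned k = __builtin_ctz(in_range);       // 64-bit product avoids overflow
        return (uint32_t)(((uint64_t)(x - a) * out_range) >> k) + c;
    }
    return range_scale_ref(x, a, b, c, d);  // general case (scalar reference above)
}
```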