
At runtime I have two ranges defined by their uint32_t borders a..b and c..d. The first range tends to be much greater than the second: 8 < (b - a) / (d - c) < 64.
Exact limits: a >= 0, b <= 2^31 - 1, c >= 0, d <= 2^20 - 1.

I need a routine that performs a linear mapping of an integer from the first range onto the second one: f(uint32_t x) -> round_to_uint32_t((float)(x - a) / (b - a) * (d - c) + c).
When b - a >= d - c it is important to maintain the ratio as close to ideal as possible; otherwise, when an element from [a; b] can be mapped onto more than one integer from [c; d], it is okay to return any of those integers.
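For reference, a naive baseline for the formula above might look like the following sketch (the function name is illustrative, not from my actual code). It uses double so that the baseline itself is exact, since every uint32_t is exactly representable as a double:

```cpp
#include <cstdint>
#include <cmath>

// Naive baseline: one runtime division per call, so it is slow,
// but exact for these ranges (double represents every uint32_t exactly).
static uint32_t map_range_ref(uint32_t x, uint32_t a, uint32_t b,
                              uint32_t c, uint32_t d)
{
    return (uint32_t)std::lround((double)(x - a) / (b - a) * (d - c) + c);
}
```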

This sounds like a simple ratio problem that has already been answered in many questions like
Convert a number range to another range, maintaining ratio
but here I need a really, really fast solution.

This routine is a pivotal part of a specialized sorting algorithm and will be called at least once for every element of a sorted array.

SIMD solution is also acceptable if it doesn't drop overall performance.

Andrey Godyaev
  • What do you mean "not *heavy* AVX"? `_mm256_fmadd_ps` looks like the obvious way to go, with a pre-computed loop-invariant value for the multiplier `(d - c) / (b - a)`. Can you usefully do this on a group of inputs that are in contiguous memory? If not, then scalar `fma()` is still good, and simplifies the uint32_t <-> float conversion. (You can just convert to/from scalar 64-bit integers, and treat the low half as unsigned). – Peter Cordes Nov 07 '19 at 03:47
  • @PeterCordes I'm sorry for this spurious remark. I'm new to low-level stuff and I often get performance drops when tinkering with AVX. Maybe this is because of a quite outdated benchmark CPU, which is one core of a Xeon E5-2660. Can you write your solution as a full answer? – Andrey Godyaev Nov 07 '19 at 07:20
  • This is for a Bucket Sort or something, right? (So you eventually need these results in integer regs for an addressing mode.) Can you Radix sort instead, using the top few bits as an index? Or can you assume that `x-a` <= `INT_MAX`? Or can you take advantage of that as a special case for some data sets? I'm working on an answer but didn't know whether to assume huge ranges (bigger than FP rounding error from a range-shift to signed int) or work with small unsigned integers. Or if not small, then at least `a..b` being a small range compared to the full range of `uint32_t`. – Peter Cordes Nov 07 '19 at 07:22
  • No, I've already tried Bucket and Radix Sorts in many variations. I have a custom algorithm for particular data sets. – Andrey Godyaev Nov 07 '19 at 07:24
  • Ok, but did I guess right that you eventually want to use this scaled number to figure out where to store the original? So you need to either get it back into integer registers for a scalar addressing mode or possibly use it with an AVX512 scatter store. Although *probably* it's still a win to use SIMD even if you can't just store 4x or 8x `uint32_t` results to contiguous memory. – Peter Cordes Nov 07 '19 at 07:27
  • Exact limits: `a >= 0`, `b <= 2^31 - 1`, `c >= 0`, `d <= 2^20 - 1`. It is always possible to zero `c`. – Andrey Godyaev Nov 07 '19 at 07:30
  • Ok perfect, so you can treat your inputs as signed-positive or unsigned, and `x-a` is always going to be signed positive, not still in the upper half of the unsigned range. And it's ok to round to single-precision `float`? Would rounding `x` to `float` instead of `x-a` be ok? That lets you do it with cvt + vfmaddps + cvt (because the `-a * scale` can be combined with the `+c` at the end). – Peter Cordes Nov 07 '19 at 07:33
  • Yes, I'm going to use the routine's results as pointer offsets and I need them as integers. Actually this is reflected in the formula. – Andrey Godyaev Nov 07 '19 at 07:33
  • Yes, there are only unsigned integers and no possibility of an overflow on subtraction. – Andrey Godyaev Nov 07 '19 at 07:35
  • Yes I know you need them as `uint32_t`, but the question was whether you needed to get them out YMM0 into RAX (a GP-integer *register*) for use in an addressing mode, instead of just storing 8 results to memory with `vmovdqu`. – Peter Cordes Nov 07 '19 at 07:35
  • IDK what you mean by "they are *only* unsigned integers". The full range of `uint32_t` is `0 .. 2^32 - 1`. You couldn't implement this as efficiently if `x-a` could be >= INT_MAX. So it's a good thing that your numbers have the same value if interpreted as `int32_t` (i.e. their high bit isn't set). Anyway, I understand what the actual ranges are, I'm only nit-picking about the terminology in your followup comment. – Peter Cordes Nov 07 '19 at 07:39
  • Well, I need this snippet as an inline function to be used in several places in C++ code. So it should be in the `RAX` register. – Andrey Godyaev Nov 07 '19 at 07:41
  • If parts of that C++ code can vectorize, then you wouldn't need it in RAX. That's why I asked *how* you used the value (specifically as part of a memory address). Apparently the answer to the important question was yes, so you do need it in RAX (or another integer reg), but that doesn't follow automatically from using it several times in C++. You tagged this assembly; I'm thinking in terms of what code-gen we want the compiler to make. (Because that's how you write efficient C++) – Peter Cordes Nov 07 '19 at 08:33
  • You reuse the same a,b,c,d for multiple `x` values, right? Like at least a full SIMD vector of 8? – Peter Cordes Nov 07 '19 at 08:39
  • Yes, I reuse those many times, up to millions of calculations without `a`, `b`, `c`, `d` changing. – Andrey Godyaev Nov 07 '19 at 08:44
  • @chtz you're very possibly correct; my own code hasn't reached the readiness level where I would check border conditions. It's just an ugly patch that deals with a bottleneck. – Andrey Godyaev Nov 07 '19 at 12:30
  • @AndreyGodyaev (I realized Peter already asked the exact same question, so I deleted it). If you can de-facto just work on one element at a time, you could just convert to `double` (every `[u]int32` has an exact representation as `double`) and do the mapping in that domain. – chtz Nov 07 '19 at 12:40
  • For clarification, could you specify (inside the question, not the comments) if (and at what points) you are fine with losing precision. Also write how you intend to use the scaled result. The information is hard to extract from the comments. – chtz Nov 07 '19 at 12:49

1 Answer

Actual runtime division (FP and integer) is very slow, so you definitely want to avoid it. The way you wrote that expression probably compiles to include a division, because FP math is not associative (without -ffast-math): the compiler can't turn x / foo * bar into x * (bar/foo) for you, even though that's exactly what you want with a loop-invariant bar/foo. You do need either floating point or 64-bit integers to avoid overflow in a multiply, but only FP lets you reuse a non-integer loop-invariant division result.
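As a scalar sketch of that hoisting (the names here are mine, not from the OP's code), do one double division up front and reduce the per-element work to a multiply and an add:

```cpp
#include <cstdint>
#include <cmath>

// Hoist the loop-invariant division out of the per-element work:
// one double division when the ranges are set up, then only
// a multiply + add + round per element.
struct RangeScaler {
    double scale;   // (d - c) / (b - a), computed once
    double offset;  // c - a * scale, so f(x) = x * scale + offset
};

static RangeScaler make_scaler(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    double scale = (double)(d - c) / (double)(b - a);
    return RangeScaler{ scale, (double)c - (double)a * scale };
}

static uint32_t scale_one(const RangeScaler &s, uint32_t x)
{
    return (uint32_t)std::lround((double)x * s.scale + s.offset);
}
```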

_mm256_fmadd_ps looks like the obvious way to go, with a pre-computed loop-invariant value for the multiplier (d - c) / (b - a). If float rounding isn't a problem for doing it strictly in order (multiply then divide), it's probably ok to do this inexact division first, outside the loop, like _mm256_set1_ps((d - c) / (double)(b - a)). Using double for this calculation avoids rounding error during conversion to FP of the division operands.

You're reusing the same a,b,c,d for many x, presumably coming from contiguous memory. You're using the result as part of a memory address so you do eventually need the results back from SIMD into integer registers, unfortunately. (Possibly with AVX512 scatter stores you could avoid that.)

Modern x86 CPUs have 2/clock load throughput so probably your best bet for getting 8x uint32_t back into integer registers is a vector store / integer reload, instead of spending 2 uops per element for ALU shuffle stuff. That has some latency so I'd suggest converting into a tmp buffer of maybe 16 or 32 ints (64 or 128 bytes), i.e. 2x or 4x __m256i before looping through that scalar.

Or maybe alternate converting and storing one vector then looping over the 8 elements of another one that you converted earlier. i.e. software pipelining. Out-of-order execution can hide latency but you're already going to be stretching its latency-hiding capability for cache misses for whatever you're doing with memory.

Depending on your CPU (e.g. Haswell or some Skylake), using 256-bit vector instructions might cap your max turbo slightly lower than it would otherwise. You might consider only doing vectors of 4 at once but then you're spending more uops per element.

If not SIMD, then even scalar C++ fma() is still good, for vfmadd213sd, but using intrinsics is a very convenient way to get rounding (instead of truncation) from float -> int (vcvtps2dq rather than vcvttps2dq).
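A scalar sketch of that (my naming, with the loop-invariant scale/offset assumed precomputed by the caller): fmaf() compiles to a vfmadd* instruction with FMA enabled, and lrintf() compiles to (v)cvtss2si, which rounds to nearest-even in the default rounding mode, matching _mm256_cvtps_epi32:

```cpp
#include <cstdint>
#include <cmath>

// Scalar per-element mapping: one FMA plus one rounding conversion.
// lrintf rounds to nearest (nearest-even in the default FP environment),
// unlike a plain C cast, which truncates.
static uint32_t range_scale_scalar(uint32_t x, float scale, float offset)
{
    // x is known signed-positive, so the int32_t round-trip is value-preserving
    return (uint32_t)std::lrintf(std::fmaf((float)(int32_t)x, scale, offset));
}
```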


Note that uint32_t <-> float conversion isn't directly available until AVX512. For scalar you can just convert to/from int64_t with truncation / zero-extension for the unsigned low half.
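That scalar workaround can be sketched like this (helper names are mine): widen to int64_t, which holds every uint32_t value exactly, and truncate back to the low 32 bits on the way out:

```cpp
#include <cstdint>

// No direct uint32 <-> float conversion exists before AVX-512.
// Scalar workaround: go through int64_t, which represents every
// uint32_t value exactly; the low 32 bits are the unsigned result.
static float u32_to_float(uint32_t x)   { return (float)(int64_t)x; }
static uint32_t float_to_u32(float f)   { return (uint32_t)(int64_t)f; }
```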

It's very convenient that (as discussed in comments) your inputs are range-limited, so if you interpret them as signed integers they have the same value (signed non-negative). Both x and x-a (and b-a) are known to be positive and <= INT32_MAX, i.e. 0x7FFFFFFF. (Or at least non-negative; zero is fine.)

Float Rounding

For SIMD, single-precision float is very good for throughput, with efficient packed conversion to/from signed int32_t. But not every int32_t can be exactly represented as a float: larger values get rounded to the nearest multiple of 2, 4, 8, or more, the farther above 2^24 the value is.
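A quick way to see that spacing in action (helper name is mine):

```cpp
#include <cstdint>

// Checks whether a uint32_t survives a round trip through float.
// Everything up to 2^24 is exact; above that, only even values,
// then multiples of 4, and so on, are representable.
static bool float_is_exact(uint32_t v)
{
    return (uint32_t)(float)v == v;
}
```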

Using SIMD double is possible but requires some shuffling.

I don't think float is usually a problem for the formula as-written with (float)(x-a). If the b-a input range is large, that means both ranges are large and rounding error isn't going to map all possible x values into the same output. Depending on the multiplier, the input rounding error might be worse than the output rounding error, maybe leaving some representable output floats unused for higher x-a values.

But if we want to factor out the -a * (d - c) / (b - a) part and combine it with the +c at the end, then

  1. We potentially have precision loss from catastrophic cancellation in that value to be added.
  2. We need to do (float)x on the raw input value. If a is huge and b-a is small, i.e. a small range near the top of the possible input range, rounding error can map all possible x values to the same float.
  3. To make best use of FMA, we want to do the +c before converting back to integer, which again risks output rounding error if d-c is a small output range but c is huge. In your case that's not a problem: with d <= 2^20 - 1 we know that float can exactly represent every output integer value in that c..d range.
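Point 2 is easy to demonstrate with a small sketch (helper name is mine): floats near 2^30 are spaced 128 apart, so converting the raw x collapses inputs that only differ in the low bits, while x-a would stay exact:

```cpp
#include <cstdint>

// Returns true if two distinct inputs become indistinguishable
// after conversion to float -- the failure mode when a is huge
// and b-a is small, and we convert raw x instead of x-a.
static bool same_after_float(uint32_t x, uint32_t y)
{
    return (float)x == (float)y;
}
```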

If you didn't have the input range constraint, you could range-shift to/from signed before the scaling by using integer (x-a)+0x80000000U on input and ...+c+0x80000000U on output (after rounding to nearest int32_t). But that would introduce huge float rounding error for small uint32_t inputs (close to 0) which get range-shifted to close to INT_MIN.

We don't need to range-shift for the b-a or d-c because the + or - or XOR with 0x80000000U would cancel out in the subtractions.


Example:

The const vectors should be hoisted out of a loop by the compiler after this inlines, or you can do that manually.

This requires AVX1 + FMA (e.g. AMD Piledriver or Intel Haswell or later). Untested, sorry I didn't even throw this on Godbolt to see if it compiles.

// fastest but not safe if b-a is small and a > 2^24
static inline
__m256i range_scale_fast_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    // avoid rounding errors when computing the scale factor, but convert double->float on the final result
    double scale_scalar = (d - c) / (double)(b - a);
    const __m256 scale = _mm256_set1_ps((float)scale_scalar);
    const __m256 add = _mm256_set1_ps((float)((double)c - (double)a * scale_scalar));
    //    (x-a) * scale + c
    // =  x * scale + (-a*scale + c)   but with different rounding error from doing -a*scale + c

    __m256  in = _mm256_cvtepi32_ps(data);
    __m256  out = _mm256_fmadd_ps(in, scale, add);
    return _mm256_cvtps_epi32(out);   // convert back with round to nearest-even
                                      // _mm256_cvttps_epi32 truncates, matching C rounding; maybe good for scalar testing
}

Or a safer version, doing the input range-shift with integer math. You could easily avoid FMA here if necessary for portability (to plain AVX1), and use an integer add for the output, too. But we know the output range is small enough that float can always exactly represent any integer in it.

static inline
__m256i range_scale_safe_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    // avoid rounding errors when computing the scale factor, but convert double->float on the final result
    const __m256 scale = _mm256_set1_ps((float)((d - c) / (double)(b - a)));
    const __m256 cvec = _mm256_set1_ps((float)c);

    __m256i in_offset = _mm256_add_epi32(data, _mm256_set1_epi32(-(int32_t)a));  // add can more easily fold a load of a memory operand than sub because it's commutative.  Only some compilers will do this for you.
    __m256  in_fp = _mm256_cvtepi32_ps(in_offset);
    __m256  out = _mm256_fmadd_ps(in_fp, scale, cvec);  // in*scale + c
    return _mm256_cvtps_epi32(out);
}

Without FMA you could still use vmulps. You might as well convert back to integer before adding c if you're doing that, although vaddps would be safe.

You might use this in a loop like

void foo(uint32_t *arr, ptrdiff_t len, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    if (len < 24) { /* special case: plain scalar loop */ return; }

    alignas(32) uint32_t tmpbuf[16];

    // peel half of first iteration for software pipelining / loop rotation
    __m256i arrdata = _mm256_loadu_si256((const __m256i*)&arr[0]);
    __m256i outrange = range_scale_safe_fma(arrdata, a, b, c, d);
    _mm256_store_si256((__m256i*)tmpbuf, outrange);

    // could have used an unsigned loop counter
    // since we probably just need an if() special case handler anyway for small len which could give len-23 < 0
    for (ptrdiff_t i = 0 ; i < len-(15+8) ; i+=16 ) {

        // prep next 8 elements
        arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+8]);
        outrange = range_scale_safe_fma(arrdata, a, b, c, d);
        _mm256_store_si256((__m256i*)&tmpbuf[8], outrange);

        // use first 8 elements
        for (int j=0 ; j<8 ; j++) {
            // use tmpbuf[j], which corresponds to arr[i+j]
        }

        // prep 8 more for next iteration
        arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+16]);
        outrange = range_scale_safe_fma(arrdata, a, b, c, d);
        _mm256_store_si256((__m256i*)&tmpbuf[0], outrange);

        // use 2nd 8 elements
        for (int j=8 ; j<16 ; j++) {
            // use tmpbuf[j], which corresponds to arr[i+j]
        }
    }

    // use tmpbuf[0..7]
    // then cleanup: one vector at a time until < 8 or < 4 with 128-bit vectors, then scalar
}

These variable names sound dumb, but I couldn't think of anything better.

This software pipelining is an optimization; you can just get it working / try it out with a single vector at a time used right away. (Optimize the reload of the first element from a reload to vmovd using _mm_cvtsi128_si32(_mm256_castsi256_si128(outrange)) if you want.)


Special cases

If there are cases where you know (b - a) is a power of 2, you could bit-scan with tzcnt or bsf to find the shift count, then use a multiply and right shift instead of division. (There are intrinsics for those, like GNU C __builtin_ctz() to count trailing zeros.)

Or can you ensure that (b - a) is always a power of 2?

Or better, if (b - a) / (d - c) is an exact power of 2 the whole thing can just be sub / right shift / add.
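A sketch of that all-integer special case (naming is mine), assuming the caller has already verified the ratio is an exact power of 2 and found the shift count:

```cpp
#include <cstdint>

// Special case: when (b - a) / (d - c) is exactly 2^shift, the whole
// mapping reduces to subtract / right-shift / add, with no FP at all.
// shift = __builtin_ctz((b - a) / (d - c)) in GNU C (tzcnt/bsf in asm).
static uint32_t range_scale_pow2(uint32_t x, uint32_t a, uint32_t c,
                                 unsigned shift)
{
    return ((x - a) >> shift) + c;
}
```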

If you can't always ensure that, you'd still need the general case sometimes, but it may be possible to check for the power-of-2 case efficiently.

Peter Cordes
  • Thank you. The MR that contained this subtask was finally merged. The implementation would have been much less effective without your help. – Andrey Godyaev Mar 12 '20 at 11:27