24

There is an existing question "Average of 3 long integers" that is specifically concerned with the efficient computation of the average of three signed integers.

The use of unsigned integers, however, allows for additional optimizations that are not applicable to the scenario covered in the previous question. This question is about the efficient computation of the average of three unsigned integers, where the average is rounded towards zero, i.e. in mathematical terms I want to compute ⌊ (a + b + c) / 3 ⌋.

A straightforward way to compute this average is

 avg = a / 3 + b / 3 + c / 3 + (a % 3 + b % 3 + c % 3) / 3;

To first order, modern optimizing compilers will transform the divisions into multiplications with a reciprocal plus a shift, and the modulo operations into a back-multiply and a subtraction, where the back-multiply may use a scale_add idiom available on many architectures, e.g. lea on x86_64, add with lsl #n on ARM, iscadd on NVIDIA GPUs.
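
As an illustration, a minimal sketch of these replacements for a 32-bit unsigned operand (the helper names are just for exposition; the magic constant 0xAAAAAAAB is (2^33 + 1) / 3):

#include <stdint.h>

/* sketch: what compilers typically emit for x / 3 on unsigned 32-bit data:
   multiply by the reciprocal constant 0xAAAAAAAB, take the high half of the
   64-bit product, shift right by one (here folded into a single shift by 33) */
static uint32_t div3_by_reciprocal (uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

/* sketch: the corresponding remainder via back-multiply (a scale_add) and
   subtract */
static uint32_t mod3_by_backmultiply (uint32_t x)
{
    return x - 3 * div3_by_reciprocal (x);
}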

In trying to optimize the above in a generic fashion suitable for many common platforms, I observe that typically the cost of integer operations is in the relationship logical ≤ (add | sub) ≤ shift ≤ scale_add ≤ mul. Cost here refers to all of latency, throughput limitations, and power consumption. Any such differences become more pronounced when the integer type processed is wider than the native register width, e.g. when processing uint64_t data on a 32-bit processor.

My optimization strategy was therefore to minimize instruction count and replace "expensive" with "cheap" operations where possible, while not increasing register pressure and retaining exploitable parallelism for wide out-of-order processors.

The first observation is that we can reduce a sum of three operands into a sum of two operands by first applying a CSA (carry save adder) that produces a sum value and a carry value, where the carry value has twice the weight of the sum value. The cost of a software-based CSA is five logicals on most processors. Some processors, like NVIDIA GPUs, have a LOP3 instruction that can compute an arbitrary logical expression of three operands in one fell swoop, in which case the CSA condenses to two LOP3s (note: I have yet to convince the CUDA compiler to emit those two LOP3s; it currently produces four LOP3s!).
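
As a quick sanity check, a minimal sketch verifying the CSA invariant 2 * carry + sum = a + b + c per bit position (helper name for exposition only):

#include <assert.h>

/* sketch: check the carry save adder identity over all eight combinations
   of one bit taken from each of the three operands */
static void check_csa_bitwise (void)
{
    for (unsigned a = 0; a <= 1; a++) {
        for (unsigned b = 0; b <= 1; b++) {
            for (unsigned c = 0; c <= 1; c++) {
                unsigned sum   = (a ^ b) ^ c;
                unsigned carry = ((a ^ b) & c) | (a & b);
                assert (2 * carry + sum == a + b + c);
            }
        }
    }
}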

The second observation is that because we are computing the modulo of division by 3, we don't need a back-multiply to compute it. We can instead use dividend % 3 = ((dividend / 3) + dividend) & 3, reducing the modulo to an add plus a logical since we already have the division result. This is an instance of the general algorithm: dividend % (2ⁿ − 1) = ((dividend / (2ⁿ − 1)) + dividend) & (2ⁿ − 1).
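
To see why: with dividend = 3 * q + r and r < 3, q + dividend = 4 * q + r, and masking with 3 extracts r because r < 4. A minimal sketch exercising the trick:

#include <assert.h>
#include <stdint.h>

/* sketch: remainder of division by 3 from an already available quotient,
   using an add and a logical instead of a back-multiply */
static void check_mod3_trick (void)
{
    for (uint32_t x = 0; x < 100000; x++) {
        uint32_t q = x / 3;                 /* quotient already computed */
        assert (((x + q) & 3) == x % 3);    /* add + AND yields x % 3 */
    }
}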

Finally, for the division by 3 in the correction term (a % 3 + b % 3 + c % 3) / 3 we don't need the code for generic division by 3. Since the dividend is very small, in [0, 6], we can simplify x / 3 to (3 * x) / 8, which requires just a scale_add plus a shift.
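
A quick exhaustive check of that simplification over the possible range of the correction term, as a minimal sketch:

#include <assert.h>

/* sketch: for dividends in [0, 6], (3 * x) / 8 agrees with x / 3 */
static void check_small_div3 (void)
{
    for (int x = 0; x <= 6; x++) {
        assert ((3 * x) / 8 == x / 3);
    }
}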

The code below shows my current work-in-progress. Using Compiler Explorer to check the code generated for various platforms shows the tight code I would expect (when compiled with -O3).

However, in timing the code on my Ivy Bridge x86_64 machine using the Intel 13.x compiler, a flaw became apparent: while my code improves latency (from 18 cycles to 15 cycles for uint64_t data) compared to the simple version, throughput worsens (from one result every 6.8 cycles to one result every 8.5 cycles for uint64_t data). Looking at the assembly code more closely it is quite apparent why that is: I basically managed to take the code down from roughly three-way parallelism to roughly two-way parallelism.

Is there a generically applicable optimization technique, beneficial on common processors in particular all flavors of x86 and ARM as well as GPUs, that preserves more parallelism? Alternatively, is there an optimization technique that further reduces overall operation count to make up for reduced parallelism? The computation of the correction term (tail in the code below) seems like a good target. The simplification (carry_mod_3 + sum_mod_3) / 2 looked enticing but delivers an incorrect result for one of the nine possible combinations.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define BENCHMARK           (1)
#define SIMPLE_COMPUTATION  (0)

#if BENCHMARK
#define T uint64_t
#else // !BENCHMARK
#define T uint8_t
#endif // BENCHMARK

T average_of_3 (T a, T b, T c) 
{
    T avg;

#if SIMPLE_COMPUTATION
    avg = a / 3 + b / 3 + c / 3 + (a % 3 + b % 3 + c % 3) / 3;
#else // !SIMPLE_COMPUTATION
    /* carry save adder */
    T a_xor_b = a ^ b;
    T sum = a_xor_b ^ c;
    T carry = (a_xor_b & c) | (a & b);
    /* here 2 * carry + sum = a + b + c */
    T sum_div_3 = (sum / 3);                                   // {MUL|MULHI}, SHR
    T sum_mod_3 = (sum + sum_div_3) & 3;                       // ADD, AND

    if (sizeof (size_t) == sizeof (T)) { // "native precision" (well, not always)
        T two_carry_div_3 = (carry / 3) * 2;                   // MULHI, ANDN
        T two_carry_mod_3 = (2 * carry + two_carry_div_3) & 6; // SCALE_ADD, AND
        T head = two_carry_div_3 + sum_div_3;                  // ADD
        T tail = (3 * (two_carry_mod_3 + sum_mod_3)) / 8;      // ADD, SCALE_ADD, SHR
        avg = head + tail;                                     // ADD
    } else {
        T carry_div_3 = (carry / 3);                           // MUL, SHR
        T carry_mod_3 = (carry + carry_div_3) & 3;             // ADD, AND
        T head = (2 * carry_div_3 + sum_div_3);                // SCALE_ADD
        T tail = (3 * (2 * carry_mod_3 + sum_mod_3)) / 8;      // SCALE_ADD, SCALE_ADD, SHR
        avg = head + tail;                                     // ADD
    }
#endif // SIMPLE_COMPUTATION
    return avg;
}

#if !BENCHMARK
/* Test correctness on 8-bit data exhaustively. Should catch most errors */
int main (void)
{
    T a, b, c, res, ref;
    a = 0;
    do {
        b = 0;
        do {
            c = 0;
            do {
                res = average_of_3 (a, b, c);
                ref = ((uint64_t)a + (uint64_t)b + (uint64_t)c) / 3;
                if (res != ref) {
                    printf ("a=%08x  b=%08x  c=%08x  res=%08x  ref=%08x\n", 
                            a, b, c, res, ref);
                    return EXIT_FAILURE;
                }
                c++;
            } while (c);
            b++;
        } while (b);
        a++;
    } while (a);
    return EXIT_SUCCESS;
}

#else // BENCHMARK

#include <math.h>

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

#define N  (3000000)
int main (void)
{
    double start, stop, elapsed = INFINITY;
    int i, k;
    T a, b;
    T avg0  = 0xffffffff,  avg1 = 0xfffffffe;
    T avg2  = 0xfffffffd,  avg3 = 0xfffffffc;
    T avg4  = 0xfffffffb,  avg5 = 0xfffffffa;
    T avg6  = 0xfffffff9,  avg7 = 0xfffffff8;
    T avg8  = 0xfffffff7,  avg9 = 0xfffffff6;
    T avg10 = 0xfffffff5, avg11 = 0xfffffff4;
    T avg12 = 0xfffffff3, avg13 = 0xfffffff2;
    T avg14 = 0xfffffff1, avg15 = 0xfffffff0;

    a = 0x31415926;
    b = 0x27182818;
    avg0 = average_of_3 (a, b, avg0);
    for (k = 0; k < 5; k++) {
        start = second();
        for (i = 0; i < N; i++) {
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            avg0 = average_of_3 (a, b, avg0);
            b = (b + avg0) ^ a;
            a = (a ^ b) + avg0;
        }
        stop = second();
        elapsed = fmin (stop - start, elapsed);
    }
    printf ("a=%016llx b=%016llx avg=%016llx", 
            (uint64_t)a, (uint64_t)b, (uint64_t)avg0);
    printf ("\rlatency:    each average_of_3() took  %.6e seconds\n", 
            elapsed / 16 / N);


    a = 0x31415926;
    b = 0x27182818;
    avg0 = average_of_3 (a, b, avg0);
    for (k = 0; k < 5; k++) {
        start = second();
        for (i = 0; i < N; i++) {
            avg0  = average_of_3 (a, b, avg0);
            avg1  = average_of_3 (a, b, avg1);
            avg2  = average_of_3 (a, b, avg2);
            avg3  = average_of_3 (a, b, avg3);
            avg4  = average_of_3 (a, b, avg4);
            avg5  = average_of_3 (a, b, avg5);
            avg6  = average_of_3 (a, b, avg6);
            avg7  = average_of_3 (a, b, avg7);
            avg8  = average_of_3 (a, b, avg8);
            avg9  = average_of_3 (a, b, avg9);
            avg10 = average_of_3 (a, b, avg10);
            avg11 = average_of_3 (a, b, avg11);
            avg12 = average_of_3 (a, b, avg12);
            avg13 = average_of_3 (a, b, avg13);
            avg14 = average_of_3 (a, b, avg14);
            avg15 = average_of_3 (a, b, avg15);
            b = (b + avg0) ^ a;
            a = (a ^ b) + avg0;
        }
        stop = second();
        elapsed = fmin (stop - start, elapsed);
    }
    printf ("a=%016llx b=%016llx avg=%016llx", (uint64_t)a, (uint64_t)b, 
            (uint64_t)(avg0 + avg1 + avg2 + avg3 + avg4 + avg5 + avg6 + avg7 + 
                       avg8 + avg9 +avg10 +avg11 +avg12 +avg13 +avg14 +avg15));
    printf ("\rthroughput: each average_of_3() took  %.6e seconds\n", 
            elapsed / 16 / N);

    return EXIT_SUCCESS;
}

#endif // BENCHMARK
njuffa
  • I don't think your `carry = (a_xor_b & c) | (a & b)` catches `a&b&c`. And why would (4/3 `sum`)%4 equal `sum`%3? – greybeard Oct 27 '20 at 22:31
  • It might be best to edit those test programs into the question; they're long enough to become a mess in SO comments. Perhaps link a short version that tests all combos of low 3 bits of each input, and of their MSBs, maybe on https://tio.run/ or https://godbolt.org/, since Godbolt has a compile-and-run option these days. – Peter Cordes Oct 28 '20 at 03:58
  • what's the distribution of the inputs? – Sopel Oct 28 '20 at 17:04
  • @Sopel: Can't make any assumptions about it, so assume uniform random. Branchless solution therefore pretty much required. – njuffa Oct 28 '20 at 19:25
  • @njuffa IMO it is worth to optimize it for the particular architecture and data size. If we consider 64-bit system and 32 bits data (integers are 32 bits on most 64 bits systems) the "naive" approach is the most efficient. https://godbolt.org/z/dh1jrP IMO if we are considering the micro-optimizations and uOptimizations matter it cannot be abstracted from the hardware and implementation. – 0___________ Oct 29 '20 at 22:19
  • I don't understand why, in your benchmarks, you bothered testing anything other than `size_t`. If your integer is any smaller than that, then you can just perform the computation in `size_t` without worrying about overflow at all. – KevinZ Nov 01 '20 at 20:45

7 Answers

16

Let me throw my hat in the ring. Not doing anything too tricky here, I think.

#include <stdint.h>

uint64_t average_of_three(uint64_t a, uint64_t b, uint64_t c) {
  uint64_t hi = (a >> 32) + (b >> 32) + (c >> 32);
  uint64_t lo = hi + (a & 0xffffffff) + (b & 0xffffffff) + (c & 0xffffffff);
  return 0x55555555 * hi + lo / 3;
}

Following the discussion below about different splits, here's a version that saves a multiply at the expense of three bitwise ANDs:

T hi = (a >> 2) + (b >> 2) + (c >> 2);
T lo = (a & 3) + (b & 3) + (c & 3);
avg = hi + (hi + lo) / 3;
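
(For reference, the arithmetic behind both variants: in the first version, a + b + c = 2^32 * hi + (lo − hi) = 3 * 0x55555555 * hi + lo, so ⌊(a + b + c) / 3⌋ = 0x55555555 * hi + ⌊lo / 3⌋, and lo cannot wrap because it is at most 6 * (2^32 − 1). In the second version, a + b + c = 4 * hi + lo = 3 * hi + (hi + lo), so the average is hi + ⌊(hi + lo) / 3⌋.)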
David Eisenstat
  • 2
    Clang does a better job with this than GCC for x86-64. gcc spends more instructions, and does the truncation-to-32-bit with `mov same,same` defeating [mov-elimination](https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all). https://godbolt.org/z/ejYxvT (e.g. `mov esi,esi` to zero-extend ESI into RSI still costs a back-end uop where `mov r8d, esi` doesn't, which probably matters for the OP's Ivy Bridge CPU (front end wider than the back end) moreso than for Haswell and later.) And also for latency everywhere if that's on the critical path. – Peter Cordes Oct 31 '20 at 01:23
  • 3
    I like the simplicity of this; seems I got my brain into too many knots. Since I (= OP) am using the Intel compiler this variant actually works better for me on Ivy Bridge than any other variant tried so far, both in terms of latency and throughput. Other than `gcc` on `x86_64` with the weird truncation idioms, also gives tighter code for all compilers I checked on godbolt. Since I need `uint32_t` and `uint64_t` versions of this I generalized like so: `const int shift = ((sizeof(T) * CHAR_BIT) / 2); const T mask = (((T)1) << shift) - ((T)1); const T multiplier = (((T)(~(T)0)) / 3) >> shift;` – njuffa Oct 31 '20 at 04:33
  • @njuffa: for your uint32_t version on a 64-bit machine you'd probably just want to zero-extend all 3 inputs to 64-bit with `(a + (uint64_t)b + c) / 3`. Values in registers are often already known to be zero-extended so this is nearly free, if you have a fast 64-bit multiply. But yes, this is impressively elegant for inputs that are already full register width. – Peter Cordes Oct 31 '20 at 07:05
  • 2
    @njuffa: for 32-bit x86 on IvyBridge and later, a split of lo = 8, hi = 24 bits lets us use `movzx dword, byte` which allows mov-elimination for byte-source movzx. (On Intel but not AMD. Only for byte-source, not word-source. [Same link as my first comment](https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all) for details.) If the math still works, that would save back-end uops and possibly reduce critical-path latency. – Peter Cordes Oct 31 '20 at 07:14
  • Can we zero-extend once *after* adding, and fix up the 2 bits that should just have carry-out from the 2 adds? e.g. `lo = (a+b+c) & 0x3ffffffff - (hi&0x3)<<32;` if that works. That will suck on x86; the 34-bit mask needs a constant to be constructed with `movabs` + `and`, or a `bzhi`. Possibly it could be good on AArch64 with bit-range immediates, or a uint32_t version on ARM with the barrel shifter. Or PowerPC64 with `rlwinm` to extract and shift the low 2 bits of `hi` in one instruction. – Peter Cordes Oct 31 '20 at 07:20
  • @PeterCordes Thanks for pointing out these details. I'll note that the version posted, with my generalizations applied, performs identical in terms of latency (4.07ns each) for `uint64_t` and `uint32_t` and is just a tiny slower for `uint32_t` in terms of throughput (one per 2.6ns vs one per 2.3ns). Given the simplicity of this completely porable algorithm and the good performance across the board "as-is" (about 1.6x the performance of the original "simple" variant), I am currently inclined to leave it as posted. Your pointers may come in handy should further per-platform tweaking be necessary – njuffa Oct 31 '20 at 07:24
  • @njuffa. Sure. My 8 : 24 split is only useful for 32-bit mode on mainstream Intel specifically, not AMD, not Silvermont, not AFAIK ARM. In 64-bit mode, I think you're even better off with `(a + (uint64_t)b + c) / 3` on the same CPUs (because they have fast `mul r64` widening integer multiply). https://godbolt.org/z/7eaKjE. Using a split algorithm is worth considering on a few x86-64 CPUs that have slow 64-bit multiply, e.g. some Silvermont, and Bulldozer-family. I didn't try putting it in IACA or LLVM-MCA to analyze throughput and latency, though. – Peter Cordes Oct 31 '20 at 07:41
  • @njuffa: Oh, the 8:24 split has an advantage on some RISCs where `0x55` fits in a 1-byte immediate for a multiply. Like PowerPC. But on 64-bit RISCs using uint64_t for 32-bit computation is even better. The calling-convention sometimes forces manual zero-extension to 64-bit when they arrive sign-extended. (Which on RISC-V apparently takes 2 instructions, left and right shift, which is pretty bad...) But on AArch64 the uint64_t way is amazingly efficient; it has HW support for 32->64 zero extension of one input as part of a 64-bit operation like add. https://godbolt.org/z/9WWsPc – Peter Cordes Oct 31 '20 at 07:50
  • 1
    @njuffa You could also define `multiplier = mask / 3`. – David Eisenstat Oct 31 '20 at 12:36
6

I'm not sure if it fits your requirements, but maybe it works to just calculate the result and then fix up the error from the overflow:

T average_of_3 (T a, T b, T c)
{
    T r = ((T) (a + b + c)) / 3;
    T o = (a > (T) ~b) + ((T) (a + b) > (T) (~c));
    if (o) r += ((T) 0x5555555555555555) << (o - 1);
    T rem = ((T) (a + b + c)) % 3;
    if (rem >= (3 - o)) ++r;
    return r;
}

[EDIT] Here is the best branch-and-compare-less version I can come up with. On my machine, this version actually has slightly higher throughput than njuffa's code. __builtin_add_overflow(x, y, r) is supported by gcc and clang and returns 1 if the sum x + y overflows the type of *r and 0 otherwise, so the calculation of o is equivalent to the portable code in the first version, but at least gcc produces better code with the builtin.

T average_of_3 (T a, T b, T c)
{
    T r = ((T) (a + b + c)) / 3;
    T rem = ((T) (a + b + c)) % 3;
    T dummy;
    T o = __builtin_add_overflow(a, b, &dummy) + __builtin_add_overflow((T) (a + b), c, &dummy);
    r += -((o - 1) & 0xaaaaaaaaaaaaaaab) ^ 0x5555555555555555;
    r += (rem + o + 1) >> 2;
    return r;
}
Falk Hüffner
  • Interesting idea! From what I can tell right now this is functionally correct and pretty efficient. The performance lags my solution by about 15% in terms of throughput (now that I have fixed my test framework) on my Ivy Bridge system: One result every 3.3ns vs one result every 2.84ns. Also, when compiling for ARM at Compiler Explorer I see a couple more instructions than in my solution. If you could manage to trim another two or three instructions we might have a universal winner. I have not looked at the GPU code yet (it's 4am here, time to go to sleep :-) – njuffa Oct 28 '20 at 11:18
  • Probably the compiler would do it anyway, but what about: `return r + (rem >= (3 - o));` – Elliott Oct 28 '20 at 12:03
  • Actually it's not quite correct, we need to do the first += conditional on `o != 0`. I'll think about how to express this more succinctly... – Falk Hüffner Oct 28 '20 at 13:27
  • OK, fixed and added a branchless version. – Falk Hüffner Oct 28 '20 at 16:18
  • @FalkHüffner I am afraid I don't know what `__builtin_add_overflow_p()` does so I can try to replicate it here. What does this compile to? Can this be expressed in a portable fashion? Some explanatory remarks would be helpful. – njuffa Oct 28 '20 at 19:35
5

New answer, new idea. This one's based on the mathematical identity

floor((a+b+c)/3) = floor(x + (a+b+c - 3x)/3)

When does this work with machine integers and unsigned division?
When the difference doesn't wrap, i.e. 0 ≤ a+b+c - 3x ≤ T_MAX.

This definition of x is fast and gets the job done.

T avg3(T a, T b, T c) {
  T x = (a >> 2) + (b >> 2) + (c >> 2);
  return x + (a + b + c - 3 * x) / 3;
}
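
One way to see that the bound always holds: writing each operand as v = 4 * (v >> 2) + (v & 3) gives a + b + c − 3 * x = ((a >> 2) + (a & 3)) + ((b >> 2) + (b & 3)) + ((c >> 2) + (c & 3)), which is nonnegative and at most 3 * (⌊T_MAX / 4⌋ + 3); that stays ≤ T_MAX whenever T is at least five bits wide.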

Weirdly, ICC inserts an extra neg unless I do this:

T avg3(T a, T b, T c) {
  T x = (a >> 2) + (b >> 2) + (c >> 2);
  return x + (a + b + c - (x + x * 2)) / 3;
}

Note that T must be at least five bits wide.

If T is two platform words long, then you can save some double word operations by omitting the low word of x.

Alternative version with worse latency but maybe slightly higher throughput?

T lo = a + b;
T hi = lo < b;
lo += c;
hi += lo < c;
T x = (hi << (sizeof(T) * CHAR_BIT - 2)) + (lo >> 2);
avg = x + (T)(lo - 3 * x) / 3;
David Eisenstat
  • On x86_64 I measure same throughput for this as for your previous solution, but worse latency (the long dependency chain is apparent in the code). Still, second-best latency of any variant tried. Might even be a win for latency on platforms with slower multipliers since only one multiply is needed. Nice solution, I am glad I placed the bounty. – njuffa Oct 31 '20 at 18:51
  • @njuffa Throughput is a little better for me with Clang on my late model MacBook (Intel(R) Core(TM) i7-1060NG7 CPU @ 1.20GHz). – David Eisenstat Oct 31 '20 at 19:23
  • I don't think you meant `/3.` (`double` constant as the divisor). You only seem to be using integer division in your actual implementations. (I wondered for a minute if Ice Lake's improved divider might be making the difference, but that's only for integer `div` / `idiv` which nobody's using, except for naive uint64_t division in 32-bit mode when GCC calls `__udivdi3`. Even if you had used `divsd`, it would have the same latency and throughput as Skylake, and similar to Ice Lake. https://uops.info/) – Peter Cordes Nov 01 '20 at 00:07
  • @PeterCordes That equality was intended as math (note the missing \*). I added the missing floor operations. – David Eisenstat Nov 01 '20 at 00:09
  • (That alternative version would have been close enough to my shot from the hip. Close to [Falk Hüffner's answer](https://stackoverflow.com/a/64571469/3789665), too.) – greybeard Nov 01 '20 at 14:55
5

I answered the question you linked to already, so I am only answering the part that is different about this one: performance.

If you really cared about performance, then the answer is:

( a + b + c ) / 3

Since you cared about performance, you should have an intuition about the size of data you are working with. You should not have worried about overflow on addition (multiplication is another matter) of only 3 values, because if your data is already big enough to use the high bits of your chosen data type, you are in danger of overflow anyway and should have used a larger integer type. If you are overflowing on uint64_t, then you should really ask yourself why exactly you need to count accurately up to 18 quintillion, and perhaps consider using float or double.
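
For example, if the inputs are known to fit in 32 bits and a 64-bit type is available, a minimal sketch of that approach (function name for exposition only):

#include <stdint.h>

/* sketch: widen to 64 bits so the sum of three 32-bit values cannot
   overflow, then divide */
static uint32_t average_of_3_widened (uint32_t a, uint32_t b, uint32_t c)
{
    return (uint32_t)(((uint64_t)a + b + c) / 3);
}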

Now, having said all that, I will give you my actual reply: It doesn't matter. The question doesn't come up in real life and when it does, perf doesn't matter.

It could be a real performance question if you are doing it a million times in SIMD, because there, you are really incentivized to use integers of smaller width and you may need that last bit of headroom, but that wasn't your question.

KevinZ
  • 2
    If you're paranoid, you could implement this with x86 `add rdi, rsi` / `jc safe_version` / `add rdi, rdx` / `jc safe_version` / ... to detect wraparound. On modern Intel CPUs, those add/jcc instructions will macro-fuse into add-and-branch uops. Still less efficient than just adding without checking, but makes a good optimistic fast-path if you expect the safe version to never or very rarely be needed. But yeah, most of the time this is the right answer, and the question is an XY problem. If you don't already have uint64_t, zero-extend your inputs to 64-bit for this if necessary. – Peter Cordes Nov 01 '20 at 00:18
3

I suspect SIMPLE is defeating the throughput benchmark by CSEing and hoisting a/3+b/3 and a%3+b%3 out of the loop, reusing those results for all 16 avg0..15 results.

(The SIMPLE version can hoist much more of the work than the tricky version; really just a ^ b and a & b in that version.)

Forcing the function to not inline introduces more front-end overhead, but does make your version win, as we expect it should on a CPU with deep out-of-order execution buffers to overlap independent work. There's lots of ILP to find across iterations, for the throughput benchmark. (I didn't look closely at the asm for the non-inline version.)

https://godbolt.org/z/j95qn3 (using __attribute__((noinline)) with clang -O3 -march=skylake on Godbolt's SKX CPUs) shows 2.58 nanosec throughput for the simple way, 2.48 nanosec throughput for your way. vs. 1.17 nanosec throughput with inlining for the simple version.

-march=skylake allows mulx for more flexible full-multiply, but otherwise no benefit from BMI2. andn isn't used; the line you commented with mulhi / andn is mulx into RCX / and rcx, -2 which only requires a sign-extended immediate.


Another way to do this without forcing call/ret overhead would be inline asm like in Preventing compiler optimizations while benchmarking (Chandler Carruth's CppCon talk has some example of how he uses a couple wrappers), or Google Benchmark's benchmark::DoNotOptimize.

Specifically, GNU C asm("" : "+r"(a), "+r"(b)) between each avgX = average_of_3 (a, b, avgX); statement will make the compiler forget everything it knows about the values of a and b, while keeping them in registers.

My answer on I don't understand the definition of DoNotOptimizeAway goes into more detail about using a read-only "r" register constraint to force the compiler to materialize a result in a register, vs. "+r" to make it assume the value has been modified.

If you understand GNU C inline asm well, it may be easier to roll your own in ways that you know exactly what they do.
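
For example, a minimal sketch of what that could look like dropped into the question's throughput loop (GNU C only, i.e. GCC or Clang):

for (i = 0; i < N; i++) {
    avg0 = average_of_3 (a, b, avg0);
    asm("" : "+r"(a), "+r"(b));   /* compiler must assume a and b changed */
    avg1 = average_of_3 (a, b, avg1);
    asm("" : "+r"(a), "+r"(b));
    /* ... likewise between the remaining avg2 .. avg15 calls ... */
    b = (b + avg0) ^ a;
    a = (a ^ b) + avg0;
}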

Peter Cordes
  • Good point about the hoisting in the throughput benchmark, I should definitely check that. Thanks. If it occurs, it should occur for both code variants but possible to different degrees? Let me think about re-crafting the throughput portion of the benchmarking scaffold. I am a multi-architecture kind of guy and annotated with "conceptual" generic instruction names. Where I annotated MULHI, ANDN the ARM compilers use BIC which is just another name for ANDN. Since x86 does not have ANDN, compilers invert the immediate (as you note) which achieves the same effect. – njuffa Oct 28 '20 at 05:16
  • 1
    @njuffa: Yes, looked to me like a lot more of the work can get hoisted for the simple than the complex. x86 BMI1 *does* have [`andn`](https://www.felixcloutier.com/x86/andn), but it doesn't take an immediate. – Peter Cordes Oct 28 '20 at 05:18
  • Making 'a' and 'b' `volatile` seems to do the job just fine from carefully looking through the generated asm code. The only thing I see hoisted is the loading of the "magic multiplier" for the division-by-3 replacement sequence. I feel a little bit stupid right now because generally I *am* aware of hoisting issues in benchmark loops. I guess I was too exhausted today to notice, thanks for getting me back on the correct path! I will a wait couple of days to see whether any other answers come in (saving a few more instructions couldn't hurt :-) – njuffa Oct 28 '20 at 06:02
  • 1
    Oh, yes, `volatile` vars can work to force a reload from memory every time they're referenced. Another way is to use GNU C `asm("" : "+r"(a))` between each `avg4 = average_of_3 (a, b, avg4);` so `a` can stay in a register, but the compiler forgets everything it knows about its value. Various versions of that trick exist, including Google Benchmark's `Benchmark::DoNotOptimize`, and Chandler Carruth's `escape` and `clobber` wrappers mentioned in [Preventing compiler optimizations while benchmarking](https://stackoverflow.com/q/40122141) – Peter Cordes Oct 28 '20 at 06:03
  • @njuffa: If you expect your inputs to *usually* be small, simply accumulating into an `unsigned __int128` might be good, with `add/setc` + `add/adc reg,0`. I played around with that in https://godbolt.org/z/j95qn3, commented out. You can branch on the top half being zero. But if it is non-zero, `u128 / 3` compiles to a call to a helper function that uses plain `div r64` which is *very* slow on Intel. Some kind of trickery may be possible for `x/3 + (1 or 2<<64)/3` with correct rounding, possibly making it not worth branching if we can efficiently do that 66-bit division. – Peter Cordes Oct 28 '20 at 06:17
3

[Falk Hüffner points out in comments that this answer has similarities to his answer. Belatedly looking at his code more closely, I do find some similarities. However, what I posted here is the product of an independent thought process, a continuation of my original idea "reduce three items to two prior to div-mod". I understood Hüffner's approach to be different: "naive computation followed by corrections".]

I have found a better way than the CSA technique in my question to reduce the division and modulo work from three operands to two. First, form the full double-word sum, then apply the division and modulo by 3 to each of the halves separately, and finally combine the results. Since the most significant half can only take the values 0, 1, or 2, computing the quotient and remainder of division by three is trivial. Also, the combination into the final result becomes simpler.

Compared to the non-simple code variant from the question this achieves speedup on all platforms I examined. The quality of the code generated by compilers for the simulated double-word addition varies but is satisfactory overall. Nonetheless it may be worthwhile to code this portion in a non-portable way, e.g. with inline assembly.

T average_of_3_hilo (T a, T b, T c) 
{
    const T fives = (((T)(~(T)0)) / 3); // 0x5555...
    T avg, hi, lo, lo_div_3, lo_mod_3, hi_div_3, hi_mod_3; 
    /* compute the full sum a + b + c into the operand pair hi:lo */
    lo = a + b;
    hi = lo < a;
    lo = c + lo;
    hi = hi + (lo < c);
    /* determine quotient and remainder of each half separately */
    lo_div_3 = lo / 3;
    lo_mod_3 = (lo + lo_div_3) & 3;
    hi_div_3 = hi * fives;
    hi_mod_3 = hi;
    /* combine partial results into the division result for the full sum */
    avg = lo_div_3 + hi_div_3 + ((lo_mod_3 + hi_mod_3 + 1) / 4);
    return avg;
}
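
(For reference, the combine step is correct because 2^64 = 3 * fives + 1 for T = uint64_t, and analogously for uint32_t, so the high word contributes hi * fives to the quotient and hi to the remainder; with lo_mod_3 and hi_mod_3 both at most 2, the carry into the quotient is ⌊(lo_mod_3 + hi_mod_3) / 3⌋, which equals (lo_mod_3 + hi_mod_3 + 1) / 4 for every possible sum in [0, 4].)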
njuffa
  • This seems pretty much identical to the version I posted, with the only difference being that I leave it to the compiler to calculate `lo_mod_3` cleverly, and I'm avoiding a multiplication in calculating `hi_div_3`. – Falk Hüffner Oct 29 '20 at 18:31
  • @FalkHüffner I am in the habit of carefully crediting any known influences on the answers I post to Stackoverflow (and my code in general). I have added a note at the top of the answer to preserve the content of your comment to acknowledge similarities. Is this satisfactory to you? – njuffa Oct 29 '20 at 19:35
  • Sure, no problem! I did not describe my code very well so I can see how this happened. – Falk Hüffner Oct 29 '20 at 20:09
  • `hi` can be `uint32_t` or narrower, even for `T = uint64_t`. Any 32-bit system would benefit if the compiler doesn't already spot all the cases where it can avoid updating an always-zero high half. Looks like `hi_div_3` does still need to be `T hi_div_3`, but others can be as narrow as `uint8_t`, including `hi_mod_3 + 1` before adding to `lo_mod_3`. Compilers aren't always good at keeping variables narrow so this might be worth manual attention. I don't expect a practical different for 64-bit builds on IvyBridge, though; just code-size there. – Peter Cordes Oct 29 '20 at 22:09
  • https://godbolt.org/z/5oKerG shows the difference with gcc -m32 -O3 -march=ivybridge. clang does narrow `hi` for you with `-m32`, same asm, but GCC doesn't. Of course, `call __udivdi3` is a performance disaster, so compiler output is not very usable for 32-bit mode. Probably extended-precision multiply would still be a win for division by a constant, but it's a lot of code to inline so compilers don't do it for you. – Peter Cordes Oct 29 '20 at 22:23
  • @PeterCordes Previous discussions with compiler engineers (multiple engineers over several years) have convinced me that to first order, integers want to be `int` unless there is a good reason for them to be anything else. This pertains in particular to signed-ness and size, with register-sized preferred for the latter. Over the years since then, I have found that advice to be very much applicable. It has also been pointed out to me that range tracking for integer variables is not something that compiler folks think is worthwhile (although it would often lead to better code, I boldly claim :-) – njuffa Oct 29 '20 at 22:25
  • I suggested an unsigned type for `hi` because zero-extending to 64-bit in software is typically cheaper than sign-extending. e.g. just `adc reg, 0` for the high half. Compilers do sometimes do some value-range analysis, e.g. clang `-m32` appears to be doing so to avoid implementing the upper half of `hi` even when it's declared as `T` (`unsigned long long`). – Peter Cordes Oct 29 '20 at 22:33
  • @PeterCordes I am not following, so presumably missing a relevant point. `T` is an unsigned integer type in the context of this question (for my current practical needs: either `uint32_t` or `uint64_t`), so certainly `hi` is an unsigned integer. I am generally aware of the benefits of zero extending vs sign extending. – njuffa Oct 29 '20 at 22:41
  • You said "integers want to be `int`", which is generally true, but in this case `unsigned hi` is probably cheaper than `int hi`, for `uint64_t lo` depending on how smart the compiler is. I'd been talking about making `hi` a narrower type to save code-size on x86-64, and save instructions on 32-bit builds with compilers that aren't as smart. – Peter Cordes Oct 29 '20 at 22:44
  • @PeterCordes Even if integers *generally* want to be `int`, in this case we cannot have that as use of unsigned integers is already a given per the interface. I take your point that, apparently contrary to my previous knowledge acquired from compiler folks, (some?) current x86 compilers *do* (at times, always?) generate better code if a narrower type is used. Thanks for pointing that out and sorry that I did not catch on to the salient point immediately. – njuffa Oct 29 '20 at 22:56
1

An experimental build of GCC-11 compiles the obvious naive function to something like:

uint32_t avg3t (uint32_t a, uint32_t b, uint32_t c) {
    a += b;
    b = a < b;
    a += c;
    b += a < c;

    b = b + a;
    b += b < a;
    return (a - (b % 3)) * 0xaaaaaaab;
}

This is similar to some of the other answers posted here. Any explanation of how these solutions work would be welcome (not sure of the netiquette here).
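
One way to read it: the first four statements leave the exact sum of the three inputs in the pair b:a, with b in {0, 1, 2}. Because 2^32 ≡ 1 (mod 3), the end-around-carry step b = b + a; b += b < a; produces a value congruent to the full sum modulo 3, so b % 3 is the remainder of the original a + b + c. Subtracting that remainder from the low word a gives a value congruent to three times the quotient modulo 2^32, and multiplying by 0xaaaaaaab, the modular inverse of 3 modulo 2^32, recovers the quotient, which always fits in 32 bits.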

aqrit