Efficiently dividing unsigned value by a power of two, rounding up

Question

I want to implement unsigned^a integer division by an arbitrary power of two, rounding up, efficiently. So what I want, mathematically, is ceiling(p/q)⁰. In C, the strawman implementation, which doesn't take advantage of the restricted domain of q could something like the following function¹:

/** q must be a power of 2, although this version works for any q */
uint64_t divide(uint64_t p, uint64_t q) {
  uint64_t res = p / q;
  return p % q == 0 ? res : res + 1;
}

... of course, I don't actually want to use division or mod at the machine level, since that takes many cycles even on modern hardware. I'm looking for a strength reduction that uses shifts and/or some other cheap operation(s) - taking advantage of the fact that q is a power of 2.

You can assume we have an efficient lg(unsigned int x) function, which returns the base-2 log of x, if x is a power-of-two.

Undefined behavior is fine if q is zero.

Please note that the simple solution: (p+q-1) >> lg(q) doesn't work in general - try it with p == 2^64-100 and q == 256² for example.

Platform Details

I'm interested in solutions in C, that are likely to perform well across a variety of platforms, but for the sake of concreteness, awarding the bounty and because any definitive discussion of performance needs to include a target architecture, I'll be specific about how I'll test them:

Skylake CPU
gcc 5.4.0 with compile flags -O3 -march=haswell

Using gcc builtins (such as bitscan/leading zero builtins) is fine, and in particular I've implemented the lg() function I said was available as follows:

inline uint64_t lg(uint64_t x) {
  return 63U - (uint64_t)__builtin_clzl(x);
}

inline uint32_t lg32(uint32_t x) {
  return 31U - (uint32_t)__builtin_clz(x);
}

I verified that these compile down to a single bsr instruction, at least with -march=haswell, despite the apparent involvement of a subtraction. You are of course free to ignore these and use whatever other builtins you want in your solution.

Benchmark

I wrote a benchmark for the existing answers, and will share and update the results as changes are made.

Writing a good benchmark for a small, potentially inlined operation is quite tough. When code is inlined into a call site, a lot of the work of the function may disappear, especially when it's in a loop³.

You could simply avoid the whole inlining problem by ensuring your code isn't inlined: declare it in another compilation unit. I tried to that with the bench binary, but really the results are fairly pointless. Nearly all implementations tied at 4 or 5 cycles per call, but even a dummy method that does nothing other than return 0 takes the same time. So you are mostly just measuring the call + ret overhead. Furthermore, you are almost never really going to use the functions like this - unless you messed up, they'll be available for inlining and that changes everything.

So the two benchmarks I'll focus the most on repeatedly call the method under test in a loop, allowing inlining, cross-function optmization, loop hoisting and even vectorization.

There are two overall benchmark types: latency and throughput. The key difference is that in the latency benchmark, each call to divide is dependent on the previous call, so in general calls cannot be easily overlapped⁴:

uint32_t bench_divide_latency(uint32_t p, uint32_t q) {
    uint32_t total = p;                         
    for (unsigned i=0; i < ITERS; i++) {                
      total += divide_algo(total, q);                       
      q = rotl1(q);                         
    }
    return total;
  }

Note that the running total depends so on the output of each divide call, and that it is also an input to the divide call.

The throughput variant, on the other hand, doesn't feed the output of one divide into the subsequent one. This allows work from one call to be overlapped with a subsequent one (both by the compiler, but especially the CPU), and even allows vectorization:

uint32_t bench_divide_throughput(uint32_t p, uint32_t q) { 
    uint32_t total = p;                         
    for (unsigned i=0; i < ITERS; i++) {                
      total += fname(i, q);                     
      q = rotl1(q);                     
    }                                   
    return total;                           
  }

Note that here we feed in the loop counter as the the dividend - this is variable, but it doesn't depend on the previous divide call.

Furthermore, each benchmark has three flavors of behavior for the divisor, q:

Compile-time constant divisor. For example, a call to divide(p, 8). This is common in practice, and the code can be much simpler when the divisor is known at compile time.
Invariant divisor. Here the divisor is not know at compile time, but is constant for the whole benchmarking loop. This allows a subset of the optimizations that the compile-time constant does.
Variable divisor. The divisor changes on each iteration of the loop. The benchmark functions above show this variant, using a "rotate left 1" instruction to vary the divisor.

Combining everything you get a total of 6 distinct benchmarks.

Results

Overall

For the purposes of picking an overall best algorithm, I looked at each of 12 subsets for the proposed algorithms: (latency, throughput) x (constant a, invariant q, variable q) x (32-bit, 64-bit) and assigned a score of 2, 1, or 0 per subtest as follows:

The best algorithm(s) (within 5% tolerance) receive a score of 2.
The "close enough" algorithms (no more than 50% slower than the best) receive a score of 1.
The remaining algorithms score zero.

Hence, the maximum total score is 24, but no algorithm achieved that. Here are the overall total results:

╔═══════════════════════╦═══════╗
║       Algorithm       ║ Score ║
╠═══════════════════════╬═══════╣
║ divide_user23_variant ║    20 ║
║ divide_chux           ║    20 ║
║ divide_user23         ║    15 ║
║ divide_peter          ║    14 ║
║ divide_chrisdodd      ║    12 ║
║ stoke32               ║    11 ║
║ divide_chris          ║     0 ║
║ divide_weather        ║     0 ║
╚═══════════════════════╩═══════╝

So the for the purposes of this specific test code, with this specific compiler and on this platform, user2357112 "variant" (with ... + (p & mask) != 0) performs best, tied with chux's suggestion (which is in fact identical code).

Here are all the sub-scores which sum to the above:

╔══════════════════════════╦═══════╦════╦════╦════╦════╦════╦════╗
║                          ║ Total ║ LC ║ LI ║ LV ║ TC ║ TI ║ TV ║
╠══════════════════════════╬═══════╬════╬════╬════╬════╬════╬════╣
║ divide_peter             ║     6 ║  1 ║  1 ║  1 ║  1 ║  1 ║  1 ║
║ stoke32                  ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
║ divide_chux              ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_user23            ║     8 ║  1 ║  1 ║  2 ║  2 ║  1 ║  1 ║
║ divide_user23_variant    ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_chrisdodd         ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
║ divide_chris             ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
║ divide_weather           ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
║                          ║       ║    ║    ║    ║    ║    ║    ║
║ 64-bit Algorithm         ║       ║    ║    ║    ║    ║    ║    ║
║ divide_peter_64          ║     8 ║  1 ║  1 ║  1 ║  2 ║  2 ║  1 ║
║ div_stoke_64             ║     5 ║  1 ║  1 ║  2 ║  0 ║  0 ║  1 ║
║ divide_chux_64           ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_user23_64         ║     7 ║  1 ║  1 ║  2 ║  1 ║  1 ║  1 ║
║ divide_user23_variant_64 ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_chrisdodd_64      ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
║ divide_chris_64          ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
║ divide_weather_64        ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
╚══════════════════════════╩═══════╩════╩════╩════╩════╩════╩════╝

Here, each test is named like XY, with X in {Latency, Throughput} and Y in {Constant Q, Invariant Q, Variable Q}. So for example, LC is "Latency test with constant q".

Analysis

At the highest level, the solutions can be roughly divided into two categories: fast (the top 6 finishers) and slow (the bottom two). The difference is larger: all of the fast algorithms were the fastest on at least two subtests and in general when they didn't finish first they fell into the "close enough" category (they only exceptions being failed vectorizations in the case of stoke and chrisdodd). The slow algorithms however scored 0 (not even close) on every test. So you can mostly eliminate the slow algorithms from further consideration.

Auto-vectorization

Among the fast algorithms, a large differentiator was the ability to auto-vectorize.

None of the algorithms were able to auto-vectorize in the latency tests, which makes sense since the latency tests are designed to feed their result directly into the next iteration. So you can really only calculate results in a serial fashion.

For the throughput tests, however, many algorithms were able to auto-vectorize for the constant Q and invariant Q case. In both of these tests tests the divisor q is loop-invariant (and in the former case it is a compile-time constant). The dividend is the loop counter, so it is variable, but predicable (and in particular a vector of dividends can be trivially calculated by adding 8 to the previous input vector: [0, 1, 2, ..., 7] + [8, 8, ..., 8] == [8, 9, 10, ..., 15]).

In this scenario, gcc was able to vectorize peter, stoke, chux, user23 and user23_variant. It wasn't able to vectorize chrisdodd for some reason, likely because it included a branch (but conditionals don't strictly prevent vectorization since many other solutions have conditional elements but still vectorized). The impact was huge: algorithms that vectorized showed about an 8x improvement in throughput over variants that didn't but were otherwise fast.

Vectorization isn't free, though! Here are the function sizes for the "constant" variant of each function, with the Vec? column showing whether a function vectorized or not:

Size Vec? Name
 045    N bench_c_div_stoke_64
 049    N bench_c_divide_chrisdodd_64
 059    N bench_c_stoke32_64
 212    Y bench_c_divide_chux_64
 227    Y bench_c_divide_peter_64
 220    Y bench_c_divide_user23_64
 212    Y bench_c_divide_user23_variant_64

The trend is clear - vectorized functions take about 4x the size of the non-vectorized ones. This is both because the core loops themselves are larger (vector instructions tend to be larger and there are more of them), and because loop setup and especially the post-loop code is much larger: for example, the vectorized version requires a reduction to sum all the partial sums in a vector. The loop count is fixed and a multiple of 8, so no tail code is generated - but if were variable the generated code would be even larger.

Furthermore, despite the large improvement in runtime, gcc's vectorization is actually poor. Here's an excerpt from the vectorized version of Peter's routine:

  on entry: ymm4 == all zeros
  on entry: ymm5 == 0x00000001 0x00000001 0x00000001 ...
  4007a4:       c5 ed 76 c4             vpcmpeqd ymm0,ymm2,ymm4
  4007ad:       c5 fd df c5             vpandn   ymm0,ymm0,ymm5
  4007b1:       c5 dd fa c0             vpsubd   ymm0,ymm4,ymm0
  4007b5:       c5 f5 db c0             vpand    ymm0,ymm1,ymm0

This chunk works independently on 8 DWORD elements originating in ymm2. If we take x to be a single DWORD element of ymm2, and y the incoming value of ymm1 these foud instructions correspond to:

                    x == 0   x != 0
x  = x ? 0 : -1; //     -1        0
x  = x & 1;      //      1        0
x  = 0 - x;      //     -1        0
x  = y1 & x;     //     y1        0

So the first three instructions could simple be replaced by the first one, as the states are identical in either case. So that's two cycles added to that dependency chain (which isn't loop carried) and two extra uops. Evidently gcc's optimization phases somehow interact poorly with the vectorization code here, since such trivial optimizations are rarely missed in scalar code. Examining the other vectorized versions similarly shows a lot of performance dropped on the floor.

Branches vs Branch-free

Nearly all of the solutions compiled to branch-free code, even if C code had conditionals or explicit branches. The conditional portions were small enough that the compiler generally decided to use conditional move or some variant. One exception is chrisdodd which compiled with a branch (checking if p == 0) in all the throughput tests, but none of the latency ones. Here's a typical example from the constant q throughput test:

0000000000400e60 <bench_c_divide_chrisdodd_32>:
  400e60:       89 f8                   mov    eax,edi
  400e62:       ba 01 00 00 00          mov    edx,0x1
  400e67:       eb 0a                   jmp    400e73 <bench_c_divide_chrisdodd_32+0x13>
  400e69:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
  400e70:       83 c2 01                add    edx,0x1
  400e73:       83 fa 01                cmp    edx,0x1
  400e76:       74 f8                   je     400e70 <bench_c_divide_chrisdodd_32+0x10>
  400e78:       8d 4a fe                lea    ecx,[rdx-0x2]
  400e7b:       c1 e9 03                shr    ecx,0x3
  400e7e:       8d 44 08 01             lea    eax,[rax+rcx*1+0x1]
  400e82:       81 fa 00 ca 9a 3b       cmp    edx,0x3b9aca00
  400e88:       75 e6                   jne    400e70 <bench_c_divide_chrisdodd_32+0x10>
  400e8a:       c3                      ret    
  400e8b:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

The branch at 400e76 skips the case that p == 0. In fact, the compiler could have just peeled the first iteration out (calculating its result explicitly) and then avoided the jump entirely since after that it can prove that p != 0. In these tests, the branch is perfectly predictable, which could give an advantage to code that actually compiles using a branch (since the compare & branch code is essentially out of line and close to free), and is a big part of why chrisdodd wins the throughput, variable q case.

Detailed Test Results

Here you can find some detailed test results and some details on the tests themselves.

Latency

The results below test each algorithm over 1e9 iterations. Cycles are calculated simply by multiplying the time/call by the clock frequency. You can generally assume that something like 4.01 is the same as 4.00, but the larger deviations like 5.11 seem to be real and reproducible.

The results for divide_plusq_32 use (p + q - 1) >> lg(q) but are only shown for reference, since this function fails for large p + q. The results for dummy are a very simple function: return p + q, and lets you estimate the benchmark overhead⁵ (the addition itself should take a cycle at most).

==============================
Bench: Compile-time constant Q
==============================
                  Function         ns/call    cycles
           divide_peter_32            2.19      5.67
           divide_peter_64            2.18      5.64
                stoke32_32            1.93      5.00
                stoke32_64            1.97      5.09
              stoke_mul_32            2.75      7.13
              stoke_mul_64            2.34      6.06
              div_stoke_32            1.94      5.03
              div_stoke_64            1.94      5.03
            divide_chux_32            1.55      4.01
            divide_chux_64            1.55      4.01
          divide_user23_32            1.97      5.11
          divide_user23_64            1.93      5.00
  divide_user23_variant_32            1.55      4.01
  divide_user23_variant_64            1.55      4.01
       divide_chrisdodd_32            1.95      5.04
       divide_chrisdodd_64            1.93      5.00
           divide_chris_32            4.63     11.99
           divide_chris_64            4.52     11.72
         divide_weather_32            2.72      7.04
         divide_weather_64            2.78      7.20
           divide_plusq_32            1.16      3.00
           divide_plusq_64            1.16      3.00
           divide_dummy_32            1.16      3.00
           divide_dummy_64            1.16      3.00


==============================
Bench: Invariant Q
==============================
                  Function         ns/call    cycles
           divide_peter_32            2.19      5.67
           divide_peter_64            2.18      5.65
                stoke32_32            1.93      5.00
                stoke32_64            1.93      5.00
              stoke_mul_32            2.73      7.08
              stoke_mul_64            2.34      6.06
              div_stoke_32            1.93      5.00
              div_stoke_64            1.93      5.00
            divide_chux_32            1.55      4.02
            divide_chux_64            1.55      4.02
          divide_user23_32            1.95      5.05
          divide_user23_64            2.00      5.17
  divide_user23_variant_32            1.55      4.02
  divide_user23_variant_64            1.55      4.02
       divide_chrisdodd_32            1.95      5.04
       divide_chrisdodd_64            1.93      4.99
           divide_chris_32            4.60     11.91
           divide_chris_64            4.58     11.85
         divide_weather_32           12.54     32.49
         divide_weather_64           17.51     45.35
           divide_plusq_32            1.16      3.00
           divide_plusq_64            1.16      3.00
           divide_dummy_32            0.39      1.00
           divide_dummy_64            0.39      1.00


==============================
Bench: Variable Q
==============================
                  Function         ns/call    cycles
           divide_peter_32            2.31      5.98
           divide_peter_64            2.26      5.86
                stoke32_32            2.06      5.33
                stoke32_64            1.99      5.16
              stoke_mul_32            2.73      7.06
              stoke_mul_64            2.32      6.00
              div_stoke_32            2.00      5.19
              div_stoke_64            2.00      5.19
            divide_chux_32            2.04      5.28
            divide_chux_64            2.05      5.30
          divide_user23_32            2.05      5.30
          divide_user23_64            2.06      5.33
  divide_user23_variant_32            2.04      5.29
  divide_user23_variant_64            2.05      5.30
       divide_chrisdodd_32            2.04      5.30
       divide_chrisdodd_64            2.05      5.31
           divide_chris_32            4.65     12.04
           divide_chris_64            4.64     12.01
         divide_weather_32           12.46     32.28
         divide_weather_64           19.46     50.40
           divide_plusq_32            1.93      5.00
           divide_plusq_64            1.99      5.16
           divide_dummy_32            0.40      1.05
           divide_dummy_64            0.40      1.04

Throughput

Here are the results for the throughput tests. Note that many of the algorithms here were auto-vectorized, so the performance is relatively very good for those: a fraction of a cycle in many cases. One result is that unlike most latency results, the 64-bit functions are considerably slower, since vectorization is more effective with smaller element sizes (although the gap is larger that I would have expected).

==============================
Bench: Compile-time constant Q
==============================
                  Function         ns/call    cycles
                stoke32_32            0.39      1.00
            divide_chux_32            0.15      0.39
            divide_chux_64            0.53      1.37
          divide_user23_32            0.14      0.36
          divide_user23_64            0.53      1.37
  divide_user23_variant_32            0.15      0.39
  divide_user23_variant_64            0.53      1.37
       divide_chrisdodd_32            1.16      3.00
       divide_chrisdodd_64            1.16      3.00
           divide_chris_32            4.34     11.23
           divide_chris_64            4.34     11.24
         divide_weather_32            1.35      3.50
         divide_weather_64            1.35      3.50
           divide_plusq_32            0.10      0.26
           divide_plusq_64            0.39      1.00
           divide_dummy_32            0.08      0.20
           divide_dummy_64            0.39      1.00


==============================
Bench: Invariant Q
==============================
                  Function         ns/call    cycles
                stoke32_32            0.48      1.25
            divide_chux_32            0.15      0.39
            divide_chux_64            0.48      1.25
          divide_user23_32            0.17      0.43
          divide_user23_64            0.58      1.50
  divide_user23_variant_32            0.15      0.38
  divide_user23_variant_64            0.48      1.25
       divide_chrisdodd_32            1.16      3.00
       divide_chrisdodd_64            1.16      3.00
           divide_chris_32            4.35     11.26
           divide_chris_64            4.36     11.28
         divide_weather_32            5.79     14.99
         divide_weather_64           17.00     44.02
           divide_plusq_32            0.12      0.31
           divide_plusq_64            0.48      1.25
           divide_dummy_32            0.09      0.23
           divide_dummy_64            0.09      0.23


==============================
Bench: Variable Q
==============================
                  Function         ns/call    cycles
                stoke32_32            1.16      3.00
            divide_chux_32            1.36      3.51
            divide_chux_64            1.35      3.50
          divide_user23_32            1.54      4.00
          divide_user23_64            1.54      4.00
  divide_user23_variant_32            1.36      3.51
  divide_user23_variant_64            1.55      4.01
       divide_chrisdodd_32            1.16      3.00
       divide_chrisdodd_64            1.16      3.00
           divide_chris_32            4.02     10.41
           divide_chris_64            3.84      9.95
         divide_weather_32            5.40     13.98
         divide_weather_64           19.04     49.30
           divide_plusq_32            1.03      2.66
           divide_plusq_64            1.03      2.68
           divide_dummy_32            0.63      1.63
           divide_dummy_64            0.66      1.71

^a At least by specifying unsigned we avoid the whole can of worms related to the right-shift behavior of signed integers in C and C++.

⁰ Of course, this notation doesn't actually work in C where / truncates the result so the ceiling does nothing. So consider that pseudo-notation rather than straight C.

¹ I'm also interested solutions where all types are uint32_t rather than uint64_t.

² In general, any p and q where p + q >= 2^64 causes an issue, due to overflow.

³ That said, the function should be in a loop, because the performance of a microscopic function that takes half a dozen cycles only really matters if it is called in a fairly tight loop.

⁴ This is a bit of a simplification - only the dividend p is dependent on the output of the previous iteration, so some work related to processing of q can still be overlapped.

⁵ Use such estimates with caution however - overhead isn't simply additive. If the overhead shows up as 4 cycles and some function f takes 5, it's likely not accurate to say the cost of the real work in f is 5 - 4 == 1, because of the way execution is overlapped.

Considering that 0 isn't a power of 2, `q == 0` violates the preconditions anyway. — user2357112, Nov 04 '16 at 21:01
`t = (p >> lg(q)); t + (((p - (t << (p >> lg(q)))) >= (lg(q) / 2)) ? 1 : 0)` Maybe could even be a bit better with bitmasking. Basically you just compute the fractional remainder and see if is >= 1/2 — Norbert Lange, Nov 04 '16 at 21:02
`0xC000000000000000` isn't a power of 2 either, though. Is `q` required to be a power of 2 or not? Your example doesn't seem consistent. — user2357112, Nov 04 '16 at 21:03
@user2357112 - indeed, but I'm making it explicit that it's OK to trigger _UB_ in this case. Other functions are nicer - if you violate the preconditions, you might expect an exception, an error code, etc. So the distinction may be important. — BeeOnRope, Nov 04 '16 at 21:04
Might be slow: `while(q>1){ p=(p+1)>>1; q=q>>1;}` should give something close. — Lutz Lehmann, Nov 04 '16 at 21:05
@NorbertLange - I feel like you didn't real the whole question :). Furthermore, this trivially fails for `3 / 2` : `(3 >> lq(2)) == (3 >> 1) == 1`. The correct answer, of course is 2. — BeeOnRope, Nov 04 '16 at 21:05
@WeatherVane - note I'm asking for _rounding up_, i.e., towards positive infinity. Your method rounds nearest or something like that. I'll edit the question to make it clearer. — BeeOnRope, Nov 04 '16 at 21:06
It's kind of annoying how `q == 1` seems to require special handling. I can't see a good way to make the round-up logic work for that case without branching. — user2357112, Nov 04 '16 at 21:08
Sorry, I realized that while the text was OK, the demo code was totally broken (of course it implements _truncating_ division)! I've fixed it. — BeeOnRope, Nov 04 '16 at 21:09
@WeatherVane - this fails as i described above. Try it for `1.4e19`. — BeeOnRope, Nov 04 '16 at 21:09
@BeeOnRope: 1.4e19 is a double, and also not even close to a power of 2. — user2357112, Nov 04 '16 at 21:11
@user2357112 - good point about 0xC0...! The example if fixed, check it now. — BeeOnRope, Nov 04 '16 at 21:12
@WeatherVane - sorry for wasting your time. I've fixed the example. FWIW though, `1.4e19` is an unsigned value, is it not? — BeeOnRope, Nov 04 '16 at 21:13
hopefully fixed: `t = (p >> lg(q)); t + (((p - (t << lg(q))) >= (lg(q) / 2)) ? 1 : 0)` — Norbert Lange, Nov 04 '16 at 21:13
Yes, that's true, I took it to mean `unsigned` type, because the first sentence says **"I want to implement unsigned integer division."** And your example code contains only integer types. My apologies for not realising you want floating point values. — Weather Vane, Nov 04 '16 at 21:13
@WeatherVane - as in the question footnote, I should clarify that I'm interested in different sizes of integers too. I think there is a good solution for integers smaller than the machine word size, but I don't see a good one for the type the same as the word size, which is why I used it as the example. — BeeOnRope, Nov 04 '16 at 21:17
My suggestion works for all unsigned integer types, as long as there is no overflow when (divisor - 1) is added. — Weather Vane, Nov 04 '16 at 21:18
@WeatherVane - sorry for even more confusion. _I am interested in only integer types!_ 1.4e19 **is** an integer. It's just easier for me to write that than typing out 140000000000000000000. So the _e_ is only a convenience of notation. It's not a floating point value. In any case, I have removed all such notation from my question. All values are obviously integral now. — BeeOnRope, Nov 04 '16 at 21:18
So please explain why my suggestion fails, if all values are contained within the type specified. — Weather Vane, Nov 04 '16 at 21:20
@NorbertLange - sounds like it might be an answer, more than a comment. — BeeOnRope, Nov 04 '16 at 21:20
@WeatherVane - you already identified the failure yourself: it fails due to overflow. It fails the example I gave in my question. — BeeOnRope, Nov 04 '16 at 21:20
@BeeOnRope Yup erased it before you replied. Lol not thinking right. — mascoj, Nov 04 '16 at 21:23
@BeeOnRope Though I meant to type (p >> lg(q)) + (p & 1), which in your case would've been (2 >> 0) + (2 & 1) = 2 + 0 = 2 — mascoj, Nov 04 '16 at 21:26
@mascoj - it fails for `p=6` and `q=2`, giving 2 rather than 3. — BeeOnRope, Nov 04 '16 at 21:32
@user2357112 - regarding "1.4e19 is a double..." - I was using the notation in in a mathematical sense, not in a "C literal notation" sense. A number can both be an integer (math sense) and a `double` in the C sense, of course. You are right it's not a power of 2. My example and notation was misguided. I've made it clearer now. — BeeOnRope, Nov 04 '16 at 21:42
Bounty clarification: you mean gcc5.4 with `-O3 -march=skylake`, right? So BMI2 is available, and the code doesn't have to run on earlier CPUs. And I assume you're ruling out inline asm, because that's good for winning now, but bad for long-term maintainability and defeats constant propagation. (https://gcc.gnu.org/wiki/DontUseInlineAsm) — Peter Cordes, Nov 07 '16 at 12:38
Is fails-for-`dividend==0` at all interesting to you? Unless there's some totally different way to approach this, QuestionC's version is probably faster than anything else (5 insns with 5c latency) and works for all unsigned values other than 0. (Or correctly for *all* signed values, I think.) — Peter Cordes, Nov 07 '16 at 12:45
(when I said "the code doesn't have to run on earlier CPUs", I just meant the binary. Of course the source should be compilable without any `-march`.) — Peter Cordes, Nov 07 '16 at 12:48
Yes, I want `dividend==0` to work correctly. Still I'm interested also interested in special case solutions, e.g., where some values don't work correctly, if they are a notable improvement over the general version. For the purposes of the bounty though, it should work correctly for all inputs. For the bounty I'm using `-03 -march=haswell` and I am running it on Skylake. Stick to C and gcc builtins for the bounty, but of course I'm also interested in asm versions that beat the C equivalent. The accepted answer should be all-around good and may be different than the bounty winner! If — BeeOnRope, Nov 07 '16 at 15:30
About special case solutions - they should have a _reasonable_ restricted set of values that don't work correctly, to be usable! For example, not working when `dividend == 0` would qualify as _reasonably_ restricted, as would when `(p + q - 1) > UINT_MAX` (even though the latter covers _a lot_ of values, they are restricted to the upper end of the uint range, and you might well know you don't approach those values in practice). If it fails randomly for values in the middle of the range it's not going to be very helpful. — BeeOnRope, Nov 07 '16 at 15:32
Will your speed-test have `dividend==0` with any significant frequency? If no, branching on it is better than the longer dep chain to do it branchlessly, for algos where it is a special case. Which of latency, throughput, or total uop count will your benchmark be sensitive to? (e.g. summing the results to just measure throughput, or using the output of one as the input to the next to measure latency?) As far as real use cases, it can imagine either might be relevant. And for throughput, fused-domain uop count might be most relevant when this is used as part of a larger loop body. — Peter Cordes, Nov 07 '16 at 16:03
I am benchmarking latency (i.e., dependent back-to-back calls) as well as throughput (independent back-to-back calls). About `dividend == 0` I haven't decided. I didn't think it mattered because all of the good solutions seemed to be branchless. Oddly your godbolt link shows Chris Dodd's ternary as using a branch, but locally for me it doesn't ever have one when inlined (it uses conditional moves instead) not sure why. — BeeOnRope, Nov 07 '16 at 16:15
@BeeOnRope: Writing it with an `if(d==0) return 0` will make it more likely to branch, which is what you want if d==0 is maybe <5% or >95%. Branches in real code are extremely common, and CPUs are quite good at handling them. Branchless gives consistency, but isn't always better on average. This is especially true in cases where the branchless version uses a CMOV (2uops 2c latency on pre-Broadwell), which is the case Linus Torvalds was writing about in [these posts](http://yarchive.net/comp/linux/cmov.html), saying that branchless is not "obviously better"; it depends on the case. — Peter Cordes, Nov 07 '16 at 16:19
`gcc` uses the fact of whether code is written with `if` vs `?:` as a hint to whether the code will be predictable? I thought when optimization was turned on they were more or less equivalent. — BeeOnRope, Nov 07 '16 at 16:25
Are you interested in a SIMD-vectorized version of this? Actually that would probably only be reasonable with AVX512CD for SIMD lzcnt, but that's the only missing piece of the puzzle and you could maybe have vectors of shift-counts instead of vectors of divisors. Or maybe SIMD binary-search for the right shift count, with VPCMPGT and then using the mask result to modify a vector of shift counts for VPSLLVQ, which you use to shift a vector of `1`s, feeding back into the PCMPGT — Peter Cordes, Nov 07 '16 at 16:25
Well, putting an `if(d==0) return 0` at the top of the function makes gcc branch before the tzcnt instead of after. (But it also makes gcc duplicate the `xor eax,eax`, /facepalm). I was thinking of `if(tmp>max) max=tmp;` vs. `max=tmp>max? tmp : max`, where the ternary version always writes to max, but the branch doesn't. That's not what's happening here, so that usual "ternary enables cmov" rule of thumb probably doesn't apply. Just to confirm, when you compile the function stand-alone, does gcc branch? And it only converts it to branchless when inlining? — Peter Cordes, Nov 07 '16 at 16:31
@PeterCordes ... but an optimizing compiler is working at a more abstract level, liberally using the "as if" rule, so the "literal" interpretation of the code like "the ternary version always write to max" don't really matter for good compilers. The compiler is free to change the `if` version into one that always writes, and is free to change the ternary version into one that jumps and only writes on one path. In fact, after optimization they may be represented identically internally. I don't know if the "ternary enables `cmov`" rule worked in the past but I don't see it working much today. — BeeOnRope, Nov 07 '16 at 18:06
@PeterCordes - yes, I confirm that when I confirm that when I compile it standalone, it has the branch. Weather's answer also has a branch standalone, but the remainder seem to favor conditional sets/moves. — BeeOnRope, Nov 07 '16 at 20:28
Your latency benchmark only tests p->result latency. The next `q` is ready early, so tzcnt and whatever can be ready before the previous divide result. If that's what your real use-case is like, then that's good. There are so many free parameters (esp. when you account for vectorizability) that I don't think there's one single best-for-all-cases version. (Although one version can be good for many cases.) — Peter Cordes, Nov 08 '16 at 22:47
Yes, I mention it explicitly in the text (see footnote 4). I also tested a version that makes both `p` and `q` depend on the previous iteration, but I didn't want to post an overwhelming number of results. My use cases are varied, but I did have at least one where both `p` and `q` depending on some earlier division. In the end I chose this way because my "compile time" and "invariant" flavors of the loop aren't consistent with involving `q` in the latency chain. Involving only `q` makes all the variants relevant. — BeeOnRope, Nov 08 '16 at 22:51
It's interesting how the Chris Dodd implementations showed almost no speedup for constant q, but won the throughput comparisons for variable q. — user2357112, Nov 09 '16 at 01:05
@user2357112 - yeah, it's artifact of the vectorization. The vectorizer gave up on vectorizing Chris D's solution, probably because of the implied branch (although that's glossing over a bit of detail since some other answers have an implied branch too, but it still vectorized). Other solutions (the ones with < 2 cycle throughput) all got vectorized for constant and invariant `q`. With the varying `q` (at least the way I varied it), the vectorizer gives up and then Chris code does well. — BeeOnRope, Nov 09 '16 at 01:12
In particular, it does well on throuput for some of the reasons pointed out by @PeterCordes near the start of his post (and in detail later on) - in the latency test it's a bit handicapped by the `p -> result` latency, but in the throughput test this isn't a factor. I added assembly dumps to the git project. Here's the [latency dump](https://github.com/travisdowns/ceiling_dev/blob/master/latency.lst) and the [throughput one](https://github.com/travisdowns/ceiling_dev/blob/master/throughput.lst). — BeeOnRope, Nov 09 '16 at 01:14
Since post is seeking an answer on a few compilers rather than a general C answer, post should be tagged as such. — chux - Reinstate Monica, Nov 10 '16 at 13:20
BTW: Good that you are validating the _functionality_ of proposed solutions over the entire range of valid inputs. Too often various optimizations posts include answers that are _fast_ but functionally incomplete/wrong. — chux - Reinstate Monica, Nov 10 '16 at 13:40
IIRC, there's a special function in gcc to run machine code directly; I also seem to recall that it's man page starts with "Never use this function", but It might be handy here. — SIGSTACKFAULT, Nov 13 '16 at 02:24
@Blacksilver - indeed, most compilers offer some way to integrate "inline" asm (MSVC 64-bit was a notable exception last time I looked, although perhaps they've added that feature by now). It's not often going to be very useful for performance for small *visible* functions since using it effectively disabled a host of compiler optimizations, starting with inlining, that prove very useful. By *visible* I mean that the compiler can see the definition at compile-time (and can inline). For not visible functions, or for large ones (ie, with loops) it might work. Anyway, it's a C question :) — BeeOnRope, Nov 13 '16 at 18:08
@chux - yeah I am a fan of exhaustive validation. I've found several undetected subtle bugs by applying it to code I had otherwise assumed correct. Unfortunately, it's not really feasible much above "40-bits" of input. Here the 32-bit function takes `32 + 5 = 37` bits of input since the second argument only has 2^5 valid values. So I can validate it exhaustively in about 10-20 seconds. The 64-bit function is out of range though! — BeeOnRope, Nov 13 '16 at 18:17
@chux - about the tagging, what tag would you like to see added? To be clear, I was and am interested primarily in a C solution (built-in use OK). I want it to work well across different targets and compilers. As a _practical matter_ I need to define at least one platform to actually measure on, so I chose `gcc` Linux as a pragmatic choice (widely used, free). The question isn't limited to that platform however! If I had the time, hardware and compiler access, I'd test on more platforms and I also welcome test results from the community. — BeeOnRope, Nov 13 '16 at 18:21
If the goal is truly a generic solution, better to not add compiler/platform tags. — chux - Reinstate Monica, Nov 14 '16 at 15:59
Well I definitely don't want a compiler tag on there, bug given the amount of x86 specific analysis, especially Peter's great answer I think it would be doing a disservice for those who are looking at the `x86` to omit. On error above, I sad the functions validate in 10-20 seconds, but I should have said 200-300 seconds! That seems reasonable - there are 137 billion inputs, so each input is taking somewhere around 2 ns for two calls (the reference routine, and the routine being validated) which seems reasonable. Fast even... — BeeOnRope, Nov 14 '16 at 18:15
@user2357112: I wonder if your mask-twiddling trick is the best we can do [on a GPU](http://stackoverflow.com/questions/43564727/efficiently-dividing-unsigned-value-by-a-power-of-two-rounding-up-in-cuda). — einpoklum, Apr 22 '17 at 21:38

Peter Cordes · Accepted Answer · 2016-11-26T22:25:40.647

This answer is about what's ideal in asm; what we'd like to convince the compiler to emit for us. (I'm not suggesting actually using inline asm, except as a point of comparison when benchmarking compiler output. https://gcc.gnu.org/wiki/DontUseInlineAsm).

I did manage to get pretty good asm output from pure C for ceil_div_andmask, see below. (It's worse than a CMOV on Broadwell/Skylake, but probably good on Haswell. Still, the user23/chux version looks even better for both cases.) It's mostly just worth mentioning as one of the few cases where I got the compiler to emit the asm I wanted.

It looks like Chris Dodd's general idea of return ((p-1) >> lg(q)) + 1 with special-case handling for d=0 is one of the best options. I.e. the optimal implementation of it in asm is hard to beat with an optimal implementation of anything else. Chux's (p >> lg(q)) + (bool)(p & (q-1)) also has advantages (like lower latency from p->result), and more CSE when the same q is used for multiple divisions. See below for a latency/throughput analysis of what gcc does with it.

If the same e = lg(q) is reused for multiple dividends, or the same dividend is reused for multiple divisors, different implementations can CSE more of the expression. They can also effectively vectorize with AVX2.

Branches are cheap and very efficient if they predict very well, so branching on d==0 will be best if it's almost never taken. If d==0 is not rare, branchless asm will perform better on average. Ideally we can write something in C that will let gcc make the right choice during profile-guided optimization, and compiles to good asm for either case.

Since the best branchless asm implementations don't add much latency vs. a branchy implementation, branchless is probably the way to go unless the branch would go the same way maybe 99% of the time. This might be likely for branching on p==0, but probably less likely for branching on p & (q-1).

It's hard to guide gcc5.4 into emitting anything optimal. This is my work-in-progress on Godbolt).

I think the optimal sequences for Skylake for this algorithm are as follows. (Shown as stand-alone functions for the AMD64 SysV ABI, but talking about throughput/latency on the assumption that the compiler will emit something similar inlined into a loop, with no RET attached).

Branch on carry from calculating d-1 to detect d==0, instead of a separate test & branch. Reduces the uop count nicely, esp on SnB-family where JC can macro-fuse with SUB.

ceil_div_pjc_branch:
 xor    eax,eax          ; can take this uop off the fast path by adding a separate xor-and-return block, but in reality we want to inline something like this.
 sub    rdi, 1
 jc    .d_was_zero       ; fuses with the sub on SnB-family
 tzcnt  rax, rsi         ; tzcnt rsi,rsi also avoids any false-dep problems, but this illustrates that the q input can be read-only.
 shrx   rax, rdi, rax
 inc    rax
.d_was_zero:
 ret

Fused-domain uops: 5 (not counting ret), and one of them is an xor-zero (no execution unit)
HSW/SKL latency with successful branch prediction:
- (d==0): No data dependency on d or q, breaks the dep chain. (control dependency on d to detect mispredicts and retire the branch).
- (d!=0): q->result: tzcnt+shrx+inc = 5c
- (d!=0): d->result: sub+shrx+inc = 3c
Throughput: probably just bottlenecked on uop throughput

I've tried but failed to get gcc to branch on CF from the subtract, but it always wants to do a separate comparison. I know gcc can be coaxed into branching on CF after subtracting two variables, but maybe this fails if one is a compile-time constant. (IIRC, this typically compiles to a CF test with unsigned vars: foo -= bar; if(foo>bar) carry_detected = 1;)

Branchless with ADC / SBB to handle the d==0 case. Zero-handling adds only one instruction to the critical path (vs. a version with no special handling for d==0), but also converts one other from an INC to a sbb rax, -1 to make CF undo the -= -1. Using a CMOV is cheaper on pre-Broadwell, but takes extra instructions to set it up.

ceil_div_pjc_asm_adc:
 tzcnt  rsi, rsi
 sub    rdi, 1
 adc    rdi, 0          ; d? d-1 : d.  Sets CF=CF
 shrx   rax, rdi, rsi
 sbb    rax, -1         ; result++ if d was non-zero
 ret

Fused-domain uops: 5 (not counting ret) on SKL. 7 on HSW
SKL latency:
- q->result: tzcnt+shrx+sbb = 5c
- d->result: sub+adc+shrx(dep on q begins here)+sbb = 4c
Throughput: TZCNT runs on p1. SBB, ADC, and SHRX only run on p06. So I think we bottleneck on 3 uops for p06 per iteration, making this run at best one iteration per 1.5c.

If q and d become ready at the same time, note that this version can run SUB/ADC in parallel with the 3c latency of TZCNT. If both are coming from the same cache-miss cache line, it's certainly possible. In any case, introducing the dep on q as late as possible in the d->result dependency chain is an advantage.

Getting this from C seems unlikely with gcc5.4. There is an intrinsic for add-with-carry, but gcc makes a total mess of it. It doesn't use immediate operands for ADC or SBB, and stores the carry into an integer reg between every operation. gcc7, clang3.9, and icc17 all make terrible code from this.

#include <x86intrin.h>
// compiles to completely horrible code, putting the flags into integer regs between ops.
T ceil_div_adc(T d, T q) {
  T e = lg(q);
  unsigned long long dm1;  // unsigned __int64
  unsigned char CF = _addcarry_u64(0, d, -1, &dm1);
  CF = _addcarry_u64(CF, 0, dm1, &dm1);
  T shifted = dm1 >> e;
  _subborrow_u64(CF, shifted, -1, &dm1);
  return dm1;
}
    # gcc5.4 -O3 -march=haswell
    mov     rax, -1
    tzcnt   rsi, rsi
    add     rdi, rax
    setc    cl
    xor     edx, edx
    add     cl, -1
    adc     rdi, rdx
    setc    dl
    shrx    rdi, rdi, rsi
    add     dl, -1
    sbb     rax, rdi
    ret

CMOV to fix the whole result: worse latency from q->result, since it's used sooner in the d->result dep chain.

ceil_div_pjc_asm_cmov:
 tzcnt  rsi, rsi
 sub    rdi, 1
 shrx   rax, rdi, rsi
 lea    rax, [rax+1]     ; inc preserving flags
 cmovc  rax, zeroed_register
 ret

Fused-domain uops: 5 (not counting ret) on SKL. 6 on HSW
SKL latency:
- q->result: tzcnt+shrx+lea+cmov = 6c (worse than ADC/SBB by 1c)
- d->result: sub+shrx(dep on q begins here)+lea+cmov = 4c
Throughput: TZCNT runs on p1. LEA is p15. CMOV and SHRX are p06. SUB is p0156. In theory only bottlenecked on fused-domain uop throughput, so one iteration per 1.25c. With lots of independent operations, resource conflicts from SUB or LEA stealing p1 or p06 shouldn't be a throughput problem because at 1 iter per 1.25c, no port is saturated with uops that can only run on that port.

CMOV to get an operand for SUB: I was hoping I could find a way to create an operand for a later instruction that would produce a zero when needed, without an input dependency on q, e, or the SHRX result. This would help if d is ready before q, or at the same time.

This doesn't achieve that goal, and needs an extra 7-byte mov rdx,-1 in the loop.

ceil_div_pjc_asm_cmov:
 tzcnt  rsi, rsi
 mov    rdx, -1
 sub    rdi, 1
 shrx   rax, rdi, rsi
 cmovnc rdx, rax
 sub    rax, rdx       ; res += d ? 1 : -res
 ret

Lower-latency version for pre-BDW CPUs with expensive CMOV, using SETCC to create a mask for AND.

ceil_div_pjc_asm_setcc:
 xor    edx, edx        ; needed every iteration

 tzcnt  rsi, rsi
 sub    rdi, 1

 setc   dl              ; d!=0 ?  0 : 1
 dec    rdx             ; d!=0 ? -1 : 0   // AND-mask

 shrx   rax, rdi, rsi
 inc    rax
 and    rax, rdx        ; zero the bogus result if d was initially 0
 ret

Still 4c latency from d->result (and 6 from q->result), because the SETC/DEC happen in parallel with the SHRX/INC. Total uop count: 8. Most of these insns can run on any port, so it should be 1 iter per 2 clocks.

Of course, for pre-HSW, you also need to replace SHRX.

We can get gcc5.4 to emit something nearly as good: (still uses a separate TEST instead of setting mask based on sub rdi, 1, but otherwise the same instructions as above). See it on Godbolt.

T ceil_div_andmask(T p, T q) {
    T mask = -(T)(p!=0);  // TEST+SETCC+NEG
    T e = lg(q);
    T nonzero_result = ((p-1) >> e) + 1;
    return nonzero_result & mask;
}

When the compiler knows that p is non-zero, it takes advantage and makes nice code:

// http://stackoverflow.com/questions/40447195/can-i-hint-the-optimizer-by-giving-the-range-of-an-integer
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#define assume(x) do{if(!(x)) __builtin_unreachable();}while(0)
#else
#define assume(x) (void)(x) // still evaluate it once, for side effects in case anyone is insane enough to put any inside an assume()
#endif

T ceil_div_andmask_nonzerop(T p, T q) {
  assume(p!=0);
  return ceil_div_andmask(p, q);
}
    # gcc5.4 -O3 -march=haswell
    xor     eax, eax             # gcc7 does tzcnt in-place instead of wasting an insn on this
    sub     rdi, 1
    tzcnt   rax, rsi
    shrx    rax, rdi, rax
    add     rax, 1
    ret

Chux / user23_variant

only 3c latency from p->result, and constant q can CSE a lot.

T divide_A_chux(T p, T q) {
  bool round_up = p & (q-1);  // compiles differently from user23_variant with clang: AND instead of 
  return (p >> lg(q)) + round_up;
}

    xor     eax, eax           # in-place tzcnt would save this
    xor     edx, edx           # target for setcc
    tzcnt   rax, rsi
    sub     rsi, 1
    test    rsi, rdi
    shrx    rdi, rdi, rax
    setne   dl
    lea     rax, [rdx+rdi]
    ret

Doing the SETCC before TZCNT would allow an in-place TZCNT, saving the xor eax,eax. I haven't looked at how this inlines in a loop.

Fused-domain uops: 8 (not counting ret) on HSW/SKL
HSW/SKL latency:
- q->result: (tzcnt+shrx(p) | sub+test(p)+setne) + lea(or add) = 5c
- d->result: test(dep on q begins here)+setne+lea = 3c. (the shrx->lea chain is shorter, and thus not the critical path)
Throughput: Probably just bottlenecked on the frontend, at one iter per 2c. Saving the xor eax,eax should speed this up to one per 1.75c (but of course any loop overhead will be part of the bottleneck, because frontend bottlenecks are like that).

Awesome stuff. Still, the C versions have the big advantage of being able to be inlined, and then have all the optimizations applied which result from that. Even if an inline-asm function can be inlined, the further optimizations that the inlining enables don't occur, so you only save the `call` + `ret` overhead. I am just finishing up my benchmark, and it does have a "call function" variant where the functions are declared in another compilation unit, disabling inlining, but pretty much every function tied at 4 or 5 cycles latency (even a dummy "empty" function). — BeeOnRope, Nov 08 '16 at 19:24
... so I'm mostly focusing on more "realistic" variants where the function body is available to the compiler at the call site, so inlining and cross-function optimization plays a huge role. The scenarios include where the divisor is a compile-time constant, where the divisor is not a compile-time constant, but is fixed for the duration of the benchmarking loop, and finally a scenarios where the divisor changes every iteration. — BeeOnRope, Nov 08 '16 at 19:26
@BeeOnRope: I'm not suggesting these for actual use as inline asm, just posting this as the goal for what we're hoping to coax a compiler into emitting! I've still had zero success getting any compiler to read CF after `sub rdi,1`, but at least a TEST can run in parallel with an LEA so it doesn't hurt latency. I noticed that especially with constant `q`, compilers will auto-vectorize `total += ceil_div(p[i], constant_q);` ICC even vectorizes `total += ceil_div(p[i], q[i]);` with some implementations, but it looks insane. (Still scalar BSF). — Peter Cordes, Nov 08 '16 at 20:06
That makes sense, although even trying to get the compiler to emit great ASM for the standalone function may in some cases be wasted effort, since when inlined things look different (e.g., the note that it seems to use a branch for Chris Dodd's solution when compiled standalone, but when actually used and inlined, the jump always seems to go away replaced by a conditional). I've added some benchmark code to the question - the loops like [this for latency](https://github.com/travisdowns/ceiling_dev/blob/master/bench_latency.c), and see `bench_throughput.c` for tput. — BeeOnRope, Nov 08 '16 at 22:41
Yeah I saw a lot of auto-vectorization since 2 of my 3 test categories have loop-invariant `q`, so autovectorization can kick in. Autovectorizing with scalar `bsf` sounds terrible since the overhead to get the shifts counts into a vector must suck. — BeeOnRope, Nov 08 '16 at 22:43
@BeeOnRope: yeah, I think I'm going to go do something else now. The time it's possible to spend optimizing this is getting huge. I don't think I want to take the time to play with them all inlining into different loops, since I have some other stuff to get done. — Peter Cordes, Nov 08 '16 at 22:50
Out of curiosity, did the `assume()` trick you used above help the codegen? I also looked that up (didn't know gcc didn't have a built-in way to do this) but didn't test it out. — BeeOnRope, Nov 08 '16 at 23:07
@BeeOnRope: yes, that's why there is no test/setcc or AND in the version that assumes p!=0. It's pretty cool! BTW, I think the AND-mask version has no advantages vs. the Chux / user23_variant version, except maybe when one arg or the other is invariant. With both variable, it has worse latency and the same uop count. (Oh, except that clang compiles it to a good CMOV-at-the-end version.) — Peter Cordes, Nov 08 '16 at 23:11
I tried using STOKE to find answers exhaustively, and put some of my results in an answer. It was pretty clever for 32-bit, but didn't any 64-bit answers that were as good as your best, partly since it seems to misunderstand latency of some operations. — BeeOnRope, Nov 12 '16 at 05:02
Peter - I added your `ceil_div_andmask` method to the roundup. In the actual benchmark, `p` is [the running total](https://github.com/travisdowns/ceiling_dev/blob/master/bench_latency.c#L15), so the compiler can't prove that `p` is zero, so it still generates all the conditional-move-if-p-zero code. In a different scenario, it could be faster. — BeeOnRope, Nov 15 '16 at 01:52
@BeeOnRope: Thanks. If you or someone else is feeling ambitious, testing it on HSW would be interesting. My andmask idea is designed to run well on pre-Broadwell, where CMOV and ADC are 2 uops. I wasn't expecting it to do well on SKL, but I pointed it out because it's one of the few where I got the compiler to produce near-optimal asm. But use23_variant/chux are probably still better in that case. — Peter Cordes, Nov 15 '16 at 02:08
Did your testing include any that branch? Did you try profile-guided optimization? gcc might decide to use a branch if PGO shows that the branch would have gone one way almost all the time, even if it decides on branchless without PGO. chux/user23 with a branch might do better for throughput, unless powers of 2 are common in the input. (macro-fused TEST/JNE to skip an INC or something, replacing XOR/SETNE). — Peter Cordes, Nov 15 '16 at 02:14
Yes, the `chrisdodd` solution did branch in the "throughput" variant of the test, but not the "latency" one. I added a little section just now in the question under heading branchfree. You can also check out the full assembly listing for the [latency](https://github.com/travisdowns/ceiling_dev/blob/master/bench_latency.c) and [throughput](https://github.com/travisdowns/ceiling_dev/blob/master/bench_throughput.c) tests. — BeeOnRope, Nov 15 '16 at 02:37
I didn't use PGO. Partly because I'm lazy, and partly because the test is already very artificial. Running PGO on top of an artificial test set makes it _super-artificial_. At a minimum, I'd need to come up with more realistic input data and further break all the tests down by "branch well-predicted" and "branch not-well predicted". That's further complicated by the fact that not all the algorithms are branching in the same cases! — BeeOnRope, Nov 15 '16 at 03:00
BTW, the bounty went to `user2357112` since he came up with the winning solution _under the above particular conditions, including some unstated conditions_ - but I'm accepting this answer as the best overall summary of the various options and a deep look at it from the x86 assembly point of view. Even though the discussion is x86-oriented, many the key points apply to most other architectures (but fusion may be different, there may be no `adc`, etc), and also often map back to C. — BeeOnRope, Nov 15 '16 at 03:02
@BeeOnRope: yeah, the artificiality of the input is problematic. But at some point you just have to give up on "the general case", because it's too huge. Very nice work on this question; I think there's some useful stuff here now in terms of test procedures as well as the actual result. (And thanks for the accept vote; I wasn't really aiming for the bounty once I gave up on wrestling with the compiler to emit any of the optimal asm I thought of.) — Peter Cordes, Nov 15 '16 at 03:02

user2357112 · Answer 2 · 2016-11-04T21:31:01.787

11

uint64_t exponent = lg(q);
uint64_t mask = q - 1;
//     v divide
return (p >> exponent) + (((p & mask) + mask) >> exponent)
//                       ^ round up

The separate computation of the "round up" part avoids the overflow issues of (p+q-1) >> lg(q). Depending on how smart your compiler is, it might be possible to express the "round up" part as ((p & mask) != 0) without branching.

edited Nov 04 '16 at 21:31

answered Nov 04 '16 at 21:25

user2357112

260,549
28
431
505

Indeed, [godbolt shows](https://godbolt.org/g/oYZtHM) that at least recent versions of gcc, clang and icc produce branch free code for your `!= 0` variant. Based on my simple counting, both versions have a latency of 5 cycles, so the extra shifting doesn't hurt too much (at least if you target an architecture with `BMI` so you get `shrx` and friends. – BeeOnRope Nov 04 '16 at 22:16
Thannks @user2357113 - I exhaustively validated (for _all_ p/q) the 32-bit versions of both versions of your code (the one shown and the with `((p & mask) != 0`). No issues found. Will post performance results soon. – BeeOnRope Nov 05 '16 at 22:35
Your variant with `((p & mask) != 0)` was the winner overall! In practice compilers optimize it well and it vectorized well. Enjoy your bounty! – BeeOnRope Nov 15 '16 at 01:54
1

I accepted Peter's answer as the overall answer because the truth is everything is highly dependent on the input, the compiler and the surrounding code, so his survey is probably the most comprehensive way to look at the issue. – BeeOnRope Nov 15 '16 at 01:55

score 8 · Answer 3 · edited Nov 07 '16 at 09:38

8

The efficient way of dividing by a power of 2 for an unsigned integer in C is a right shift -- shifting right one divides by two (rounding down), so shifting right by n divides by 2ⁿ (rounding down).

Now you want to round up rather than down, which you can do by first adding 2ⁿ-1, or equivalently subtracting one before the shift and adding one after (except for 0). This works out to something like:

unsigned ceil_div(unsigned d, unsigned e) {
    /* compute ceil(d/2**e) */
    return d ? ((d-1) >> e) + 1 : 0;
}

The conditional can be removed by using the boolean value of d for addition and subtraction of one:

unsigned ceil_div(unsigned d, unsigned e) {
    /* compute ceil(d/2**e) */
    return ((d - !!d) >> e) + !!d;
}

Due to its size, and the speed requirement, the function should be made static inline. It probably won't make a different for the optimizer, but the parameters should be const. If it must be shared among many files, define it in a header:

static inline unsigned ceil_div(const unsigned d, const unsigned e){...

edited Nov 07 '16 at 09:38

2501

25,460
4
47
87

answered Nov 04 '16 at 22:10

Chris Dodd

119,907
13
134
226

Nice!. Will have to remember. – chux - Reinstate Monica Nov 04 '16 at 22:42
Thanks. Keep in mind that per my question, the second argument `q` is the divisor, which _is_ a power of two by contract, but the value of `q` is the plain divisor, not the power. That is, to execute `10 / 8` you would call `divide(10, 8)`, not `divide(10, 3)` (as 8 = 2^3). You can use the method `lq(x)` in your answer if you want to get the exponent. – BeeOnRope Nov 04 '16 at 22:46
@BeeOnRope: then you can safely use gcc's `__builtin_ctzll(q)` or whatever other trailing-zero count method to get the log2 (shift count). If you want to put in an `assert`, there are efficient ways to check that an integer is a power of 2 (only has a single bit set). (That GNU C builtin gives an undefined result if `q` == 0, so on x86 it can compile to just a `BSF` with no error check.) – Peter Cordes Nov 06 '16 at 20:55
Yes, of course - just pointing out to the OP that I will have to modify his function to compare it on an apples-for-apples basis with the other solutions. – BeeOnRope Nov 06 '16 at 20:56
Thanks @Chris - I've exhaustively validated your solution for all 32-bit values, and preliminary indications are that it is performance competitive with the other top solutions. – BeeOnRope Nov 06 '16 at 21:23
@BeeOnRope: I was about to throw everyone's answers onto Godbolt and see what they look like. Are you about to post something like that yourself? If so, I'll hold off on doing preparing an answer like that. – Peter Cordes Nov 06 '16 at 21:33
I am putting them all into a C project, and was going to include (1) latency timing, (2) throughput timing and (3) code-size in bytes. I've put a few on godbot also, and found some interesting stuff (e.g., the code surrounding `__builtin_clz` seems to vary in its behavior depending on `march` value which is weird). So yeah there might be some redundancy in that. – BeeOnRope Nov 06 '16 at 21:41
@BeeOnRope: Why are you using `__builtin_clz` to count *leading* zeros (high bits), instead of using `__builtin_ctz` to count trailing zeros (low bits). The main difference with -march there is that LZCNT and BSR return opposite results. (BSR(q) = 64-LZCNT(q)). – Peter Cordes Nov 06 '16 at 22:18
Because I don't think you can implement `lg()` with `ctz` alone? Can you clarify how you'd do it with that instruction? FWIW, about `bsr` vs `lzcnt`: `gcc` at least is fully smart enough to use `lzcnt` and `bsr` interchangeably to implement `clz` - depending on the surrounding code, and whether the _UB_ at zero is relevant, it will pick either one. My impression is that Intel introduced `lzcnt` _only_ to get around the _UB_ with all-zero inputs. – BeeOnRope Nov 06 '16 at 22:20
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/127500/discussion-between-beeonrope-and-peter-cordes). – BeeOnRope Nov 06 '16 at 22:24
As @PeterCordes pointed out, when `q` is a power of 2, `clz(x)` is equivalent to `31 - ctx(x)` (and vise-versa). So I could have used `__builtin_ctz` without the `63U - ...` part. As it turns out, however, gcc is smart enough to remove the subtraction and generate optimal code with a single `bsr` - yet for the functionally identical `ctz` variant it decides to use `lzcnt` with an `xor` which doesn't seem better. Also, `gcc` treats the `lg` oddly depending on the `-march` parameter, which is weird enough that I created a [separate question](http://stackoverflow.com/q/40456481) to cover it. – BeeOnRope Nov 07 '16 at 02:18
To remove the ternary conditional (for branch prediction purposes) you can do `!!d + ((d-1) >> e)` – technosaurus Nov 07 '16 at 04:04
@technosaurus: That no longer solves the problem of 0-1 wrapping to UINT_MAX, and an unsigned shift of that leaving a large number, when the correct result of 0 / anything = 0. (QuestionC's answer just uses a constant +1 instead of +!!d, which works for signed types because -1 >>1 = -1 (with an arithmetic right shift)). Anyway, **see https://godbolt.org/g/g9StKJ for the answer so far**, in gcc/clang/icc with -O3 -march=haswell for all 3 compilers side-by-side. (yes, Godbolt can do that now :) If d==0 is rare (or always happens), branching on it is totally fine (and actually preferable). – Peter Cordes Nov 07 '16 at 05:19
@PeterCordes - yeah, I missed that in passing, I thought you were trying to do something like: `unsigned ceil_div(unsigned d, unsigned e){return (d+((1U<> e;}` which compiles to 4 instructions on ICC, GCC and clang. – technosaurus Nov 07 '16 at 18:19
I hope my edit didn't convert that upvote, I just wanted to improve the answer, since I though that a standalone answer of my own would be superfluous. – 2501 Nov 07 '16 at 19:10
Chris, I added benchmarks for your solution with the suffix _chrisdodd in the update to my question. It's competitive! – BeeOnRope Nov 08 '16 at 22:37

score 8 · Answer 4 · edited May 23 '17 at 11:54

8

Efficiently dividing unsigned value by a power of two, rounding up

[Re-write] given OP's clarification concerning power-of-2.

The round-up or ceiling part is easy when overflow is not a concern. Simple add q-1, then shift.

Otherwise as the possibility of rounding depends on all the bits of p smaller than q, detection of those bits is needed first before they are shifted out.

uint64_t divide_A(uint64_t p, uint64_t q) {
  bool round_up = p & (q-1);
  return (p >> lg64(q)) + round_up;
}

This assumes code has an efficient lg64(uint64_t x) function, which returns the base-2 log of x, if x is a power-of-two.`

edited May 23 '17 at 11:54

Community

1
1

answered Nov 04 '16 at 22:38

chux - Reinstate Monica

143,097
13
135
256

Thanks. Keep in mind that per my question, the second argument `q` is the divisor, which _is_ a power of two by contract, but the value of `q` is the plain divisor, not the power. That is, to execute `10 / 8` you would call `divide(10, 8)`, not `divide(10, 3)` (as 8 = 2^3). – BeeOnRope Nov 04 '16 at 22:45
I used your functions [as shown here](https://gist.github.com/travisdowns/edb91e38dede70a6fa2b882a3790328e) - including 32-bit implementation, using `stdbool.h` for `bool`. It validated fine for all ~137 billion 32-bit inputs, so it almost certainly works in general. I think this answer is equivalent to @user2357112's variant with `(p & mask) != 0` and initial timing shows they are the same. – BeeOnRope Nov 06 '16 at 21:38
Chux, I added benchmarks of your solution using the suffix `_chux` above. Your solution is competitive with the best! – BeeOnRope Nov 08 '16 at 22:38
@BeeOnRope: This version has the lowest p->result latency of any branchless version, and compiles to non-terrible asm. I added the asm to my answer in an update. It probably also auto-vectorizes better than most (when the same q is used with multiple dividends). – Peter Cordes Nov 08 '16 at 22:39
1

@PeterCordes, yes - but note also that this is identical to user2357112's "variant" answer (see the last part of his answer), which I've included as `divide_user23_variant` above. user23... had the first answer and already included that there, so I'll give him credit if it comes to it. – BeeOnRope Nov 08 '16 at 22:49
FWIW, your new code finished (tied for) first place (see Results above), but I awarded the bounty to user2357112 since he came up with it first (in the first answer nonteheless). – BeeOnRope Nov 15 '16 at 03:12
@BeeOnRope This answer presented code the showed the `bool` approach. To me the other's prior idea about a _possibility_ was obscure in it presentation and explanation, else I would not have posted. – chux - Reinstate Monica Nov 15 '16 at 03:27
The earlier comment answer was very clear to me: _"...express the "round up" part as ((p & mask) != 0)"_. The code even excluded an indicator as to exactly which part was considered "round up" so there was no ambiguity. It would be unfair the penalize that poster just because they did not bother with the copy/paste to write the nearly identical variant out in long-form. In any case, be happy that you independently across a good solution to this problem! – BeeOnRope Nov 15 '16 at 03:38

Chris · Answer 5 · 2016-11-07T18:48:44.587

7

My old answer didn't work if p was one more than a power of two (whoops). So my new solution, using the __builtin_ctzll() and __builtin_ffsll() functions⁰ available in gcc (which as a bonus, provides the fast logarithm you mentioned!):

uint64_t divide(uint64_t p,uint64_t q) {
    int lp=__builtin_ffsll(p);
    int lq=__builtin_ctzll(q);
    return (p>>lq)+(lp<(lq+1)&&lp);
}

Note that this is assuming that a long long is 64 bits. It has to be tweaked a little otherwise.

The idea here is that if we need an overflow if and only if p has fewer trailing zeroes than q. Note that for a power of two, the number of trailing zeroes is equal to the logarithm, so we can use this builtin for the log as well.

The &&lp part is just for the corner case where p is zero: otherwise it will output 1 here.

⁰ Can't use __builtin_ctzll() for both because it is undefined if p==0.

edited Nov 07 '16 at 18:48

answered Nov 04 '16 at 22:08

Chris

1,613
17
24

2

This fails for `p=1, q=2`. Correct answer is 1, but your code gives zero: `((1>>1)+(2>>1)-1)>>(lg(2)-1) == (0+1-1)>>(1-1) == 0 >> 0 == 0`. – BeeOnRope Nov 06 '16 at 22:11
@BeeOnRope Looking at it a bit more, it failed for any number one more than a power of two. Whoops! Hopefully this answer will be more satisfying. And I actually tested thoroughly this time. – Chris Nov 07 '16 at 06:59
Right, I only report the first failure :) – BeeOnRope Nov 07 '16 at 15:12
Chris, are you missing parens around `p>>lq`? The code as shown fails still for `p=1, q=2`, but I think you likely meant `return (p>>lq)+(lp<(lq+1)&&lp)`. Recall that `+` is higher precedence than `>>`. – BeeOnRope Nov 07 '16 at 15:22
2

Your new function works and I added benchmarks for it above. It's considerably slower than the leading solutions however. – BeeOnRope Nov 08 '16 at 22:45

technosaurus · Answer 6 · 2018-01-09T14:20:05.720

6

If the dividend/divisor can be guaranteed not to exceed 63 (or 31) bits, you can use the following version mentioned in the question. Note how p+q could overflow if they use all 64 bit. This would be fine if the SHR instruction shifted in the carry flag, but AFAIK it doesn't.

uint64_t divide(uint64_t p, uint64_t q) {
  return (p + q - 1) >> lg(q);
}

If those constraints cannot be guaranteed, you can just do the floor method and then add 1 if it would round up. This can be determined by checking if any bits in the dividend are within the range of the divisor. Note: p&-p extracts the lowest set bit on 2s complement machines or the BLSI instruction

uint64_t divide(uint64_t p, uint64_t q) {
  return (p >> lg(q)) + ( (p & -p ) < q );
}

Which clang compiles to:

    bsrq    %rax, %rsi
    shrxq   %rax, %rdi, %rax
    blsiq   %rdi, %rcx
    cmpq    %rsi, %rcx
    adcq    $0, %rax
    retq

That's a bit wordy and uses some newer instructions, so maybe there is a way to use the carry flag in the original version. Lets see: The RCR instruction does but seems like it would be worse ... perhaps the SHRD instruction... It would be something like this (unable to test at the moment)

    xor     edx, edx     ;edx = 0 (will store the carry flag)
    bsr     rcx, rsi     ;rcx = lg(q) ... could be moved anywhere before shrd
    lea     rax, [rsi-1] ;rax = q-1 (adding p could carry)
    add     rax, rdi     ;rax += p  (handle carry)
    setc    dl           ;rdx = carry flag ... or xor rdx and setc
    shrd    rax, rdx, cl ;rax = rdx:rax >> cl
    ret

1 more instruction, but should be compatible with older processors (if it works ... I'm always getting a source/destination swapped - feel free to edit)

Addendum:

I've implemented the lg() function I said was available as follows:

inline uint64_t lg(uint64_t x) {
  return 63U - (uint64_t)__builtin_clzl(x);
}

inline uint32_t lg32(uint32_t x) {
  return 31U - (uint32_t)__builtin_clz(x);
}

The fast log functions don't fully optimize to bsr on clang and ICC but you can do this:

#if defined(__x86_64__) && (defined(__clang__) || defined(__INTEL_COMPILER))
static inline uint64_t lg(uint64_t x){
  inline uint64_t ret;
  //other compilers may want bsrq here
  __asm__("bsr  %0, %1":"=r"(ret):"r"(x));
  return ret;
}
#endif

#if defined(__i386__) && (defined(__clang__) || defined(__INTEL_COMPILER))    
static inline uint32_t lg32(uint32_t x){
  inline uint32_t ret;
  __asm__("bsr  %0, %1":"=r"(ret):"r"(x));
  return ret;
}
#endif

edited Jan 09 '18 at 14:20

answered Nov 09 '16 at 02:43

technosaurus

7,676
1
30
52

FWIW, that's exactly how I implemented `lg32` and `lg64`. It _does_ fully optimize down to a single `bsr` with `-mtune=haswell`, but not without that flag, which is weird since that instruction has been around forever and the added stuff is just cruft. Weird enough that I made [another question](http://stackoverflow.com/questions/40456481/gcc-compiles-leading-zero-count-poorly-unless-haswell-specified) about it specifically. – BeeOnRope Nov 09 '16 at 02:50
... and the problem with the inline asm stuff is it will break optimization around it. I don't think it will do CSE on it, and it certainly won't calculate the value if the input is a compile-time constant. – BeeOnRope Nov 09 '16 at 02:52
@BeeOnRope: non-`volatile` `asm` statements are considered pure functions of their inputs, and assumed to have no side-effects, so they can CSE (or be optimized away entirely if their output operands are unused). gcc and clang do that, but ICC doesn't (https://godbolt.org/g/zsvPYA) :/ BTW, those defined() conditions need parens around clang||icc, and the `inline` keyword doesn't do anything on function args or locals. It does prevent constant propagation, so I highly don't recommend it. – Peter Cordes Nov 09 '16 at 03:33
But anyway, for clang at least, the right solution is to just use `__builtin_ctz` to count trailing zeros, so there's no 64 - x to optimize away in the first place. Then compilers can use `tzcnt` or `bsf`, whichever they prefer. @techno: see the godbolt link in my answer. – Peter Cordes Nov 09 '16 at 03:35
Huh, so they are better supported than I thought! The last part still holds though - the compiler won't "look into" the asm and calculate the result of the `bsr`, which could be key to speeding up the "constant q" case. In fact though, on Haswell, constant q is only about as fast as the weaker "invariant q" - partly because `shrx` and friends are better than the old variable shifts. Vectorization is somewhat better with constant q versus invariant. – BeeOnRope Nov 09 '16 at 03:37
@PeterCordes - thanks - fixed the inline and parens, The ctz version makes sense ... it would almost be easier if the "q" parameter were the power of 2 instead of 2 raised to a power. – technosaurus Nov 09 '16 at 03:43
*if the SHR instruction shifted in the carry flag* You're right, it doesn't. `rcr r, 1` does, and the count=1 version is 3 uops with 2c latency. (And then you do the rest of the shift with SHRX by `lg(q)-1`.) I put a version of this idea in the godbolt link in my answer, but didn't mention it in my actual answer. It breaks for q=1, because the RCR is unconditional. Variable-count RCR is horrible (8 uops, 6c latency), but you can get correct results by doing that and then masking the high garbage (with a mask calculated from the shift count, `(1ULL<<(64-e)) - 1` or something). – Peter Cordes Nov 09 '16 at 03:44
Oh, I see you did try that. Yeah, SHRD to shift in the carry and then zeros looks like the best way to implement that idea. `adc rdx, 0` only sets RDX=CF if RDX was originally zero... Perhaps you meant `sbb rdx,rdx` / `neg rdx`? But note that it still has a dependency on RDX, on CPUs other than AMD (where it's recognized as only depending on CF), so xor-zero / generate flags / setc dl is probably better, unless you can choose a register that is used by something dependency-breaking between invocations of this. – Peter Cordes Nov 09 '16 at 03:52
@PeterCordes - I had `xor edx, edx` at the top but removed it for some reason when I changed `setc` to `add`, but hadn't considered `sbb rdx, rdx`... the xor + setc is more instructions but probably faster in real life anyhow, so I'll put that back. – technosaurus Nov 09 '16 at 03:59
Your BLSI idea is cool. Too bad gcc5.4 doesn't use adc the way clang does: it ends up with SETC / MOVZX / LEA instead of ADC (which is 1 uop with 1c latency on Broadwell and newer). Clang's asm has p->result latency = 3c (SHRX(dep on lg(q)) | BLSI+CMP(q)) + ADC. q->result latency = 5c TZCNT + (SHRX|CMP)(dep on p) + ADC. And it's 5 uops. So it's as good as anything I've looked at so far, I think. I had wondered if you could do anything with BLSI, but hadn't figured out what yet. – Peter Cordes Nov 09 '16 at 04:08
So with gcc5.4, it has an extra 2c of latency after the CMP (so it's in both the p and q chains). xor ahead of flag setting, then setcc, would be just 1 more cycle on the critical path, but gcc chose MOVZX, at least for the stand-alone version of the function. – Peter Cordes Nov 09 '16 at 04:10
@PeterCordes its really hard to say, I don't have Agner Fog's instruction tables and the variations in pipelines between architectures committed to memory. For instance, the `BSR` instruction in my assembly code can go anywhere before the `SHRD` and the `XOR` can go anywhere before the `SETC`, but I would have to play around with it to figure out which is optimal for the pipeline. Even though there may be more/slower instructions in the gcc version, the architecture may be able to perform them nearly simultaneously. – technosaurus Nov 09 '16 at 08:30
@technosaurus: I don't have it all memorized, I had to tab over to it and search for BLSI to see if it was 1 or 3c. Turns out 1c. Note that XOR sets flags, so it actually has to go before the CMP. I've always been assuming that out-of-order execution does a perfect job of running everything in parallel when data dependencies don't prevent that. Instruction re-ordering isn't usually signficant, just the dependency chains. – Peter Cordes Nov 09 '16 at 15:51
"p&-p extracts the lowest set bit on 2s complement machines" --> misleads. Arguments are not 2's complement. They are unsigned. – chux - Reinstate Monica Nov 10 '16 at 13:30
Sure they are unsigned _types_ in C but that trick still works fine as long as the machine is twos complement. So that statement seems fine to me. – BeeOnRope Nov 12 '16 at 13:37

score 5 · Answer 7 · edited May 23 '17 at 12:25

There has already been a lot of human brainpower applied to this problem, with several variants of great answers in C along with Peter Cordes's answer which covers the best you could hope for in asm, with notes on trying to map it back to C.

So while the humans are having their kick at the can, I thought see what some brute computing power has to say! To that end, I used Stanford's STOKE superoptimizer to try to find good solutions to the 32-bit and 64-bit versions of this problem.

Usually, a superoptimizer is usually something like a brute force search through all possible instruction sequences until you find the best one by some metric. Of course, with something like 1,000 instructions that will quickly spiral out of control for more than a few instructions¹. STOKE, on the hand, takes a guided randomized approach: it randomly makes mutations to an existing candidate program, evaluating at each step a cost function that takes both performance and correctness into effect. That's the one-liner anyway - there are plenty of papers if that stoked your curiosity.

So within a few minutes STOKE found some pretty interesting solutions. It found almost all the high-level ideas in the existing solutions, plus a few unique ones. For example, for the 32-bit function, STOKE found this version:

neg rsi                         
dec rdi                         
pext rax, rsi, rdi
inc eax
ret

It doesn't use any leading/trailing-zero count or shift instructions at all. Pretty much, it uses neg esi to turn the divisor into a mask with 1s in the high bits, and then pext effectively does the shift using that mask. Outside of that trick it's using the same trick that user QuestionC used: decrement p, shift, increment p - but it happens to work even for zero dividend because it uses 64-bit registers to distinguish the zero case from the MSB-set large p case.

I added the C version of this algorithm to the benchmark, and added it to the results. It's competitive with the other good algorithms, tying for first in the "Variable Q" cases. It does vectorize, but not as well as the other 32-bit algorithms, because it needs 64-bit math and so the vectors can process only half as many elements at once.

Even better, in the 32-bit case it came up with a variety of solutions which use the fact that some of the intuitive solutions that fail for edge cases happen to "just work" if you use 64-bit ops for part of it. For example:

tzcntl ebp, esi      
dec esi             
add rdi, rsi        
sarx rax, rdi, rbp
ret

That's the equivalent of the return (p + q - 1) >> lg(q) suggestion I mentioned in the question. That doesn't work in general since for large p + q it overflows, but for 32-bit p and q this solution works great by doing the important parts in 64-bit. Convert that back into C with some casts and it actually figures out that using lea will do the addition in one instruction¹:

stoke_32(unsigned int, unsigned int):
        tzcnt   edx, esi
        mov     edi, edi          ; goes away when inlining
        mov     esi, esi          ; goes away when inlining
        lea     rax, [rsi-1+rdi]
        shrx    rax, rax, rdx
        ret

So it's a 3-instruction solution when inlined into something that already has the values zero-extended into rdi and rsi. The stand-alone function definition needs the mov instructions to zero-extend because that's how the SysV x64 ABI works.

For the 64-bit function it didn't come up with anything that blows away the existing solutions but it did come up with some neat stuff, like:

  tzcnt  r13, rsi      
  tzcnt  rcx, rdi      
  shrx   rax, rdi, r13 
  cmp    r13b, cl        
  adc    rax, 0        
  ret

That guy counts the trailing zeros of both arguments, and then adds 1 to the result if q has fewer trailing zeros than p, since that's when you need to round up. Clever!

In general, it understood the idea that you needed to shl by the tzcnt really quickly (just like most humans) and then came up with a ton of other solutions to the problem of adjusting the result to account for rounding. It even managed to use blsi and bzhi in several solutions. Here's a 5-instruction solution it came up with:

tzcnt r13, rsi                  
shrx  rax, rdi, r13             
imul  rsi, rax                   
cmp   rsi, rdi                    
adc   rax, 0                    
ret

It's a basically a "multiply and verify" approach - take the truncated res = p \ q, multiply it back and if it's different than p add one: return res * q == p ? ret : ret + 1. Cool. Not really better than Peter's solutions though. STOKE seems to have some flaws in its latency calculation - it thinks the above has a latency of 5 - but it's more like 8 or 9 depending on the architecture. So it sometimes narrows in solutions that are based on its flawed latency calculation.

¹ Interestingly enough though this brute force approach reaches its feasibility around 5-6 instructions: if you assume you can trim the instruction count to say 300 by eliminating SIMD and x87 instructions, then you would need ~28 days to try all 300 ^ 5 5 instruction sequences at 1,000,000 candidates/second. You could perhaps reduce that by a factor of 1,000 with various optimizations, meaning less than an hour for 5-instruction sequences and maybe a week for 6-instruction. As it happens, most of the best solutions for this problem fall into that 5-6 instruction window...

² This will be a slow lea, however, so the sequence found by STOKE was still optimal for what I optimized for, which was latency.

Is `pext rsi, rdi, rax` in `.att_syntax noprefix`, with the destination last? Or did it just end up that way after removing the `%` decorators manually but not reversing the operands? >.< Neat stuff, but the code blocks need fixing. Maybe just leave them in proper AT&T syntax to avoid risk of error. — Peter Cordes, Nov 12 '16 at 05:29
It looks like some of the 32-bit sequences assume that the 32-bit values are zero-extended to fill a 64-bit register. That is of course normally free when inlining. — Peter Cordes, Nov 12 '16 at 05:34
The output was AT&T so yeah I tried to fix everything up by hand (is there a converter out there?). I certainly may have made some mistakes - let me double check them. About the high bits - the way STOKE works is that it captures the actual function input across a variety of test cases and uses that for its search and to validate solutions. Since in practice the high bits of `rdi` and `rsi` were always zero, it created solutions that rely on it. So yeah it's not 100℅ apples to with the compiler output that needs to deal with garbage. @PeterCordes — BeeOnRope, Nov 12 '16 at 13:47
One safe way to convert is assemble -> disassemble. It's really easy to miss something when converting by hand if you ever get distracted part way through. — Peter Cordes, Nov 12 '16 at 18:13
@PeterCordes - good point, that's pretty safe way to do it. I don't think STOKE has any Intel output options, so I was stuck with AT&T. — BeeOnRope, Nov 14 '16 at 18:38

Weather Vane · Answer 8 · 2016-11-10T20:20:23.013

3

You can do it like this, by comparing dividing n / d with dividing by (n-1) / d.

#include <stdio.h>

int main(void) {
    unsigned n;
    unsigned d;
    unsigned q1, q2;
    double actual;

    for(n = 1; n < 6; n++) {
        for(d = 1; d < 6; d++) {
            actual = (double)n / d;
            q1 = n / d;
            if(n) {
                q2 = (n - 1) / d;
                if(q1 == q2) {
                    q1++;
                }
            }
            printf("%u / %u = %u (%f)\n", n, d, q1, actual);
        }
    }

    return 0;
}

Program output:

1 / 1 = 1 (1.000000)
1 / 2 = 1 (0.500000)
1 / 3 = 1 (0.333333)
1 / 4 = 1 (0.250000)
1 / 5 = 1 (0.200000)
2 / 1 = 2 (2.000000)
2 / 2 = 1 (1.000000)
2 / 3 = 1 (0.666667)
2 / 4 = 1 (0.500000)
2 / 5 = 1 (0.400000)
3 / 1 = 3 (3.000000)
3 / 2 = 2 (1.500000)
3 / 3 = 1 (1.000000)
3 / 4 = 1 (0.750000)
3 / 5 = 1 (0.600000)
4 / 1 = 4 (4.000000)
4 / 2 = 2 (2.000000)
4 / 3 = 2 (1.333333)
4 / 4 = 1 (1.000000)
4 / 5 = 1 (0.800000)
5 / 1 = 5 (5.000000)
5 / 2 = 3 (2.500000)
5 / 3 = 2 (1.666667)
5 / 4 = 2 (1.250000)
5 / 5 = 1 (1.000000)

Update

I posted an early answer to the original question, which works, but did not consider the efficiency of the algorithm, or that the divisor is always a power of 2. Performing two divisions was needlessly expensive.

I am using MSVC 32-bit compiler on a 64-bit system, so there is no chance that I can provide a best solution for the required target. But it is an interesting question so I have dabbled around to find that the best solution will discover the bit of the 2**n divisor. Using the library function log2 worked but was so slow. Doing my own shift was much better, but still my best C solution is

unsigned roundup(unsigned p, unsigned q)
{
    return p / q + ((p & (q-1)) != 0);
}

My inline 32-bit assembler solution is faster, but of course that will not answer the question. I steal some cycles by assuming that eax is returned as the function value.

unsigned roundup(unsigned p, unsigned q)
{
    __asm {
        mov eax,p
        mov edx,q
        bsr ecx,edx     ; cl = bit number of q
        dec edx         ; q-1
        and edx,eax     ; p & (q-1)
        shr eax,cl      ; divide p by q, a power of 2
        sub edx,1       ; generate a carry when (p & (q-1)) == 0
        cmc
        adc eax,0       ; add 1 to result when (p & (q-1)) != 0
    }
}                       ; eax returned as function value

edited Nov 10 '16 at 20:20

answered Nov 04 '16 at 21:47

Weather Vane

33,872
7
36
56

`double` can't exactly represent every 64-bit integer, so this can't work for uint64_t inputs (e.g. divisor = 1 will just return a value rounded to the nearest integer that a `double` can represent, instead of the original. It might work for all dividends with divisors of 2^(64-53) or higher). It's also very clunky to convert *unsigned* 64-bit integers to/from double, on x86, so it's even slower than if the inputs were signed. Good thing it's possible to do this with shifts/adds while avoiding overflow, since this idea has a lot of downsides. – Peter Cordes Nov 06 '16 at 22:02
@PeteCordes I think you misread my answer. The `double` was only there to show how the integer was correctly rounded up - not suggested as part of the solution. – Weather Vane Nov 06 '16 at 22:50
Oh, yes I did, sorry. I was surprised you were suggesting anything as lame as dividing doubles, but I didn't look carefully and the C89-style separate declarations of variables didn't help. – Peter Cordes Nov 07 '16 at 00:05
Unfortunately, even after taking out the float part, [it compiles to two integer DIV r64 instructions on gcc/clang/icc](https://godbolt.org/g/Lu6h9W), so it might give the right answer but is complete garbage for performance. (Integer DIV is usually less pipelined than FP div, so CPUs don't take advantage of the parallelism available). In a compile-time constant loop, you get something different on some compilers, so I pulled out just the actual function into a stand-alone C function. – Peter Cordes Nov 07 '16 at 00:14
@PeteCordes thanks for the comments, I might have a solution but failed to make it efficient - each the integer divisions (by a power of 2) could be a shift. – Weather Vane Nov 07 '16 at 00:38
1

@WeatherVane - I have validated your solution as correct for all 32-bit inputs, but timing-wise preliminary results indicate that it isn't competitive with the other solutions (taking ~2300 seconds for all 137,000,000,000 iterations vs ~200 seconds for the fastest solutions). – BeeOnRope Nov 07 '16 at 01:36
I posted benchmarks in my question now, including your solution. – BeeOnRope Nov 08 '16 at 22:45
@BeeOnRope I have updated my answer. You asked a good question. – Weather Vane Nov 10 '16 at 20:22
edit: Oh I see you were using `/q` instead of `>>lg(q)`, that explains why MSVC's generated asm doesn't beat your inline asm. Your asm doesn't look like anything special (I think setnz/add would be better than sub/cmc/adc), and 5 cycles of store-forwarding latency from transfering input operands through memory doesn't help either. Intel has intrinsics for BSR and BSF (as well as TZCNT/LZCNT), which you could use to implement lg() – Peter Cordes Nov 10 '16 at 21:01
@PeterCordes I already said my solution was nothing special, so I don't need your neg. My C solution is standard C, did I miss `lg`? I already read your superlative answer. – Weather Vane Nov 10 '16 at 21:14
Bee provided a GNU C implementation in the question, but you could make an MSVC-compatible one with other intrinsics. Sorry, not trying to rain on your parade, just pointing out that MSVC inline-asm adds overhead for short sequences like this, so it's definitely worth avoiding by using intrinsics if possible. – Peter Cordes Nov 10 '16 at 21:35
You can look also look at _declspec naked_ which means you can just write your function assuming the Win32/64 ABI, and not have to mess around with a redundant prologue (like `mov eax,p`) to set up your function. So you eliminate both the hidden prologue inserted by the compiler and your proloque in the asm. In any case the solution you came up with was the same as user2357112's "variant" solution and the later proposal by chux, except that you still have a divide: `p / q` and need the compiler to convert it to a shift for constant `q`. Of course this will still be slow for non-constant `q`! – BeeOnRope Nov 12 '16 at 04:59
1

In MSVC you can use the [`_BitScanForward/Backward`](https://msdn.microsoft.com/en-us/library/wfd9z0bb.aspx) intrinsics to emit `bsr` or `bsf`. Not sure about the newer `tzcnt` instructions - perhaps there are some intrinsics (there seems to be intrinsics for most of the BMI stuff). – BeeOnRope Nov 12 '16 at 05:01
@BeeOnRope I did get some speed improvement by using instrisics in the C version, and with `__declspec(naked)` in the asm version, but it is obvious I will not improve on other answers without coming up with a "strange" solution - and then I saw your answer which has found such solutions anyway. – Weather Vane Nov 12 '16 at 10:06
@PeterCordes thanks for your observations, for example the `cmc` is unnecessary if `sbb eax,-1` is used instead of addition. – Weather Vane Nov 12 '16 at 10:06

QuestionC · Answer 9 · 2016-11-04T22:38:16.613

2

This seems efficient and works for signed if your compiler is using arithmetic right shifts (usually true).

#include <stdio.h>

int main (void)
{
    for (int i = -20; i <= 20; ++i) {
        printf ("%5d %5d\n", i, ((i - 1) >> 1) + 1);
    }
    return 0;
}

Use >> 2 to divide by 4, >> 3 to divide by 8, &ct. Efficient lg does the work there.

You can even divide by 1! >> 0

edited Nov 04 '16 at 22:38

answered Nov 04 '16 at 22:09

QuestionC

10,006
4
26
44

depends on implementation-defined behavior for right shifts of negative numbers... – Chris Dodd Nov 04 '16 at 22:11
2

Yeah I specified unsigned (at least for now) to avoid the whole ball-of-wax surrounding signed right shifts. – BeeOnRope Nov 04 '16 at 22:24
1

@QuestionC - I implemented your code (for unsigned arguments) [like this](https://gist.github.com/anonymous/5350e5394a379f99774d01dace85f6ef) - does that seem right? It fails for large `p` (where the MSB is set), because the shift results in a negative number (the MSB is still 1). For example for `p=2147483649, q=2` your method gives `3221225473` - which is larger than the dividend! The correct result is `1073741825`. – BeeOnRope Nov 07 '16 at 02:03
@BeeOnRope The problem there is that `int p = 2147483649` is actually `int p = -2147483647`. If you perform right shift with sign propagation, the result is still correct. If you want to represent that number you need to use wider integer types such as `long`. For treating it as an unsigned, you would just need to change your `int i` by an `unsigned i` and `printf` formatting string to `%u`. – Jorge Bellon Nov 07 '16 at 16:39
Yes, I understand the issue, but it's not clear the transformation you propose will work either. If you have an answer, post it! – BeeOnRope Nov 07 '16 at 17:14
The best way to make this work for 32 bit unsigned integers is to make the function accept 64 bit signed integers, which all 32 bit unsigned can safely promote to. If you make the function use logical right shift (easy, all `unsigned` shifts are logical) then it will work for all positive unsigned values but not zero. You can work around that with a conditional test (seen in other answers). – QuestionC Nov 10 '16 at 14:59

Efficiently dividing unsigned value by a power of two, rounding up

Platform Details

Benchmark

Results

Overall

Analysis

Auto-vectorization

Branches vs Branch-free

Detailed Test Results

Latency

Throughput

9 Answers9

Chux / user23_variant

Linked