33

For min(ctz(x), ctz(y)), we can use ctz(x | y) to gain better performance. But what about max(ctz(x), ctz(y))?

ctz represents "count trailing zeros".
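A quick compile-time check of the min trick (example values chosen arbitrarily, assuming a C++20 compiler):

#include <algorithm>
#include <bit>

// 12 = 0b1100 (ctz 2), 40 = 0b101000 (ctz 3), 12 | 40 = 44 = 0b101100 (ctz 2)
static_assert(std::countr_zero(12u | 40u) ==
              std::min(std::countr_zero(12u), std::countr_zero(40u)));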

C++ version (Compiler Explorer)

#include <algorithm>
#include <bit>
#include <cstdint>

int32_t test2(uint64_t x, uint64_t y) {
    return std::max(std::countr_zero(x), std::countr_zero(y));
}

Rust version (Compiler Explorer)

pub fn test2(x: u64, y: u64) -> u32 {
    x.trailing_zeros().max(y.trailing_zeros())
}
QuarticCat
  • Unit tests: https://godbolt.org/z/1hY4ch9sh – Marek R Jun 01 '23 at 11:37
  • Note that specifying the processor architecture changes the code to something nicer. In that case clang nails it and makes it branchless: https://godbolt.org/z/dWse6hxbY – Marek R Jun 01 '23 at 12:04
  • On ARM, `ctz(x)` is implemented as `clz(rbit(x))`. And since we have `max(clz(x), clz(y)) = clz(min(x,y))`, that lets us do `clz(min(rbit(x), rbit(y)))` which saves one `clz`. (And `min` is easy to do branchless on this architecture.) So it probably helps to know how your architecture actually does `ctz`. – Nate Eldredge Jun 02 '23 at 03:44
  • Any specific architectures you care about? A lot of discussion so far has involved modern x86. Can you assume BMI1 instructions? Are zeroed inputs possible, which would require care if using x86 `bsf`. – Peter Cordes Jun 02 '23 at 09:16
  • @PeterCordes In my actual work, I mainly focus on x86_64 and aarch64 with default target flag and native target flag. But I'm glad to see people discuss different situations. I don't want this question to be too specific to be helpless to others who viewed this page. – QuarticCat Jun 02 '23 at 12:14
  • Ok, thanks. So Nate's comment about `ctz` being implemented in terms of `rbit` is relevant. (BTW, the English word you're looking for is "useless"; "helpless" means unable to help (especially defend) *itself*, e.g. like a newborn kitten.) – Peter Cordes Jun 02 '23 at 14:30
  • @PeterCordes Yeah, so I upvoted that comment. And thanks for teaching me English. :) – QuarticCat Jun 02 '23 at 14:50

4 Answers

24

I don't think there's anything better than the naive approach for the maximum. One attempt is using the identity

x + y = min(x, y) + max(x, y)

and thus

max(ctz(x), ctz(y)) = ctz(x) + ctz(y) - min(ctz(x), ctz(y))

This way, we can reduce the max function to the min function we already optimized (substituting min(ctz(x), ctz(y)) = ctz(x | y) from the question), albeit with a few additional operations.
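For a quick compile-time check of this identity (C++ shown; example values chosen arbitrarily):

#include <bit>

// x = 12 = 0b1100 has 2 trailing zeros, y = 40 = 0b101000 has 3,
// and x | y = 44 = 0b101100 has 2, so 2 + 3 - 2 = 3 = max(2, 3).
static_assert(std::countr_zero(12u) + std::countr_zero(40u)
                  - std::countr_zero(12u | 40u) == 3);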

Here are some Rust implementations of the different approaches:

pub fn naive(x: u64, y: u64) -> u32 {
    x.trailing_zeros().max(y.trailing_zeros())
}

// ctz(x) + ctz(y) - min(ctz(x), ctz(y)), with the min computed as ctz(x | y)
pub fn sum_minus_min(x: u64, y: u64) -> u32 {
    x.trailing_zeros() + y.trailing_zeros() - (x | y).trailing_zeros()
}

// nielsen's answer below: isolate the lowest set bit of each input and keep
// the higher-positioned of the two before a single ctz
pub fn nielsen(x: u64, y: u64) -> u32 {
    let x_lsb = x & x.wrapping_neg();
    let y_lsb = y & y.wrapping_neg();
    let xy_lsb = x_lsb | y_lsb;
    let lsb = xy_lsb & xy_lsb.wrapping_neg();
    let xy_max_lsb = if xy_lsb == lsb { lsb } else { xy_lsb ^ lsb };
    xy_max_lsb.trailing_zeros()
}

// Matt Timmermans' answer below: build masks of the trailing zeros and
// popcount their union
pub fn timmermans(x: u64, y: u64) -> u32 {
    let loxs = !x & x.wrapping_sub(1);
    let loys = !y & y.wrapping_sub(1);
    return (loxs | loys).count_ones();
}

// Martin Kealey's answer below: ctz((x | -x) & (y | -y))
pub fn kealey(x: u64, y: u64) -> u32 {
    ((x | x.wrapping_neg()) & (y | y.wrapping_neg())).trailing_zeros()
}

Results on my machine:

ctz_max/naive           time:   [279.09 ns 279.55 ns 280.10 ns]
ctz_max/sum_minus_min   time:   [738.91 ns 742.87 ns 748.61 ns]
ctz_max/nielsen         time:   [935.35 ns 937.63 ns 940.40 ns]
ctz_max/timmermans      time:   [803.39 ns 806.98 ns 810.76 ns]
ctz_max/kealey          time:   [295.03 ns 295.93 ns 297.03 ns]

The naive implementation beats all other implementations. The only implementation that can compete with the naive one is the approach suggested by Martin Kealey. Note that the actual factors between the implementations may be even higher than the timings indicate, due to some overhead of the test harness.

It's clear that any alternative has only a couple of CPU instructions of headroom over the naive implementation, so I don't think there is anything you can do. For reference, here is the assembly emitted by the Rust compiler when these implementations are compiled as standalone functions on a modern x86_64 processor:

example::naive:
        tzcnt   rcx, rdi
        tzcnt   rax, rsi
        cmp     ecx, eax
        cmova   eax, ecx
        ret

example::sum_minus_min:
        tzcnt   rcx, rdi
        tzcnt   rax, rsi
        add     eax, ecx
        or      rsi, rdi
        tzcnt   rcx, rsi
        sub     eax, ecx
        ret

example::nielsen:
        blsi    rax, rdi
        blsi    rcx, rsi
        or      rcx, rax
        blsi    rax, rcx
        xor     edx, edx
        cmp     rcx, rax
        cmovne  rdx, rcx
        xor     rdx, rax
        tzcnt   rax, rdx
        ret

example::timmermans:
        lea     rax, [rdi - 1]
        andn    rax, rdi, rax
        lea     rcx, [rsi - 1]
        andn    rcx, rsi, rcx
        or      rcx, rax
        xor     eax, eax
        popcnt  rax, rcx
        ret

example::kealey:
        mov     rax, rdi
        neg     rax
        or      rax, rdi
        mov     rcx, rsi
        neg     rcx
        or      rcx, rsi
        and     rcx, rax
        tzcnt   rax, rcx
        ret

In the benchmarks I ran, the functions get inlined, the loops are partially unrolled, and some subexpressions are pulled out of the inner loops, so the assembly looks a lot less clean than the above.

For testing, I used Criterion. Here is the additional code:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

const NUMBERS: [u64; 32] = [
    ...
];

fn bench<F>(func: F)
where
    F: Fn(u64, u64) -> u32,
{
    for x in NUMBERS {
        for y in NUMBERS {
            black_box(func(x, y));
        }
    }
}

fn compare(c: &mut Criterion) {
    let mut group = c.benchmark_group("ctz_max");
    group.bench_function("naive", |b| b.iter(|| bench(naive)));
    group.bench_function("sum_minus_min", |b| b.iter(|| bench(sum_minus_min)));
    group.bench_function("nielsen", |b| b.iter(|| bench(nielsen)));
    group.bench_function("timmermans", |b| b.iter(|| bench(timmermans)));
    group.bench_function("kealey", |b| b.iter(|| bench(kealey)));
}

criterion_group!(benches, compare);
criterion_main!(benches);

NUMBERS was generated with this Python code, with the intention of making branch prediction for the max() function as hard as possible:

import random

[
    random.randrange(2 ** 32) * 2 ** random.randrange(32)
    for dummy in range(32)
]

I'm running the benchmark using

RUSTFLAGS='-C target-cpu=native -C opt-level=3' cargo bench

on an 8th generation i7 processor (Whiskey Lake).

Sven Marnach
  • You might want to accumulate a sum of all the results and throw if it's incorrect, just to make sure that nothing important is being optimized away. Also use -O3, and anything you might need to do to enable inlining in rust. – Matt Timmermans Jun 01 '23 at 17:09
  • @MattTimmermans `cargo bench` does optimized builds automatically. The default is using the `-O` option to rustc, which is equivalent to `-O2` for clang. I tried with `-O opt-level=3` as well, which degrades the naive implementation by 5% and improves all other versions by 5%. I used `black_box()` to avoid that the function return values are optimized away. If I remove `black_box()`, the entire code is optimized away, and all timings are exactly 0. Inlining happens automatically in optimized builds, and I verified the assembly to ensure that the functions actually got inlined. – Sven Marnach Jun 01 '23 at 17:18
  • Unfortunate that Rustc/LLVM picked `cmova` which is 2 uops ([since it needs 4 inputs including CF and the SPAZO group for ZF](https://stackoverflow.com/questions/49867597/what-is-a-partial-flag-stall/49868149#49868149)), instead of `cmovb` or `cmovae` which are only 1 uop on Broadwell and later, including Skylake-family. (They only need CF.) Yeah, really hard to be 2x `tzcnt` / `cmp`/`cmov`, especially on AMD CPUs or Skylake or later where `tzcnt` doesn't [have false dependencies](https://stackoverflow.com/questions/21390165/). Its 1/clock throughput on Intel is almost certainly fine. – Peter Cordes Jun 02 '23 at 09:22
  • Given the variation in timings, and LLVM's general recklessness with false dependencies (preferring not to spend uops on xor-zeroing unless it fully sees the loop containing the false dep), it might be bottlenecking on tzcnt latency not throughput in some of the tests? But no, your Whiskey Lake CPU doesn't have tzcnt false deps so that can't be it. – Peter Cordes Jun 02 '23 at 09:27
  • @PeterCordes The actual benchmark timings are rather noisy, and the full assembly of the functions inlined into the benchmarking loop is rather complex and hard to understand. From the machine code of the isolated functions alone, it's impossible to explain the timings I've observed, and the timings vary based on factors like whether the functions are defined in the same crate, even if they are inlined. However, one result has been consistent: Whatever I did, the naive implementation was fastest on my machine. – Sven Marnach Jun 02 '23 at 09:39
  • I'm flattered to have come second. – Martin Kealey Jun 02 '23 at 13:00
  • @MartinKealey Unfortunately, it's not as clear-cut as the simple timings I presented indicate. The main reason your version performed so well is that it somehow got better optimizations after loop-unrolling. As a standalone function, it's not really significantly better than the other options. Take the benchmark with a big grain of salt. – Sven Marnach Jun 02 '23 at 13:15
  • RISC-V without extension B doesn't have a popcount or countr_zero / countl_zero, unfortunately, so versions that do a bit of bit-manipulation to prepare an input for one `ctz` are probably worth it on RV64. Like perhaps Timmermans's or @MartinKealey versions. The x86 version needed some `mov` register-copy instructions, but a 3-operand RISC wouldn't. – Peter Cordes Jun 02 '23 at 17:04
  • ARM/AArch64 does `ctz` as `clz(rbit)` with scalar instructions, but popcount needs to copy to a SIMD vector register and back for `cnt` / `uaddlv`, so any transformation to `popcount` will be even worse than needing `rbit` ahead of a cheap `clz`. (https://godbolt.org/z/Pr5dMEbn4). Anyway, spending more instructions to set up for a single `ctz` is still not worth it on AArch64, but is less bad than on x86 with BMI1. (In other cases with clz but not ctz, and without cheap bit-rev (e.g. SIMD with AVX-512 `vplznctq`) another `czt` for non-zero inputs is `63-clz(isolate_low_bit(x))`.) – Peter Cordes Jun 02 '23 at 19:40
  • (As the Godbolt link shows, `ctz` is not a disaster without HW support on RISC-V; clang isolates the lowest set bit and multiplies by a 64-bit de Bruijn sequence which leaves the 6-bit position in the top 6 bits. Err wait, a 6-bit index to a lookup table of bytes :/ Yuck.) So it's not great, especially with it branching on zero as a special case for each `clz` in the "naive" version.) – Peter Cordes Jun 02 '23 at 19:48
18

These are equivalent:

  • max(ctz(a),ctz(b))
  • ctz((a|-a)&(b|-b))
  • ctz(a)+ctz(b)-ctz(a|b)

The math-identity ctz(a)+ctz(b)-ctz(a|b) requires 6 CPU instructions, parallelizable to 3 steps on a 3-way superscalar CPU:

  • 3× ctz
  • 1× bitwise-or
  • 1× addition
  • 1× subtraction

The bit-mashing ctz((a|-a)&(b|-b)) requires 6 CPU instructions, parallelizable to 4 steps on a 2-way superscalar CPU:

  • 2× negation
  • 2× bitwise-or
  • 1× bitwise-and
  • 1× ctz

The naïve max(ctz(a),ctz(b)) requires 5 CPU instructions, parallelizable to 4 steps on a 2-way superscalar CPU:

  • 2× ctz
  • 1× comparison
  • 1× conditional branch
  • 1× load/move (so that the "output" is always in the same register)

... but note that branch instructions can be very expensive.

If your CPU has a conditional load/move instruction, this reduces to 4 CPU instructions taking 3 super-scalar steps.

If your CPU has a max instruction (e.g. SSE4), this reduces to 3 CPU instructions taking 2 super-scalar steps.

All that said, the opportunities for super-scalar operation depend on which instructions you're trying to put against each other. Typically you get the most by putting different instructions in parallel, since they use different parts of the CPU (all at once). Typically there will be more "add" and "bitwise or" units than "ctz" units, so doing multiple ctz instructions may actually be the limiting factor, especially for the "math-identity" version.

If "compare and branch" is too expensive, you can make a non-branching "max" in 4 CPU instructions. Assuming A and B are positive integers:

  1. C = A-B
  2. subtract the previous carry (the borrow from step 1), plus D, from D itself (D is now either 0 or -1, depending on whether step 1 borrowed, regardless of whatever value it previously held; on x86 this is a single `sbb D, D`)
  3. C &= D (C is now min(0, A-B))
  4. A -= C (A' is now max(A,B))
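A minimal C-style sketch of the 4-step recipe above (my own illustration, not code from the answer), using unsigned 64-bit values so the wrap-around is well defined. Since C++ has no direct access to the carry flag, the borrow from step 1 is modelled with `(a < b)`:

#include <cstdint>

// Branchless max of two unsigned 64-bit values, following the 4 steps above.
uint64_t branchless_max(uint64_t a, uint64_t b) {
    uint64_t c = a - b;                             // step 1: C = A - B (wraps if A < B)
    uint64_t d = 0 - static_cast<uint64_t>(a < b);  // step 2: D = all-ones if step 1 borrowed, else 0
    c &= d;                                         // step 3: C = min(0, A - B) in two's complement
    return a - c;                                   // step 4: A - C = max(A, B)
}

For example, branchless_max(3, 7) returns 7, and branchless_max(7, 3) returns 7.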
Martin Kealey
  • I like the second option. It is the simplest alternative to the naive solution and I think what the OP was looking for (though theoretically the language lawyer must use `~a+1` instead of `-a` until C23 specifies twos complement). – nielsen Jun 01 '23 at 20:45
  • @nielsen `-a` is already OK for unsigned types (though MSVC may unreasonably complain and force you to write `0 - a` instead, which is also OK) E: here's a reference, https://stackoverflow.com/q/8026694/555045 – harold Jun 01 '23 at 21:03
  • Also note that every CPU with SSE4 has native max instructions for 64-bit integers. – Sven Marnach Jun 01 '23 at 21:18
  • The second option is comparable with the naive one on Haswell and Skylake with default compile flags (i.e. no `tzcnt`), according to llvm-mca https://godbolt.org/z/a81ceGWPc. Although llvm-mca shows the naive one costs a bit fewer instructions, that's because it cannot predict branch cost. I believe that is as far as we can go, so I'm going to accept this answer. With `tzcnt`, maybe no code can beat the naive one. – QuarticCat Jun 01 '23 at 21:23
  • And I'm glad to learn some bit tricks that I didn't know (`a | -a`). – QuarticCat Jun 01 '23 at 21:28
  • @harold You are right, of course. So I like the second option. Full stop :-) – nielsen Jun 01 '23 at 21:28
  • I agree this answers the question. However, I am still a bit curious about how the options compare when the operands are wider than what can be handled by single instructions on the CPU. E.g. 32-bit operands on an 8-bit CPU. – nielsen Jun 01 '23 at 21:36
  • @nielsen the number of CPU instructions would directly scale with the number of ALU words per operand. The only notable point is that 8-bit CPUs generally lacked 3-operand opcodes, so that `c=a-b` would need to be split into `c=a` `c-=b`. And of course they lacked any superscalar capability, so the timing would be directly proportional to the number of CPU instructions plus some overhead for a branch. – Martin Kealey Jun 02 '23 at 08:10
  • @nielsen agreed with the language lawyering. I was writing C-like instructions intending them to be read as CPU instructions, and all current (non virtual) CPUs do indeed use 2's complement. – Martin Kealey Jun 02 '23 at 08:17
  • Note that non-branching max is usually implemented using a conditional move, e.g. `cmov` on x86_64. – Sven Marnach Jun 02 '23 at 08:20
  • @SvenMarnach noted and fixed – Martin Kealey Jun 02 '23 at 13:23
11

You can do it like this:

#include <algorithm>
#include <bit>
#include <cstdint>

int32_t maxr_zero(uint64_t x, uint64_t y) {
    uint64_t loxs = ~x & (x-1); // low zeros of x
    uint64_t loys = ~y & (y-1); // low zeros of y
    return std::countr_zero((loxs|loys)+1); // (loxs|loys)+1 has exactly max(ctz(x), ctz(y)) trailing zeros
}
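harold suggests in the comments below that the trailing `+1` and `countr_zero` can be replaced by a popcount, since `loxs | loys` is a contiguous run of exactly max(ctz(x), ctz(y)) one-bits. A sketch of that variant (the function name is mine):

#include <bit>
#include <cstdint>

int32_t maxr_zero_popcount(uint64_t x, uint64_t y) {
    uint64_t loxs = ~x & (x-1); // low zeros of x
    uint64_t loys = ~y & (y-1); // low zeros of y
    return std::popcount(loxs | loys); // count the run of low ones directly
}

This is the same approach as the `timmermans` function benchmarked in the Rust answer above.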
Matt Timmermans
  • Even something as simple as this will already use far too many CPU instructions to compete with the naive implementation. CTZ is a single, fast machine instruction on modern CPUs, so the naive implementation is really hard to beat. – Sven Marnach Jun 01 '23 at 12:23
  • had a bug. fixed it. – Matt Timmermans Jun 01 '23 at 12:27
  • Now we're talking beauty! It passes [the enhanced unit test](https://godbolt.org/z/nYvjosdeq) – Ted Lyngmo Jun 01 '23 at 12:34
  • I benchmarked a Rust version of this, and it's much slower than the naive implementation. – Sven Marnach Jun 01 '23 at 12:41
  • @SvenMarnach I don't think there's any way to do `max` without a branch. – Matt Timmermans Jun 01 '23 at 12:49
  • Nice solution and computationally better than my answer (fewer operations and avoids branching). I would not put too much into benchmarks for a particular system. It is plausible to claim that this method might be faster on some systems in some cases (e.g. if the bit width of the input is larger than the CPU bit width or if there is no CTZ CPU instruction). – nielsen Jun 01 '23 at 12:58
  • @MattTimmermans That's not true. For floating-point numbers there have been native max and min instructions for a long time, and there are variants for integers in SSE4. But even without using these, I benchmarked it, and branching is actually faster than any of the more complex approaches. The naive version is the fastest one. (It looks like people don't like this answer, but what can I do.) – Sven Marnach Jun 01 '23 at 12:59
  • Both GCC and Clang used `cmov` to implement the `max` (but GCC also goes nuts and reintroduces a redundant branch to test whether `y` is zero, and a redundant `test`/`cmov` pair to test if `x` is zero) – harold Jun 01 '23 at 13:00
  • Oh, right. I'm not used to thinking about x86 assembler. A naive version that uses `cmov` for the `max` can be strictly faster. – Matt Timmermans Jun 01 '23 at 13:02
  • @SvenMarnach I do not doubt that you are right that the naive version is the fastest for 64 bit operands on the system you are benchmarking on. I only doubt that the same conclusion holds for all (including embedded) systems around and for larger operand sizes. – nielsen Jun 01 '23 at 13:36
  • I think you can improve this slightly by using `std::popcount(loxs | loys)`. Just saves one addition but hey it's something – harold Jun 01 '23 at 14:09
  • @harold Doing that degrades performance significantly on my machine. The point of this question appears to be beauty rather than performance, and your suggestion would make an improvement in that regard. – Sven Marnach Jun 01 '23 at 14:19
  • @SvenMarnach shouldn't happen, did you forget to allow the compiler to use modern instructions? `popcnt` shouldn't be slower than `tzcnt` (or `bsf`) except on some oddball CPUs like AMD Piledriver, but it would work out that way without modern instructions: if `countr_zero` was done with `bsf` and `popcount` emulated. – harold Jun 01 '23 at 14:28
  • @harold I'm using `RUSTFLAGS='-C target-cpu=native' cargo bench` on an 8th-generation i7 processor. It's from 2019, but it should have popcount and all. I'll look at the generated code. – Sven Marnach Jun 01 '23 at 14:36
  • @harold I just reran it, and now the performance actually improved with this change. It's still nowhere close to the naive implementation, but I'll update my answer. – Sven Marnach Jun 01 '23 at 14:39
  • @SvenMarnach: `std::popcount(loxs | loys)` should compile to a `popcnt` instruction, which has a false dependency on your Skylake-derived CPU. Unlike `countr_zero` (`tzcnt`). Other than that, they have the same latency and throughput for the same execution unit. If Rustc/LLVM isn't careful about [the false dependency](https://stackoverflow.com/q/21390165/224132) (which it often isn't), it could [bottleneck](https://stackoverflow.com/q/25078285/224132) on latency instead of throughput. Or maybe your benchmark loop is partially defeated by some compiler optimizations? – Peter Cordes Jun 02 '23 at 17:21
  • Unless like harold suggested you compile for baseline x86-64 where `bsf` is available but not `popcnt`. You wouldn't have this problem with x86-64-v2 (SSE4.2 + popcnt) or v3 (AVX2+FMA+BMI1/2 for `tzcnt`). In fact, with x86-64-v2 (`popcnt` but not `tzcnt`), using `popcnt` avoids any problems with all-zero inputs that compilers would have to branch to avoid, since only `tzcnt` produces the `64` that `countr_zero` wants. https://godbolt.org/z/cvsrqoees But it's a lot of bithack work without `andn`, so probably still worse than the "naive" version. – Peter Cordes Jun 02 '23 at 17:30
1

I am not sure whether it is faster, but this function takes x and y and computes the input to ctz for obtaining the max value:

uint64_t getMaxTzInput(uint64_t x, uint64_t y)
{
   uint64_t x_lsb = x & (~x + 1);  // Least significant 1 of x
   uint64_t y_lsb = y & (~y + 1);  // Least significant 1 of y
   uint64_t xy_lsb = x_lsb | y_lsb;  // Least significant 1s of x and y (could be the same)
   uint64_t lsb = (xy_lsb) & (~(xy_lsb)+1);  // Least significant 1 among x and y

   // If the least significant 1s are different for x and y, remove the least significant 1
   // to get the second least significant 1.
   uint64_t xy_max_lsb = (xy_lsb == lsb) ? lsb : xy_lsb ^ lsb;
   return xy_max_lsb;
}

Thus, ctz(getMaxTzInput(x,y)) should at least give the correct value with only one call of ctz.
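Hypothetical usage (my own example, assuming getMaxTzInput from above is in scope; as Peter Cordes notes in the comments below, both inputs must be non-zero):

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t x = 0x50; // 0b1010000, ctz = 4
    uint64_t y = 0x0C; // 0b0001100, ctz = 2
    // One ctz call on the combined input gives max(ctz(x), ctz(y)):
    std::printf("%d\n", std::countr_zero(getMaxTzInput(x, y))); // prints 4
}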

nielsen
  • ... and it's passing [Marek's unit test](https://godbolt.org/z/W11xKdWWW) – Ted Lyngmo Jun 01 '23 at 12:21
  • ... and it's passing my enhanced version of [Marek's unit test](https://godbolt.org/z/s6xM76Wa8) too which includes the case `{0, 0, 64}` and also checks for UB (which my own solution failed). – Ted Lyngmo Jun 01 '23 at 12:32
  • But it's still much slower and much more complex than the naive implementation. (I measured with a Rust version of this code.) – Sven Marnach Jun 01 '23 at 12:35
  • Note that `(~x + 1)` is just a fancy way of writing `-x`. – Sven Marnach Jun 01 '23 at 19:10
  • Your code assumes both values are non-zero. `max_ctz(2,0)` should be 64 if done the naive way, but your function returns 2, so ctz(2)==1. But for the case of non-zero inputs, we can simplify the final step. `lsb = xy_lsb & (xy_lsb - 1);` (clear the lowest set) `return lsb ? lsb : xy_lsb`. If clearing the lowest bit of the OR result produced zero, the bits were at the same place, so return the value from before doing that. i.e. just a `cmov` or `csel` using flags from the `and` or `blsr`. (5 instructions vs. your 8 with x86 BMI1, or 8 vs. 10 with AArch64: https://godbolt.org/z/73j7xzedf) – Peter Cordes Jun 03 '23 at 17:47
  • `63-clz(xy_lsb)` would also work, which IIRC can be done with `clz(xy_lsb) ^ 63` since we know the result will be in the 0..63 range for non-zero inputs. So BMI1 `blsi` on x, y, `or`, `lzcnt`, `xor reg,63`. (https://godbolt.org/z/5ePPzdfh6) Or just `bsr` (returns bit-index of highest set bit if non-zero) instead of `lzcnt` (returns leading-zero count) since we only care about non-zero inputs, if we could convince a compiler to emit `bsr` when BMI1 was available. `bsr` is slow on AMD, though. I know the XOR trick is from GCC using `bsr` / `xor` to implement `countr_zero` / `__builtin_clz`. – Peter Cordes Jun 03 '23 at 17:53