
On Intel AVX, branchless code is possible: instead of branching to pick case 0 or case 1, you can compute both cases and blend the results based on a condition.

AVX does this 8-wide for float using the vblendvps instruction.
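
For illustration, a minimal sketch of this compute-both-then-blend idea using AVX intrinsics (the function name and the less-than predicate here are just examples):

#include <immintrin.h>

// result[i] = (a[i] < b[i]) ? x[i] : y[i], eight floats at once, no branch
__m256 select8(__m256 a, __m256 b, __m256 x, __m256 y)
{
    __m256 mask = _mm256_cmp_ps(a, b, _CMP_LT_OQ); // all-ones lanes where a < b
    return _mm256_blendv_ps(y, x, mask);           // take x where mask is set, else y
}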

You can also do this in a scalar way, without vectors, using the x86 instruction CMOVcc, which performs a conditional move.

NOTE: ARM has CSEL and NEON has VBSL.

Can RISC-V (RV64) do a scalar conditional move like this, so that you do not have to branch for

a = c ? x : y;

As I understand it, RISC-V implementations are typically in-order, so they would benefit even more than x86 from not having to branch. (The latter can at least shuffle some instructions around, and even execute branches speculatively, to hide latency.)

The closest I can find w.r.t. branchless operation for RISC-V is SLT (Set Less Than), but that sets the destination to 1 or 0, and would then need multiplications? Wouldn't it be more useful to have SLT set -1 or 0 instead, so that we can AND with that?
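
In C, the 0/-1 mask idea would look something like the following sketch (it relies on the comparison evaluating to 0 or 1):

// result = (a < b) ? x : y, using a 0 / all-ones mask instead of a branch
int select_lt(int a, int b, int x, int y)
{
    int mask = -(a < b);              // all-ones if a < b, else 0
    return (x & mask) | (y & ~mask);  // keep x where mask is set, y elsewhere
}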

UPDATE

When doing:

int foo(int a, int b, int x, int y)
{
    return a < b ? x : y;
}

I tried a poor man's version of branchless using SLT. I am not sure I did it completely right; using a bitmask computed as 0 - condition (where the condition is 0 or 1), I came up with:

branchless:
    SLT t0,a0,a1        # t0 = (a < b) ? 1 : 0
    SUB t0,zero,t0      # t0 = 0 - t0: all-ones mask if a < b, else 0
    NOT t1,t0           # t1 = ~mask
    AND t0,a2,t0        # t0 = x & mask
    AND t1,a3,t1        # t1 = y & ~mask
    OR  a0,t0,t1        # result = (x & mask) | (y & ~mask)
    RET
    .size   branchless, .-branchless

as the branchless version of:

branched:
    BGE a0,a1,.L2       # skip the move if a >= b
    MV  a3,a2           # a3 = x when a < b
.L2:
    MV  a0,a3           # return a3 (x or y)
    RET
    .size   branched, .-branched

I wonder if I used too many instructions for this. On random data, I measured the branching version to be slightly faster than the non-branching one, but not by much.
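
A minimal harness for this kind of measurement could look like the following (a sketch; the rand()-filled arrays stand in for the random data mentioned above):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// The two versions above, assembled and linked into the same program.
int branched(int a, int b, int x, int y);
int branchless(int a, int b, int x, int y);

static long run_ns(int (*f)(int, int, int, int), const int *a, const int *b, int n)
{
    struct timespec t0, t1;
    volatile int sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        sink += f(a[i], b[i], 1, 2);   // indirect call keeps both versions comparable
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    enum { N = 1 << 20 };
    static int a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = rand(); b[i] = rand(); }  // unpredictable comparisons
    printf("branched:   %ld ns\n", run_ns(branched, a, b, N));
    printf("branchless: %ld ns\n", run_ns(branchless, a, b, N));
    return 0;
}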

Bram
    *and then would need multiplications?* - or `sub` / `and` on the opposite condition. Yes, for bithacks and branchless stuff, 0 / -1 would be more useful. But since C implementations typically use a `bool` whose object representation must be 0 / 1 to allow cheaper conversion to int, that's what MIPS and RISC-V did for their compare-into-register instructions. (And/or possibly other reasons.) – Peter Cordes May 22 '22 at 21:45
  • https://docs.boom-core.org/en/latest/sections/intro-overview/boom.html – Erik Eidt May 22 '22 at 21:52
    @Bram: To get your 0/-1 mask, you do `mask = (c==0)-1`. Like x86 `test`/`setcc` / `dec`. It might or might not actually be worth doing on RISC-V, depending on the microarchitecture and how unpredictable it is. – Peter Cordes May 22 '22 at 21:57
  • @PeterCordes Ah, got it. Thanks again. I will try to bench that against branch on the Sipeed Lichee RV86 that I have here. – Bram May 22 '22 at 21:59
  • They left out shift and add (scaling addition) from the original, apparently thinking that compressed instructions plus fusing would do the job. (The compressed instructions are limited to a subset of registers.) – Erik Eidt May 22 '22 at 22:38

1 Answer


The proposed RISC-V extension B includes cmov (with 4 operands: 3 inputs and a separate destination!).

I think David Patterson (one of the lead architects behind MIPS and RISC-V) really dislikes cmov (along with short-vector SIMD like SSE/AVX) and thinks CPUs should specially handle "hammock" branches (that jump forward over a single instruction like a move) if they want to do that. Something like that. So this seems to be a case of philosophical purity getting in the way of including useful instructions. (AArch64 is a much more pragmatic design, still being RISC in the ways that matter for a high-performance implementation.)

And/or perhaps a desire to limit instructions to at most 2 inputs, if there aren't any other 3-input instructions. That means a scalar pipeline only needs 2 register read ports, not 3, if it strictly follows this restriction. (That also means no add-with-carry, making extended-precision math quite a pain for numbers wider than 2 registers, when you have to deal with carry-in and carry-out to the same add operation.)
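
For example, a 128-bit add in C has to recompute the carry-out of the low half with a compare, which costs an extra sltu per limb on RISC-V; something like this sketch:

typedef unsigned long long u64;

// 128-bit add without add-with-carry: the carry-out of the low half
// must be recomputed with a compare (one sltu per limb on RISC-V).
void add128(u64 alo, u64 ahi, u64 blo, u64 bhi, u64 *lo, u64 *hi)
{
    u64 l = alo + blo;
    u64 carry = l < alo;       // unsigned wraparound means there was a carry
    *lo = l;
    *hi = ahi + bhi + carry;
}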

You can emulate cmov as you say with a mask for AND/ANDnot/OR, but that would take quite a few instructions and is usually not worth it except possibly on wide and deep out-of-order machines, where the amount of work discarded by a branch miss is a lot bigger. (mask = (c == 0) - 1;, which you can compute with sltiu / addi reg, reg, -1 to turn 0 into -1 and 1 into 0.)
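
In C, that emulation is something like the following sketch, where each step maps onto the instruction sequence just described:

// cmov emulation: mask = (c == 0) - 1 is 0 when c == 0, all-ones otherwise.
int cmov_emu(int x, int y, int c)
{
    int mask = (c == 0) - 1;          // sltiu, then addi reg, reg, -1
    return (x & mask) | (y & ~mask);  // and / not+and / or
}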

You kind of have it backwards in terms of which kind of microarchitecture benefits more from CMOV, although there are potential benefits either way. An in-order machine already kind of has to wait at a conditional branch for the condition to resolve, whereas an out-of-order machine treats control dependencies very differently from data dependencies. As discussed in gcc optimization flag -O3 makes code slower than -O2, data dependencies through cmov can create a loop-carried dependency chain that's a bigger bottleneck than highly predictable branches.
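
To make that concrete, here is the classic shape where a conditional move hurts (a generic sketch, not taken from the linked Q&A): the selected value feeds the next iteration's compare, so a cmov puts the select on the loop-carried critical path, while a correctly predicted branch lets execution run ahead:

// Index of the maximum element: 'best' from one iteration feeds the next.
int max_index(const int *a, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        best = (a[i] > a[best]) ? i : best;  // branchless select lengthens the dep chain
    return best;
}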

There are some out-of-order exec RISC-V designs, maybe even some that are open-source. For example, Erik Eidt linked The Berkeley Out-of-Order Machine (BOOM).


Extension B: where they put all the fun instructions they left out

The RISC-V extension B proposal has a conditional move, along with scalar min/max, popcount, leading/trailing zero count, bitfield insert/extract, two-register shifts, and a bunch of more esoteric stuff. https://five-embeddev.com/riscv-bitmanip/draft/bext.html#conditional-move-cmov

Looking at the list of proposed instructions, it's amazing what got left out of baseline RISC-V, like sign-extension of narrow integers (currently requires slli/srai) if it's not already guaranteed by the calling convention or a load instruction, and standard stuff like popcount and leading/trailing zero count that most ISAs have.

Godbolt shows clang 12.0 using cmov, min, and sext.b. In that clang version, -O3 -Wall -menable-experimental-extensions -march=rv32gcb0p93 was the magic incantation to do that. Extension B 0.93 is enabled by the b0p93 part of the string. (Extension B isn't finalized, and I don't know what version clang 14.0 was looking for; its error message wasn't helpful, and just plain -march=rv32gcb didn't get the compiler to actually use cmov.)

//  -march=rv32gcb0p93 includes extension b 0.93 (0p93)

int sel(int x, int y, int c){
    return c ? x : y;
}
# extension B  clang
        cmov    a0, a2, a0, a1
        ret

# baseline gcc11.3  (clang and GCC12 waste several mv instructions)
        bne     a2,zero,.L2
        mv      a0,a1
.L2:
        ret

int min(int x, int y, int c){
    return (x<y) ? x : y;
}
# extension B  clang
        min     a0, a0, a1
        ret

# baseline gcc
        ble     a0,a1,.L5
        mv      a0,a1
.L5:
        ret

int sext(int c){
    return (signed char)c;
}
# extension B  clang
        sext.b  a0, a0
        ret

# baseline gcc
        slli    a0,a0,24
        srai    a0,a0,24
        ret
Peter Cordes
  • Thanks for the elaborate answer. A great find! I guess the lack of conditional moves really starts to hurt when doing SIMD, because you can't branch differently for each element. I guess you would really need that B extension when you use vectors? – Bram May 22 '22 at 23:30
  • @Bram: Existence of scalar `cmov` or not is unrelated to what SIMD instructions are available. RISC-V vector extensions (which allow hardware to provide whatever actual length it wants, like ARM SVE) I think have first-class masking support like AVX-512, at least for loads/stores. And I think they have compare-into-mask. There was also a RISC-V proposed extension with short-vector SIMD; IDK if it had an instruction like x86 `vblendvps` or if you'd have to emulate it like x86 used to need before SSE4.1 with `and`/`andn`/`or`. – Peter Cordes May 23 '22 at 00:06