For RISC-V you're probably using GCC/clang.
Fun fact: GCC knows some of these SWAR bithack tricks (shown in other answers) and can use them for you when compiling code with GNU C native vectors for targets without hardware SIMD instructions. (But clang for RISC-V will just naively unroll it to scalar operations, so you do have to do it yourself if you want good performance across compilers).
One advantage to native vector syntax is that when targeting a machine with hardware SIMD, it will use that instead of auto-vectorizing your bithack or something horrible like that.
It makes it easy to write vector -= scalar operations; the syntax Just Works, implicitly broadcasting aka splatting the scalar for you.
Also note that a uint64_t* load from a uint8_t array[] is strict-aliasing UB, so be careful with that. (See also Why does glibc's strlen need to be so complicated to run quickly? re: making SWAR bithacks strict-aliasing safe in pure C.) You may want something like this to declare a uint64_t that you can pointer-cast to access any other objects, like how char* works in ISO C / C++.

Use these to get uint8_t data into a uint64_t for use with other answers:
// GNU C: gcc/clang/ICC but not MSVC
typedef uint64_t aliasing_u64 __attribute__((may_alias)); // still requires alignment
typedef uint64_t aliasing_unaligned_u64 __attribute__((may_alias, aligned(1)));
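For example, a load helper using the unaligned typedef might look like this (just a sketch; the function name is mine, and it assumes at least 8 readable bytes at p):

// Hypothetical helper: read 8 bytes of a uint8_t buffer as one uint64_t.
// may_alias avoids strict-aliasing UB; aligned(1) drops the alignment requirement.
static inline uint64_t load_u64_swar(const uint8_t *p) {
    return *(const aliasing_unaligned_u64 *)p;
}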
The other way to do aliasing-safe loads is with memcpy into a uint64_t, which also removes the alignof(uint64_t) alignment requirement. But on ISAs without efficient unaligned loads, gcc/clang don't inline and optimize away memcpy when they can't prove the pointer is aligned, which would be disastrous for performance.
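A memcpy version would look something like this (again just a sketch with a made-up name; the whole question is whether the compiler inlines the memcpy into a single load):

#include <string.h>
#include <stdint.h>

// Aliasing-safe and alignment-safe in portable ISO C.
// Only fast if the compiler optimizes the memcpy into one 8-byte load.
static inline uint64_t load_u64_memcpy(const uint8_t *p) {
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}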
TL:DR: your best bet is to declare your data as uint64_t array[...] or allocate it dynamically as uint64_t, or preferably alignas(16) uint64_t array[]; that ensures alignment to at least 8 bytes, or 16 if you specify alignas.
Since uint8_t* is almost certainly unsigned char*, it's safe to access the bytes of a uint64_t via uint8_t* (but not vice versa for a uint8_t array). So for this special case where the narrow element type is unsigned char, you can sidestep the strict-aliasing problem because char is special.
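Putting those two points together, a layout like this sidesteps aliasing entirely (a sketch; the buffer name and size are arbitrary):

#include <stdint.h>
#include <stdalign.h>

alignas(16) uint64_t buf[2];       // storage declared as uint64_t, 16-byte aligned
uint8_t *bytes = (uint8_t *)buf;   // byte view is fine: uint8_t is unsigned char
// fill bytes[0..15], then do SWAR directly on buf[0] and buf[1] with no UB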
GNU C native vector syntax example:
GNU C native vectors are always allowed to alias their underlying type (e.g. int __attribute__((vector_size(16))) can safely alias int, but not float or uint8_t or anything else).
#include <stdint.h>
#include <stddef.h>
// assumes array is 16-byte aligned
void dec_mem_gnu(uint8_t *array) {
typedef uint8_t v16u8 __attribute__ ((vector_size (16), may_alias));
v16u8 *vecs = (v16u8*) array;
vecs[0] -= 1;
vecs[1] -= 1; // can be done in a loop.
}
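A quick usage sketch (the main, buffer name, and initializer are mine; alignas(16) satisfies the function's alignment assumption):

#include <stdalign.h>

int main(void) {
    alignas(16) uint8_t data[16] = {0};
    dec_mem_gnu(data);               // each byte wraps from 0 to 0xFF
    return 0;
}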
For RISC-V without any HW SIMD, you could use vector_size(8) to express just the granularity you can efficiently use, and do twice as many smaller vectors.
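That version would look something like this (a sketch; the typedef and function name are mine, same pattern as above):

typedef uint8_t v8u8 __attribute__ ((vector_size (8), may_alias));

// assumes array is at least 8-byte aligned
void dec_mem_gnu_v8(uint8_t *array) {
    v8u8 *vecs = (v8u8*) array;
    vecs[0] -= 1;
    vecs[1] -= 1;     // two 8-byte vectors cover the same 16 bytes
}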
But vector_size(8) compiles very stupidly for x86 with both GCC and clang: GCC uses SWAR bithacks in GP-integer registers, clang unpacks to 2-byte elements to fill a 16-byte XMM register then repacks. (MMX is so obsolete that GCC/clang don't even bother using it, at least not for x86-64.)
But with vector_size(16) (Godbolt) we get the expected movdqa / paddb (with an all-ones vector generated by pcmpeqd same,same). With -march=skylake we still get two separate XMM ops instead of one YMM, so unfortunately current compilers also don't "auto-vectorize" vector ops into wider vectors :/
For AArch64, it's not so bad to use vector_size(8) (Godbolt); ARM/AArch64 can natively work in 8 or 16-byte chunks with d or q registers.

So you probably want vector_size(16) to actually compile with if you want portable performance across x86, RISC-V, ARM/AArch64, and POWER. However, some other ISAs do SIMD within 64-bit integer registers, like MIPS MSA I think.

vector_size(8) makes it easier to look at the asm (only one register worth of data): Godbolt compiler explorer
# GCC8.2 -O3 for RISC-V for vector_size(8) and only one vector
dec_mem_gnu(unsigned char*):
lui a4,%hi(.LC1) # generate address for static constants.
ld a5,0(a0) # a5 = load from function arg
ld a3,%lo(.LC1)(a4) # a3 = 0x7F7F7F7F7F7F7F7F
lui a2,%hi(.LC0)
ld a2,%lo(.LC0)(a2) # a2 = 0x8080808080808080
# above here can be hoisted out of loops
not a4,a5 # nx = ~x
and a5,a5,a3 # x &= 0x7f... clear high bit
and a4,a4,a2 # nx = (~x) & 0x80... inverse high bit isolated
add a5,a5,a3 # x += 0x7f... (128-1)
xor a5,a4,a5 # x ^= nx restore high bit or something.
sd a5,0(a0) # store the result
ret
I think it's the same basic idea as the other non-looping answers; preventing carry then fixing up the result.
This is 5 ALU instructions, worse than the top answer I think. But it looks like critical-path latency is only 3 cycles, with two chains of 2 instructions each leading to the XOR. @Reinstate Monica - ζ--'s answer compiles to a 4-cycle dep chain (for x86). The 5-cycle loop throughput is bottlenecked by also including a naive sub on the critical path, and the loop does bottleneck on latency.
However, this is useless with clang. It doesn't even add and store in the same order it loaded, so it's not even doing good software pipelining!
# RISC-V clang (trunk) -O3
dec_mem_gnu(unsigned char*):
lb a6, 7(a0)
lb a7, 6(a0)
lb t0, 5(a0)
...
addi t1, a5, -1
addi t2, a1, -1
addi t3, a2, -1
...
sb a2, 7(a0)
sb a1, 6(a0)
sb a5, 5(a0)
...
ret