Subtracting packed 8-bit integers in an 64-bit integer by 1 in parallel, SWAR without hardware SIMD

Question

If I have a 64-bit integer that I'm interpreting as an array of packed 8-bit integers with 8 elements. I need to subtract the constant 1 from each packed integer while handling overflow without the result of one element affecting the result of another element.

I have this code at the moment and it works but I need a solution that does the subtraction of each packed 8-bit integer in parallel and doesn't make memory accesses. On x86 I could use SIMD instructions like psubb that subtracts packed 8-bit integers in parallel but the platform I'm coding for doesn't support SIMD instructions. (RISC-V in this case).

So I'm trying to do SWAR (SIMD within a register) to manually cancel out carry propagation between bytes of a uint64_t, doing something equivalent to this:

uint64_t sub(uint64_t arg) {
    uint8_t* packed = (uint8_t*) &arg;

    for (size_t i = 0; i < sizeof(uint64_t); ++i) {
        packed[i] -= 1;
    }

    return arg;
}

I think you could do this with bitwise operators but I'm not sure. I'm looking for a solution that doesn't use SIMD instructions. I'm looking for a solution in C or C++ that's quite portable or just the theory behind it so I can implement my own solution.

Would this be any faster than a bunch of subtraction operations the compiler could optimize for you instead? Most CPUs can do more than one subtraction per clock per thread even without SIMD. Why not benchmark the naive approach vs. your SIMD one to see what kind of performance you get, and to see if it's acceptable. Be sure to fully optimize when compiling. — tadman, Jan 08 '20 at 00:24
Truth be told, without SIMD, the answer you have gets unrolled and doesn't involve any branching. It would probably execute pretty well on an out-of-order processor, better than the top answer by @n314159 (which I upvoted, because it's neat and exactly what you asked for): https://godbolt.org/z/BP3bvc — parktomatomi, Jan 08 '20 at 00:36
@parktomatomi Yes, I also thought about that. As I added in my answer, one could let the check on `mask_cp` not being 0 be. Then there would be no branching and better loop unrolling. — n314159, Jan 08 '20 at 00:43
Techniques for this sort of thing are called [SWAR](https://www.chessprogramming.org/SIMD_and_SWAR_Techniques) — harold, Jan 08 '20 at 00:52
related: [Compare 64-bit integers by segments](https://stackoverflow.com/q/50663315/995714) — phuclv, Jan 08 '20 at 01:47

score 77 · Accepted Answer · edited Jan 08 '20 at 23:13

If you have a CPU with efficient SIMD instructions, SSE/MMX paddb (_mm_add_epi8) is also viable. Peter Cordes' answer also describes GNU C (gcc/clang) vector syntax, and safety for strict-aliasing UB. I strongly encourage reviewing that answer as well.

Doing it yourself with uint64_t is fully portable, but still requires care to avoid alignment problems and strict-aliasing UB when accessing a uint8_t array with a uint64_t*. You left that part out of the question by starting with your data in a uint64_t already, but for GNU C a may_alias typedef solves the problem (see Peter's answer for that or memcpy).

Otherwise you could allocate / declare your data as uint64_t and access it via uint8_t* when you want individual bytes. unsigned char* is allowed to alias anything so that sidesteps the problem for the specific case of 8-bit elements. (If uint8_t exists at all, it's probably safe to assume it's an unsigned char.)

Note that this is a change from a prior incorrect algorithm (see revision history).

This is possible without looping for arbitrary subtraction, and gets more efficient for a known constant like 1 in each byte. The main trick is to prevent carry-out from each byte by setting the high bit, then correct the subtraction result.

We are going to slightly optimize the subtraction technique given here. They define:

SWAR sub z = x - y
    z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)

with H defined as 0x8080808080808080U (i.e. the MSBs of each packed integer). For a decrement, y is 0x0101010101010101U.

We know that y has all of its MSBs clear, so we can skip one of the mask steps (i.e. y & ~H is the same as y in our case). The calculation proceeds as follows:

We set the MSBs of each component of x to 1, so that a borrow cannot propagate past the MSB to the next component. Call this the adjusted input.
We subtract 1 from each component, by subtracting 0x01010101010101 from the corrected input. This does not cause inter-component borrows thanks to step 1. Call this the adjusted output.
We need to now correct the MSB of the result. We xor the adjusted output with the inverted MSBs of the original input to finish fixing up the result.

The operation can be written as:

#define U64MASK 0x0101010101010101U
#define MSBON 0x8080808080808080U
uint64_t decEach(uint64_t i){
      return ((i | MSBON) - U64MASK) ^ ((i ^ MSBON) & MSBON);
}

Preferably, this is inlined by the compiler (use compiler directives to force this), or the expression is written inline as part of another function.

Testcases:

in:  0000000000000000
out: ffffffffffffffff

in:  f200000015000013
out: f1ffffff14ffff12

in:  0000000000000100
out: ffffffffffff00ff

in:  808080807f7f7f7f
out: 7f7f7f7f7e7e7e7e

in:  0101010101010101
out: 0000000000000000

Performance details

Here's the x86_64 assembly for a single invocation of the function. For better performance it should be inlined with the hope that the constants can live in a register as long as possible. In a tight loop where the constants live in a register, the actual decrement takes five instructions: or+not+and+add+xor after optimization. I don't see alternatives that would beat the compiler's optimization.

uint64t[rax] decEach(rcx):
    movabs  rcx, -9187201950435737472
    mov     rdx, rdi
    or      rdx, rcx
    movabs  rax, -72340172838076673
    add     rax, rdx
    and     rdi, rcx
    xor     rdi, rcx
    xor     rax, rdi
    ret

With some IACA testing of the following snippet:

// Repeat the SWAR dec in a loop as a microbenchmark
uint64_t perftest(uint64_t dummyArg){
    uint64_t dummyCounter = 0;
    uint64_t i = 0x74656a6d27080100U; // another dummy value.
    while(i ^ dummyArg) {
        IACA_START
        uint64_t naive = i - U64MASK;
        i = naive + ((i ^ naive ^ U64MASK) & U64MASK);
        dummyCounter++;
    }
    IACA_END
    return dummyCounter;
}

we can show that on a Skylake machine, performing the decrement, xor, and compare+jump can be performed at just under 5 cycles per iteration:

Throughput Analysis Report
--------------------------
Block Throughput: 4.96 Cycles       Throughput Bottleneck: Backend
Loop Count:  26
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.5     0.0  |  1.5  |  0.0     0.0  |  0.0     0.0  |  0.0  |  1.5  |  1.5  |  0.0  |
--------------------------------------------------------------------------------------------------

(Of course, on x86-64 you'd just load or movq into an XMM reg for paddb, so it might be more interesting to look at how it compiles for an ISA like RISC-V.)

I just tested it myself and I'm satisfied. I think I was a bit unclear in my question but this sort of the solution I was looking for, it just uses a few simple operators and runs in constant time. — cam-white, Jan 08 '20 at 00:46
@cam-white As a follow-up: do you have MMX support on your target machine? MMX has [`paddb`](https://www.felixcloutier.com/x86/paddb:paddw:paddd:paddq); not sure of its performance but it could compare well. — nanofarad, Jan 08 '20 at 01:18
I need my code to run on RISC-V machines which doesn't have SIMD instructions (yet) let alone support for MMX — cam-white, Jan 08 '20 at 02:06
@cam-white Got it--this is probably the best you can do then. I'll hop onto godbolt to sanity check the assembly for RISC as well. Edit: No RISC-V support on godbolt :( — nanofarad, Jan 08 '20 at 02:07
RISC-V hasn't really taken off yet and there isn't really that much support for it but I'm sure your solution is fast, this is probably the fastest you could get on any machine without SIMD instructions — cam-white, Jan 08 '20 at 02:19
There is RISC-V support on godbolt actually, for example like [this](https://gcc.godbolt.org/z/FfQpf-) (E: seems that the compiler gets overly creative in creating the mask..) — harold, Jan 08 '20 at 03:12
@harold That's good to hear, but sadly the clang asm is a bit unfortunate... I'll play with it to see how I can optimize or guide the compiler into emitting the correct code; as a fallback I assume inline assembly might be the way to go. It looks like risc-v gcc seems to [emit correct code](https://gcc.godbolt.org/z/55iW-C). — nanofarad, Jan 08 '20 at 03:29
Further reading on how the parity (also called "carry-out vector") trick can be used in various situations: http://www.emulators.com/docs/LazyOverflowDetect_Final.pdf — jpa, Jan 08 '20 at 08:51
This doesn't seem to work; for 0x0000000000000100 we get 0x00000000000000ff instead of 0xffffffffffff00ff. — Falk Hüffner, Jan 08 '20 at 16:27
@ReinstateMonica-ζ-- there's also GCC RISC-V https://gcc.godbolt.org/z/Ty-M8f on godbolt. There's a godbolt fork for RISC-V on https://cx.rv8.io/ but looks like it's gone down — phuclv, Jan 08 '20 at 17:14
@cam-white: for subtracting 1 specifically, you don't need the general case of carry propagation. [Falk Hüffner's answer](https://stackoverflow.com/a/59650412/224132) is vastly more efficient (no looping, still only a handful of operations other than constructing the constants). — Peter Cordes, Jan 08 '20 at 18:39
@PeterCordes There's no looping in the actual answer presented. The loop is only used for static bytecode analysis since the tool I use for analyzing latency (IACA) only does so well when a tight loop is considered. — nanofarad, Jan 08 '20 at 18:40
@ReinstateMonica-ζ--: Ah I see now. I made an edit so future readers skimming like I was will avoid the same mistake. I was `while(i ^ mask) {` and assumed it was a manual carry-propagation loop; especially since you manually inlined `decEach` instead of just calling it with `i = decEach(i)` to make the latency test more obvious. I hadn't even noticed the actual function itself; I skimmed past assuming it was still part of the explanation or something. (I wasn't sure this was possible without looping so I was kind of looking for a loop, biasing my skimming in the wrong direction :P.) — Peter Cordes, Jan 08 '20 at 18:58
Added [an answer](https://stackoverflow.com/questions/59637731/subtracting-packed-8-bit-integers-in-an-64-bit-integer-by-1-in-parallel-swar-wi/59654202#59654202) showing GCC's native-vector SWAR for RISC-V. IDK if you want to link to that in your answer (and maybe remove some of the stuff I added to your intro paragraph, since it's now part of my answer). — Peter Cordes, Jan 08 '20 at 21:44
@PeterCordes Updated. Please feel free to edit further or let me know if anything else needs to be changed. — nanofarad, Jan 08 '20 at 22:55
I made another edit; GNU C native vectors actually *avoid* strict-aliasing problems; a vector-of-`uint8_t` is allowed to alias `uint8_t` data. Callers of your function (that need to get `uint8_t` data into a `uint64_t`) are the ones that have to worry about strict-aliasing! So probably the OP should just declare / allocate arrays as `uint64_t` because `char*` is allowed to alias anything in ISO C++, but not vice versa. — Peter Cordes, Jan 08 '20 at 23:15

Peter Cordes · Answer 2 · 2020-01-08T23:20:44.973

For RISC-V you're probably using GCC/clang.

Fun fact: GCC knows some of these SWAR bithack tricks (shown in other answers) and can use them for you when compiling code with GNU C native vectors for targets without hardware SIMD instructions. (But clang for RISC-V will just naively unroll it to scalar operations, so you do have to do it yourself if you want good performance across compilers).

One advantage to native vector syntax is that when targeting a machine with hardware SIMD, it will use that instead of auto-vectorizing your bithack or something horrible like that.

It makes it easy to write vector -= scalar operations; the syntax Just Works, implicitly broadcasting aka splatting the scalar for you.

Also note that a uint64_t* load from a uint8_t array[] is strict-aliasing UB, so be careful with that. (See also Why does glibc's strlen need to be so complicated to run quickly? re: making SWAR bithacks strict-aliasing safe in pure C). You may want something like this to declare a uint64_t that you can pointer-cast to access any other objects, like how char* works in ISO C / C++.

use these to get uint8_t data into a uint64_t for use with other answers:

// GNU C: gcc/clang/ICC but not MSVC
typedef uint64_t  aliasing_u64 __attribute__((may_alias));  // still requires alignment
typedef uint64_t  aliasing_unaligned_u64 __attribute__((may_alias, aligned(1)));

The other way to do aliasing-safe loads is with memcpy into a uint64_t, which also removes the alignof(uint64_t) alignment requirement. But on ISAs without efficient unaligned loads, gcc/clang don't inline and optimize away memcpy when they can't prove the pointer is aligned, which would be disastrous for performance.

TL:DR: your best bet is to declare you data as uint64_t array[...] or allocate it dynamically as uint64_t, or preferably alignas(16) uint64_t array[]; That ensures alignment to at least 8 bytes, or 16 if you specify alignas.

Since uint8_t is almost certainly unsigned char*, it's safe to access the bytes of a uint64_t via uint8_t* (but not vice versa for a uint8_t array). So for this special case where the narrow element type is unsigned char, you can sidestep the strict-aliasing problem because char is special.

GNU C native vector syntax example:

GNU C native vectors are always allowed to alias with their underlying type (e.g. int __attribute__((vector_size(16))) can safely alias int but not float or uint8_t or anything else.

#include <stdint.h>
#include <stddef.h>

// assumes array is 16-byte aligned
void dec_mem_gnu(uint8_t *array) {
    typedef uint8_t v16u8 __attribute__ ((vector_size (16), may_alias));
    v16u8 *vecs = (v16u8*) array;
    vecs[0] -= 1;
    vecs[1] -= 1;   // can be done in a loop.
}

For RISC-V without any HW SIMD, you could use vector_size(8) to express just the granularity you can efficiently use, and do twice as many smaller vectors.

But vector_size(8) compiles very stupidly for x86 with both GCC and clang: GCC uses SWAR bithacks in GP-integer registers, clang unpacks to 2-byte elements to fill a 16-byte XMM register then repacks. (MMX is so obsolete that GCC/clang don't even bother using it, at least not for x86-64.)

But with vector_size (16) (Godbolt) we get the expected movdqa / paddb. (With an all-ones vector generated by pcmpeqd same,same). With -march=skylake we still get two separate XMM ops instead of one YMM, so unfortunately current compilers also don't "auto-vectorize" vector ops into wider vectors :/

For AArch64, it's not so bad to use vector_size(8) (Godbolt); ARM/AArch64 can natively work in 8 or 16-byte chunks with d or q registers.

So you probably want vector_size(16) to actually compile with if you want portable performance across x86, RISC-V, ARM/AArch64, and POWER. However, some other ISAs do SIMD within 64-bit integer registers, like MIPS MSA I think.

vector_size(8) makes it easier to look at the asm (only one register worth of data): Godbolt compiler explorer

# GCC8.2 -O3 for RISC-V for vector_size(8) and only one vector

dec_mem_gnu(unsigned char*):
        lui     a4,%hi(.LC1)           # generate address for static constants.
        ld      a5,0(a0)                 # a5 = load from function arg
        ld      a3,%lo(.LC1)(a4)       # a3 = 0x7F7F7F7F7F7F7F7F
        lui     a2,%hi(.LC0)
        ld      a2,%lo(.LC0)(a2)       # a2 = 0x8080808080808080
                             # above here can be hoisted out of loops
        not     a4,a5                  # nx = ~x
        and     a5,a5,a3               # x &= 0x7f... clear high bit
        and     a4,a4,a2               # nx = (~x) & 0x80... inverse high bit isolated
        add     a5,a5,a3               # x += 0x7f...   (128-1)
        xor     a5,a4,a5               # x ^= nx  restore high bit or something.

        sd      a5,0(a0)               # store the result
        ret

I think it's the same basic idea as the other non-looping answers; preventing carry then fixing up the result.

This is 5 ALU instructions, worse than the top answer I think. But it looks like critical path latency is only 3 cycles, with two chains of 2 instructions each leading to the XOR. @Reinstate Monica - ζ--'s answer compiles to a 4-cycle dep chain (for x86). The 5-cycle loop throughput is bottlenecked by also including a naive sub on the critical path, and the loop does bottleneck on latency.

However, this is useless with clang. It doesn't even add and store in the same order it loaded so it's not even doing good software pipelining!

# RISC-V clang (trunk) -O3
dec_mem_gnu(unsigned char*):
        lb      a6, 7(a0)
        lb      a7, 6(a0)
        lb      t0, 5(a0)
...
        addi    t1, a5, -1
        addi    t2, a1, -1
        addi    t3, a2, -1
...
        sb      a2, 7(a0)
        sb      a1, 6(a0)
        sb      a5, 5(a0)
...
        ret

score 13 · Answer 3 · answered Jan 08 '20 at 00:19

13

I'd point out that the code you've written does actually vectorize once you start dealing with more than a single uint64_t.

https://godbolt.org/z/J9DRzd

answered Jan 08 '20 at 00:19

robthebloke

9,331
9
12

1

Could you explain or give a reference to what is happening there? It seems pretty interesting. – n314159 Jan 08 '20 at 00:31
2

I was trying to do this without SIMD instructions but I found this interesting none the less :) – cam-white Jan 08 '20 at 00:38
8

On the other hand, that SIMD code is awful. The compiler completely misunderstood what is happening here. E: it's an example of "this was clearly done by a compiler because no human would be this stupid" – harold Jan 08 '20 at 00:45
@harold: If compiler writers were to invest the time they spend trying to vectorize C code into trying to define standard ways of requesting vector operations, the result might be less "impressive" than what they come up with, but far more *useful*. – supercat Jan 08 '20 at 17:50
@supercat: you mean like GNU C's portable vector syntax? https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html already exists. Although it can compile inefficiently on targets without hardware SIMD vectors as wide as you use in the source, but it does exist. – Peter Cordes Jan 08 '20 at 18:07
1

@PeterCordes: I was thinking more along the lines of a `__vector_loop(index, start, past, pad)` construct which an implementation could treat as `for(index=start; index – supercat Jan 08 '20 at 18:58
...of the chunk may be processed or not, at the compiler's leisure. Such a construct would make it easy for compilers to find many vectorization opportunities, including many that cannot be expressed using existing C semantics. – supercat Jan 08 '20 at 19:00
@PeterCordes: Compilers spend a lot of effort trying to find ways of converting loops into sequences of short-vector operations; a looping construct like I described would allow better results to be achieved more easily, without requiring that programmers sacrifice compatibility with existing compilers. – supercat Jan 08 '20 at 19:05
@supercat: ah right, Cray-style SIMD that the compiler can implement however it wants, not modern CPU-style short-vector SIMD. Yup that would make some sense, although `int *__restrict array` can already tell the compiler that arrays don't overlap. The problem compilers run into with narrower integer types is that C defines narrow ops as promoting to `int`, and compilers sometimes make asm that *actually* does that, unpacking and repacking. Your proposal wouldn't avoid *that* pitfall, I don't think. And IDK why clang is being insane here, isolating each byte for a `paddq`. – Peter Cordes Jan 08 '20 at 19:05
1

@PeterCordes: While `restrict` is helpful (and would be more helpful if the Standard recognized a concept of "at least potentially based on", and then defined "based on" and "at least potentially based on" straightforwardly without goofy and unworkable corner cases) my proposal would also allow a compiler to perform more executions of the loop than requested--something which would greatly simplify vectorization, but for which the Standard makes no provision. – supercat Jan 08 '20 at 19:36

score 11 · Answer 4 · answered Jan 08 '20 at 16:49

11

You can make sure the subtraction doesn't overflow and then fix up the high bit:

uint64_t sub(uint64_t arg) {
    uint64_t x1 = arg | 0x80808080808080;
    uint64_t x2 = ~arg & 0x80808080808080;
    // or uint64_t x2 = arg ^ x1; to save one instruction if you don't have an andnot instruction
    return (x1 - 0x101010101010101) ^ x2;
}

answered Jan 08 '20 at 16:49

Falk Hüffner

4,942
19
25

I think it works for all 256 possible values of a byte; I put it on Godbolt (with RISC-V clang) https://godbolt.org/z/DGL9aq to look at constant-propagation results for various inputs like 0x0, 0x7f, 0x80, and 0xff (shifted into the middle of the number). Looks good. I think the top answer boils down to the same thing, but it explains it in a more complicated way. – Peter Cordes Jan 08 '20 at 18:44
Compilers could do a better job constructing constants in registers here. clang spends a lot of instructions constructing `splat(0x01)` and `splat(0x80)`, instead of getting one from the other with a shift. Even writing it that way in the source https://godbolt.org/z/6y9v-u doesn't hand-hold the compiler into making better code; it just does constant propagation. – Peter Cordes Jan 08 '20 at 18:46
I wonder why it doesn't just load the constant from memory; that's what compilers for Alpha (a similar architecture) do. – Falk Hüffner Jan 08 '20 at 19:17
GCC for RISC-V *does* load constants from memory. Looks like clang needs some tuning, unless data-cache misses are expected and are expensive compared to instruction throughput. (That balance can certainly have changed since Alpha, and presumably different implementations of RISC-V are different. Compilers could also do much better if they realized that it was a repeating pattern they could shift/OR to widen after starting with one LUI/add for 20+12 = 32 bits of immediate data. AArch64's bit-pattern immediates could even use these as immediates for AND/OR/XOR, smart decode vs. density choice) – Peter Cordes Jan 08 '20 at 19:47
Added [an answer](https://stackoverflow.com/questions/59637731/subtracting-packed-8-bit-integers-in-an-64-bit-integer-by-1-in-parallel-swar-wi/59654202#59654202) showing GCC's native-vector SWAR for RISC-V – Peter Cordes Jan 08 '20 at 21:43

n314159 · Answer 5 · 2020-01-08T00:41:23.110

Not sure if this is what you want but it does the 8 subtractions in parallel to each other:

#include <cstdint>

constexpr uint64_t mask = 0x0101010101010101;

uint64_t sub(uint64_t arg) {
    uint64_t mask_cp = mask;
    for(auto i = 0; i < 8 && mask_cp; ++i) {
        uint64_t new_mask = (arg & mask_cp) ^ mask_cp;
        arg = arg ^ mask_cp;
        mask_cp = new_mask << 1;
    }
    return arg;
}

Explanation: The bitmask starts with a 1 in each of the 8-bit numbers. We xor it with our argument. If we had a 1 in this place, we subtracted 1 and have to stop. This is done by setting the corresponding bit to 0 in new_mask. If we had a 0, we set it to 1 and have to do the carry, so the bit stays 1 and we shift the mask to the left. You better check for yourself if the generation of the new mask works as intended, I think so, but a second opinion would not be bad.

PS: I am actually unsure if the check on mask_cp being not null in the loop may slows the program down. Without it, the code would still be correct (since the 0 mask just does nothing) and it would be much easier for the compiler to do loop unrolling.

`for` won't run in parallel, are you confused with `for_each`? — LTPCGO, Jan 08 '20 at 00:35
@LTPCGO No, it is not my intention to parallelize this for loop, this would actually break the algorithm. But this code works on the different 8bit integers in the 64bit integer in parallel, i.e. all 8 subtractions are done simultaneously but they need up to 8 steps. — n314159, Jan 08 '20 at 00:38
I realise what I was asking might have been a bit unreasonable but this was pretty close to what I needed thanks :) — cam-white, Jan 08 '20 at 00:40

score 4 · Answer 6 · answered Jan 08 '20 at 00:14

4

int subtractone(int x) 
{
    int f = 1; 

    // Flip all the set bits until we find a 1 at position y
    while (!(x & f)) { 
        x = x^f; 
        f <<= 1; 
    } 

    return x^f; // return answer but remember to flip the 1 at y
}

You can do it with bitwise operations using the above, and you just have to divide your integer into 8 bit pieces to send 8 times into this function. The following part was taken from How to split a 64-bit number into eight 8-bit values? with me adding in the above function

uint64_t v= _64bitVariable;
uint8_t i=0,parts[8]={0};
do parts[i++] = subtractone(v&0xFF); while (v>>=8);

It is valid C or C++ regardless of how someone comes across this

answered Jan 08 '20 at 00:14

LTPCGO

448
1
4
15

5

This doesn't parallelize the work though, which is OP's question. – nickelpro Jan 08 '20 at 00:16
Yeah @nickelpro is right, this would do each subtraction one after the other, I would like to subtract all 8-bit integers at the same time. I do appreciate the answer tho thanks bro – cam-white Jan 08 '20 at 00:19
2

@nickelpro when I started the answer the edit hadn't been made *which stated the parallel part of the question* and so I didn't notice it until after submission, will leave up in case it is useful for others as it at least answers the part to do bitwise operations and it could be made to work in parallel by utilising `for_each(std::execution::par_unseq,...` instead of whiles – LTPCGO Jan 08 '20 at 00:26
2

It's my bad, I submitted the question then realised I didn't say it needed to be in parallel so edited – cam-white Jan 08 '20 at 00:31

score 2 · Answer 7 · answered Jan 09 '20 at 22:44

2

Not going to try to come up with the code, but for a decrement by 1 you could decrement by the group of 8 1s and then check to be sure that the LSBs of the results had "flipped". Any LSB that hasn't toggled indicates that a carry occurred from the adjacent 8 bits. It should be possible to work out a sequence of ANDs/ORs/XORs to handle this, without any branches.

answered Jan 09 '20 at 22:44

Hot Licks

47,103
17
93
151

That might work, but consider the case where a carry propagates all the way through one group of 8 bits and into another. The strategy in the good answers (of setting the MSB or something first) to ensure carry doesn't propagate is probably at least as efficient as this could be. The current target to beat (i.e. the good non-looping branchless answers) is 5 RISC-V asm ALU instructions with instruction-level parallelism making the critical path only 3 cycles, and using two 64-bit constants. – Peter Cordes Jan 09 '20 at 23:11

score 0 · Answer 8 · 2020-01-08T23:56:15.247

0

Focus work on each byte fully alone, then put it back where it was.

uint64_t sub(uint64_t arg) {
   uint64_t res = 0;

   for (int i = 0; i < 64; i+=8) 
     res += ((arg >> i) - 1 & 0xFFU) << i;

    return res;
   }

edited Jan 08 '20 at 23:56

answered Jan 08 '20 at 03:12

Subtracting packed 8-bit integers in an 64-bit integer by 1 in parallel, SWAR without hardware SIMD

8 Answers8

The operation can be written as:

Testcases:

Performance details

GNU C native vector syntax example: