What's the most efficient way to swap 4 16-bit integers on a 64-bit processor?

Question

I have four uint16 named a,b,c,d respectively, and now I'd like to swap them like this:

void swap4(uint16_t &a, uint16_t &b, uint16_t &c, uint16_t &d) {
    uint16_t temp = a;
    a = b;
    b = c;
    c = d;
    d = temp;
}

Is there anything I can do to speed up this procedure?

Why do you think this isn't quick enough? Did you profile your code? What was the result? — dbush, Aug 23 '22 at 17:49
Before you spend too much time on it, make sure that it's not fast enough already. It's amazing what a good optimizing compiler can do to a simple, and easy-to-understand function. Often the simpler it is, the better a job the compiler does because even the compiler benefits from easy-to-read code. — user4581301, Aug 23 '22 at 17:50
If you are looking for an assembly solution, what architecture are you programming for? — fuz, Aug 23 '22 at 18:45
each "64 bit processor" architecture is different and no reason to expect one high level language solution to fit all optimizers. if you have a performance problem, then confirm it is a performance problem and not prematurely optimize, but once it is identified then hand optimize for the target, using profiling of the solutions. — old_timer, Sep 29 '22 at 13:55

273K · Answer 1 · 2022-08-23T18:04:44.603

3

In C++ this is

void swap4(uint16_t &a, uint16_t &b, uint16_t &c, uint16_t &d) {
    std::tie(a, b, c, d) = std::make_tuple(b, c, d, a);
}

edited Aug 23 '22 at 18:04

answered Aug 23 '22 at 17:53

273K

29,503
10
41
64

3

I doubt it is any more optimal, but it sure looks nicer. As an aside, thanks to CTAD `std::tuple()` would also work instead of `std::make_tuple()`. – Deduplicator Aug 23 '22 at 18:17
@Deduplicator If you doubt, look at the assmbler https://godbolt.org/z/doPrMr53r – 273K Aug 23 '22 at 18:25
2

The code has slightly different semantics if there is aliasing. Without, as this does all loads first is won't (or not so much) stall for memory-access. – Deduplicator Aug 23 '22 at 18:41
*if there is aliasing* This is up to a caller, not mentioned in the question. As for me, when it loads all then assigns looks safer. – 273K Aug 23 '22 at 18:45
2

Yep, the contract is up to OP, and he probably over-specified. Interestingly, the extension `__restrict__` [doesn't help](https://godbolt.org/z/hdv3cxh1W), which is definitely a performance bug. – Deduplicator Aug 23 '22 at 18:45
2

Not sure if `__restrict__` can work with references, maybe pointers would be optimized better. – 273K Aug 23 '22 at 18:47
Well, references are just non-nullable pointers for the optimizer, and for that extension. Anyway, I think we squeezed the maximum fun out of this by now. – Deduplicator Aug 23 '22 at 18:48

Deduplicator · Answer 2 · 2022-08-23T18:51:24.833

Well, the code as-is should be compiled to the optimal sequence of assembly instructions, minimal optimization level assumed. That is, if aliasing is allowed, otherwise see 273k who loads all values before than writing them all.

But it is so short, and has so much indirection, that that isn't where all the optimization potential is anyway.

Inlining and optimizing the resulting bigger chunk has a far bigger impact. Also, it will allow the compiler to see whether aliasing happens.

Good new is, if you allow the compiler, it will get that done, be it thanks to lto, whole-program-optimization, or the like (so depends on calling the compiler right), or the definition being in the same TU as the call (which might call for putting it as an inline-function in the header, or marking it static).

In general I agree with the spirit of this answer, though I don't like "Just believe in the compiler" as far as answers go when discussing efficiency and code-generation. At least, not without footnotes on _which_ compilers/architectures/optimizations, and preferably with actual data. There are a lot more compilers out there than just Clang, GCC, or MSVC -- all of which have weird quirks and idiosyncrasies. — Human-Compiler, Aug 23 '22 at 19:44

spillner · Answer 3 · 2022-08-23T19:24:02.210

2

As noted, first make sure this is really a bottleneck: most compilers should generate efficient code for this (unless there's a possibility of aliasing between the arguments).

If it happens that these 16-bit values are stored contiguously in memory (e.g. this is a four-element vector), then (a) make sure they're aligned on the right boundary! and (b) you can use your CPU's shuffle instruction, which is an optimization that your compiler might or might not recognize on its own. Before you go any further, check your compiler's assembly output; modern GCC with -O2 does in fact automatically recognize this simplification (https://godbolt.org/z/qo1jxnbds).

If you really want to hand-roll, GCC provides a portable __builtin_shuffle macro for this; for your use case you could write

typedef uint16_t quadword __attribute__ ((vector_size (8)));
quadword input = {a, b, c, d};
const quadword rotate_mask = {1, 2, 3, 0};
quadword output = __builtin_shuffle (input, rotate_mask);

(You probably don't want to write exactly that, but to recast your data as an array of those quadword types--- see the Compiler Explorer link above for an example.)

For x86 the underlying instruction generated by this macro is pshufb/pshufw, which (if you're not on GCC, or don't want to be portable) you could access with the _mm_shuffle_pi16 intrinsic (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=shuffle&techs=MMX,SSE&ig_expand=6426). Every modern RISC architecture offers something similar.

edited Aug 23 '22 at 19:24

answered Aug 23 '22 at 19:03

spillner

340
1
5

From Compiler Explorer, it looks like x86 GCC didn't implement __builtin_shuffle well until version 11, and didn't automatically convert loops not using the builtin to pshuflw instructions until version 12. – spillner Aug 23 '22 at 19:32
It seems to still optimize correctly when using `-O3`, at least as early as GCC 6.1. Interestingly enough, some of the older GCC compilers generate larger assembly for the intrinsic than letting GCC detect the optimization on its own: https://godbolt.org/z/5ch99hneq – Human-Compiler Aug 23 '22 at 19:37
If they're contiguous, you don't need SIMD, just `ror qword ptr [rdi], 16`. That could be expressed in C with `memcpy` into a `uint64_t` and [Best practices for circular shift (rotate) operations in C++](https://stackoverflow.com/q/776508). Of course that might force the compiler into actually storing / reloading if you had 4 variables in a loop that the compiler could others be keeping in registers, maybe just unrolling to make the swapping free, or shorten the latency chain to just a mov each. – Peter Cordes Aug 24 '22 at 05:03
@PeterCordes `ror` would of course be x86 architecture. We still have not heard back, which one the OP uses. If the compiler itself creates it from `memcpy` all the better. – Sebastian Aug 24 '22 at 06:08
1

@Sebastian: Yes, of course; most architectures have an integer rotate instruction. (MIPS and RISC-V without extension-B being notable exceptions). This answer's Godbolt link also happened to target x86, so it was a reasonable choice for my example. – Peter Cordes Aug 24 '22 at 06:32

What's the most efficient way to swap 4 16-bit integers on a 64-bit processor?

3 Answers3