What is the fastest way to byte swap a 64 bit value in memory?
The mov/bswap/mov version and the movbe/mov version are about the same on most Intel processors. Based on the µop count, it seems movbe decodes to mov + bswap, except on Atom. For Ryzen, movbe may be better. Manually swapping the bytes around is much slower, except in certain edge cases where a large load/store is very slow, such as when it crosses a 4K boundary pre-Skylake.
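As a point of reference, here is a minimal C sketch of that load/swap/store pattern (the function name is mine, and __builtin_bswap64 assumes GCC or Clang):

#include <stdint.h>
#include <string.h>

// Copies one 64-bit value while reversing its bytes. With optimization on,
// GCC/Clang typically emit mov/bswap/mov here, and can fold the swap into
// movbe when targeting a CPU that has it (e.g. -march=haswell).
void bswap64_copy(uint64_t* dst, const uint64_t* src)
{
    uint64_t v;
    memcpy(&v, src, sizeof v);   // plain 64-bit load
    v = __builtin_bswap64(v);    // bswap (or folded into movbe)
    memcpy(dst, &v, sizeof v);   // plain 64-bit store
}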
pshufb is a reasonable option even to replace a single bswap, though that wastes half of the work the shuffle could do.
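A sketch of that single-value pshufb variant (my own code, SSSE3 and x86-64 assumed); only the low 8 bytes of the shuffle do useful work:

#include <stdint.h>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// Byte-swaps one 64-bit value via pshufb. The upper half of the 16-byte
// shuffle is unused here, which is the wasted half of the work mentioned above.
uint64_t bswap64_pshufb(uint64_t v)
{
    const __m128i rev = _mm_setr_epi8(7, 6, 5, 4, 3, 2, 1, 0,
                                      15, 14, 13, 12, 11, 10, 9, 8);
    __m128i x = _mm_cvtsi64_si128((long long)v);
    return (uint64_t)_mm_cvtsi128_si64(_mm_shuffle_epi8(x, rev));
}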
PS: My real situation is a bit more complicated. I do have a large byte array, but it contains differently sized integers, all densely packed.
In this general case, with sizes taken dynamically from another data stream, a new big issue is branching on the size. Even in scalar code that can be avoided, by byte-reversing a 64-bit block, shifting it right by 8 - size bytes, then merging it with the un-reversed bytes, and advancing by size. That could be worked out, but it's a waste of time to try that; the SIMD version will be better.
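For completeness, a rough sketch of that scalar idea (the helper name is mine; it reads and writes a full 8 bytes per element, so the buffer needs some slack at the end):

#include <stdint.h>
#include <string.h>

// Reverses one element of `size` bytes (1, 2, 4 or 8) in place without
// branching on the size: byte-reverse a whole 64-bit block, realign the
// element, merge the untouched tail bytes back in; the caller advances by size.
static void reverse_element(unsigned char* p, unsigned size)
{
    uint64_t block;
    memcpy(&block, p, 8);                                         // read 8 bytes regardless of element size
    uint64_t rev  = __builtin_bswap64(block) >> (8 * (8 - size)); // reversed element, back in the low bytes
    uint64_t keep = ~0ull >> (8 * (8 - size));                    // mask covering the element's bytes
    uint64_t out  = (rev & keep) | (block & ~keep);               // merge with the un-reversed tail
    memcpy(p, &out, 8);                                           // write all 8 bytes back
}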
A SIMD version could use pshufb and a table of shuffle masks indexed by a "size pattern", for example an 8-bit integer where every 2 bits indicates the size of an element. pshufb then reverses the elements that are wholly contained in the 16-byte window that it's looking at, and leaves the rest alone (those unchanged bytes at the tail will be written back too, but that's OK). Then we advance by the number of bytes that was actually processed.
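To illustrate, here is one way such a table could be generated (my own assumptions: each 2-bit field holds log2 of its element's size, lowest field first, and whatever produces the pattern stream re-issues any element that did not fit in the window):

#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

// For each of the 256 size patterns, build a pshufb mask that byte-reverses
// the elements that fit entirely in the 16-byte window, plus the number of
// bytes those elements cover (the amount to advance by).
static void buildTables(__m128i masks[256], uint32_t lengths[256])
{
    for (int pattern = 0; pattern < 256; pattern++) {
        uint8_t m[16];
        for (int k = 0; k < 16; k++)
            m[k] = (uint8_t)k;                              // identity: leave bytes alone
        int pos = 0;
        for (int e = 0; e < 4; e++) {
            int size = 1 << ((pattern >> (2 * e)) & 3);     // 1, 2, 4 or 8 bytes
            if (pos + size > 16)
                break;                                      // element not wholly in the window
            for (int b = 0; b < size; b++)
                m[pos + b] = (uint8_t)(pos + size - 1 - b); // reverse this element's bytes
            pos += size;
        }
        lengths[pattern] = (uint32_t)pos;                   // bytes actually processed
        memcpy(&masks[pattern], m, 16);
    }
}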
For maximum convenience, those size patterns (as well as the corresponding byte counts) should be supplied in such a way that the actual Endianness Flipper itself can consume exactly one of them per iteration, without anything heavy such as extracting a byte-unaligned sequence of 8 bits and determining dynamically how many bits to consume. That's also possible, but at a significantly higher cost: about 4x as slow in my test, limited by the loop-carried dependency from "extract 8 bits at the current bit-index" through "find the bit-index increment by table lookup" and into the next iteration. That came to about 16 cycles per iteration, though still only 60% of the time that equivalent scalar code took.
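For the record, the loop-carried chain of that slower variant looks roughly like this (a sketch; the packed stream layout and the bitsConsumed table are my assumptions):

#include <stdint.h>
#include <string.h>

// One step of the "inconvenient input" variant: the size patterns are a packed
// bit stream, so each iteration extracts 8 bits at a byte-unaligned bit index
// and only afterwards learns how many of those bits it actually consumed.
static unsigned nextPattern(const uint8_t* packedSizes, size_t* bitIndex,
                            const uint8_t bitsConsumed[256])
{
    uint64_t window;
    memcpy(&window, &packedSizes[*bitIndex >> 3], sizeof window);    // unaligned 8-byte read
    unsigned pattern = (unsigned)(window >> (*bitIndex & 7)) & 0xFF; // extract 8 bits at the bit index
    *bitIndex += bitsConsumed[pattern];  // this lookup feeds the next extract: the loop-carried dependency
    return pattern;
}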
Using an unpacked representation (1 byte per size) would make the extraction easier (just an unaligned dword load), but requires packing the result in order to index the shuffle-mask table, for example with pext. That would be reasonable for Intel CPUs, but pext is extremely slow on AMD Ryzen. An alternative that is fine for both AMD and Intel would be to do the unaligned dword read, then extract the 8 interesting bits using a multiply/shift trick:
mov eax, [rdi]
imul eax, eax, 0x01041040
shr eax, 24
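In C, that packing step could look roughly like this (assuming each size byte stores its 2-bit code, 0 to 3 for 1/2/4/8 bytes, in the low bits):

#include <stdint.h>
#include <string.h>

// Packs four 2-bit size codes (one per byte) into one 8-bit pattern.
// The multiply places code i's two bits at bit 24 + 2*i of the 32-bit
// product, so shifting right by 24 leaves the packed pattern in the low byte.
static uint8_t packSizePattern(const uint8_t* sizeBytes)
{
    uint32_t dw;
    memcpy(&dw, sizeBytes, sizeof dw);          // unaligned dword load
    return (uint8_t)((dw * 0x01041040u) >> 24); // same as imul + shr above
}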
An extra trick that should be used, at least in the Convenient Input case (otherwise we're stuck with 5x worse performance anyway and this trick won't be relevant), is reading the data for the next iteration before storing the result of the current iteration. Without that trick, the store will often "step on the toes" of the load of the next iteration (because we advance by less than 16 bytes, so the load reads some of the bytes that the store left unchanged but had to write anyway), forcing a memory dependency between them which holds up the next iteration. The performance difference is large, about 3x.
Then the Endianness Flipper could look something like this:
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

void flipEndiannessSSSE3(char* buffer, size_t totalLength,
                         uint8_t* sizePatterns, uint32_t* lengths, __m128i* masks)
{
    size_t i = 0;
    size_t j = 0;
    __m128i data = _mm_loadu_si128((__m128i*)buffer);
    while (i < totalLength) {
        int sizepattern = sizePatterns[j];
        __m128i permuted = _mm_shuffle_epi8(data, masks[sizepattern]);
        size_t next_i = i + lengths[j++];
        // Load the next window *before* storing the current result, so the
        // store does not feed a memory dependency into the next iteration.
        data = _mm_loadu_si128((__m128i*)&buffer[next_i]);
        _mm_storeu_si128((__m128i*)&buffer[i], permuted);
        i = next_i;
    }
}
For example, Clang 10 with -O3 -march=haswell turns that into:
test rsi, rsi
je .LBB0_3
vmovdqu xmm0, xmmword ptr [rdi]
xor r9d, r9d
xor r10d, r10d
.LBB0_2: # =>This Inner Loop Header: Depth=1
movzx eax, byte ptr [rdx + r10]
shl rax, 4
vpshufb xmm1, xmm0, xmmword ptr [r8 + rax]
mov eax, dword ptr [rcx + 4*r10]
inc r10
add rax, r9
vmovdqu xmm0, xmmword ptr [rdi + rax]
vmovdqu xmmword ptr [rdi + r9], xmm1
mov r9, rax
cmp rax, rsi
jb .LBB0_2
.LBB0_3:
ret
LLVM-MCA thinks that takes about 3.3 cycles per iteration; on my PC (a 4770K, tested with a uniform mix of 1, 2, 4 and 8 byte sized elements) it was a little slower, closer to 3.7 cycles per iteration, but that's still good: just under 1.2 cycles per element.