What is the fastest way to byte swap a 64 bit value in memory?
The mov/bswap/mov version and the movbe/mov version are about the same on most Intel processors. Based on the µop count, it seems movbe decodes to mov + bswap, except on Atom. For Ryzen, movbe may be better. Manually swapping the bytes around is much slower, except in certain edge cases where a large load/store is very slow, such as when it crosses a 4K boundary pre-Skylake.
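As a point of reference, here is a minimal C sketch of that load/swap/store pattern (the function name is mine, and __builtin_bswap64 assumes GCC or Clang):

#include <stdint.h>
#include <string.h>

// Copies one 64-bit value while reversing its bytes. With optimization on,
// GCC/Clang typically emit mov/bswap/mov here, and can fold the swap into
// movbe when targeting a CPU that has it (e.g. -march=haswell).
void bswap64_copy(uint64_t* dst, const uint64_t* src)
{
    uint64_t v;
    memcpy(&v, src, sizeof v);   // plain 64-bit load
    v = __builtin_bswap64(v);    // bswap (or folded into movbe)
    memcpy(dst, &v, sizeof v);   // plain 64-bit store
}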
pshufb is a reasonable option even to replace a single bswap, though that wastes half of the work the shuffle could do.
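A sketch of that single-value pshufb variant (my own code, SSSE3 and x86-64 assumed); only the low 8 bytes of the shuffle do useful work:

#include <stdint.h>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// Byte-swaps one 64-bit value via pshufb. The upper half of the 16-byte
// shuffle is unused here, which is the wasted half of the work mentioned above.
uint64_t bswap64_pshufb(uint64_t v)
{
    const __m128i rev = _mm_setr_epi8(7, 6, 5, 4, 3, 2, 1, 0,
                                      15, 14, 13, 12, 11, 10, 9, 8);
    __m128i x = _mm_cvtsi64_si128((long long)v);
    return (uint64_t)_mm_cvtsi128_si64(_mm_shuffle_epi8(x, rev));
}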
PS: My real situation is a bit more complicated. I do have a large byte array, but it contains differently sized integers, all densely packed.
In this general case, with sizes taken dynamically from another data stream, a new big issue is branching on the size. Even in scalar code that can be avoided, by byte-reversing a 64-bit block, shifting it right by 8 - size bytes, then merging it with the un-reversed bytes, and advancing by size. That could be worked out, but it's a waste of time to try that; the SIMD version will be better.
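For completeness, a rough sketch of that scalar idea (the helper name is mine; it reads and writes a full 8 bytes per element, so the buffer needs some slack at the end):

#include <stdint.h>
#include <string.h>

// Reverses one element of `size` bytes (1, 2, 4 or 8) in place without
// branching on the size: byte-reverse a whole 64-bit block, realign the
// element, merge the untouched tail bytes back in; the caller advances by size.
static void reverse_element(unsigned char* p, unsigned size)
{
    uint64_t block;
    memcpy(&block, p, 8);                                         // read 8 bytes regardless of element size
    uint64_t rev  = __builtin_bswap64(block) >> (8 * (8 - size)); // reversed element, back in the low bytes
    uint64_t keep = ~0ull >> (8 * (8 - size));                    // mask covering the element's bytes
    uint64_t out  = (rev & keep) | (block & ~keep);               // merge with the un-reversed tail
    memcpy(p, &out, 8);                                           // write all 8 bytes back
}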
A SIMD version could use pshufb and a table of shuffle masks indexed by a "size pattern", for example an 8-bit integer where every 2 bits indicates the size of an element. pshufb then reverses the elements that are wholly contained in the 16-byte window that it's looking at, and leaves the rest alone (those unchanged bytes at the tail will be written back too, but that's OK). Then we advance by the number of bytes that was actually processed.
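To illustrate, here is one way such a table could be generated (my own assumptions: each 2-bit field holds log2 of its element's size, lowest field first, and whatever produces the pattern stream re-issues any element that did not fit in the window):

#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

// For each of the 256 size patterns, build a pshufb mask that byte-reverses
// the elements that fit entirely in the 16-byte window, plus the number of
// bytes those elements cover (the amount to advance by).
static void buildTables(__m128i masks[256], uint32_t lengths[256])
{
    for (int pattern = 0; pattern < 256; pattern++) {
        uint8_t m[16];
        for (int k = 0; k < 16; k++)
            m[k] = (uint8_t)k;                              // identity: leave bytes alone
        int pos = 0;
        for (int e = 0; e < 4; e++) {
            int size = 1 << ((pattern >> (2 * e)) & 3);     // 1, 2, 4 or 8 bytes
            if (pos + size > 16)
                break;                                      // element not wholly in the window
            for (int b = 0; b < size; b++)
                m[pos + b] = (uint8_t)(pos + size - 1 - b); // reverse this element's bytes
            pos += size;
        }
        lengths[pattern] = (uint32_t)pos;                   // bytes actually processed
        memcpy(&masks[pattern], m, 16);
    }
}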
For maximum convenience, those size patterns (as well as the corresponding byte counts) should be supplied in such a way that the actual Endianness Flipper itself can consume exactly one of them per iteration, without anything heavy such as extracting a byte-unaligned sequence of 8 bits and determining dynamically how many bits to consume. That's also possible, but at a significantly higher cost: about 4x as slow in my test, limited by the loop-carried dependency from "extract 8 bits at the current bit-index" through "find the bit-index increment by table lookup" and into the next iteration. That came to about 16 cycles per iteration, though still only 60% of the time that equivalent scalar code took.
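For the record, the loop-carried chain of that slower variant looks roughly like this (a sketch; the packed stream layout and the bitsConsumed table are my assumptions):

#include <stdint.h>
#include <string.h>

// One step of the "inconvenient input" variant: the size patterns are a packed
// bit stream, so each iteration extracts 8 bits at a byte-unaligned bit index
// and only afterwards learns how many of those bits it actually consumed.
static unsigned nextPattern(const uint8_t* packedSizes, size_t* bitIndex,
                            const uint8_t bitsConsumed[256])
{
    uint64_t window;
    memcpy(&window, &packedSizes[*bitIndex >> 3], sizeof window);    // unaligned 8-byte read
    unsigned pattern = (unsigned)(window >> (*bitIndex & 7)) & 0xFF; // extract 8 bits at the bit index
    *bitIndex += bitsConsumed[pattern];  // this lookup feeds the next extract: the loop-carried dependency
    return pattern;
}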
Using an unpacked representation (1 byte per size) would make the extraction easier (just an unaligned dword load), but requires packing the result in order to index the shuffle-mask table, for example with pext. That would be reasonable for Intel CPUs, but pext is extremely slow on AMD Ryzen. An alternative that is fine for both AMD and Intel would be to do the unaligned dword read, then extract the 8 interesting bits using a multiply/shift trick:
mov eax, [rdi]
imul eax, eax, 0x01041040
shr eax, 24
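In C, that packing step could look roughly like this (assuming each size byte stores its 2-bit code, 0 to 3 for 1/2/4/8 bytes, in the low bits):

#include <stdint.h>
#include <string.h>

// Packs four 2-bit size codes (one per byte) into one 8-bit pattern.
// The multiply places code i's two bits at bit 24 + 2*i of the 32-bit
// product, so shifting right by 24 leaves the packed pattern in the low byte.
static uint8_t packSizePattern(const uint8_t* sizeBytes)
{
    uint32_t dw;
    memcpy(&dw, sizeBytes, sizeof dw);          // unaligned dword load
    return (uint8_t)((dw * 0x01041040u) >> 24); // same as imul + shr above
}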
An extra trick that should be used, at least in the Convenient Input case (otherwise we're stuck with 5x worse performance anyway and this trick won't be relevant), is reading the data for the next iteration before storing the result of the current iteration. Without that trick, the store will often "step on the toes" of the load of the next iteration (because we advance by less than 16 bytes, so the load reads some of the bytes that the store left unchanged but had to write anyway), forcing a memory dependency between them which holds up the next iteration. The performance difference is large, about 3x.
Then the Endianness Flipper could look something like this:
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

void flipEndiannessSSSE3(char* buffer, size_t totalLength,
                         uint8_t* sizePatterns, uint32_t* lengths, __m128i* masks)
{
    size_t i = 0;
    size_t j = 0;
    __m128i data = _mm_loadu_si128((__m128i*)buffer);
    while (i < totalLength) {
        int sizepattern = sizePatterns[j];
        __m128i permuted = _mm_shuffle_epi8(data, masks[sizepattern]);
        size_t next_i = i + lengths[j++];
        // Load the next window *before* storing the current result, so the
        // store does not feed a memory dependency into the next iteration.
        data = _mm_loadu_si128((__m128i*)&buffer[next_i]);
        _mm_storeu_si128((__m128i*)&buffer[i], permuted);
        i = next_i;
    }
}
For example, Clang 10 with -O3 -march=haswell turns that into:
test rsi, rsi
je .LBB0_3
vmovdqu xmm0, xmmword ptr [rdi]
xor r9d, r9d
xor r10d, r10d
.LBB0_2: # =>This Inner Loop Header: Depth=1
movzx eax, byte ptr [rdx + r10]
shl rax, 4
vpshufb xmm1, xmm0, xmmword ptr [r8 + rax]
mov eax, dword ptr [rcx + 4*r10]
inc r10
add rax, r9
vmovdqu xmm0, xmmword ptr [rdi + rax]
vmovdqu xmmword ptr [rdi + r9], xmm1
mov r9, rax
cmp rax, rsi
jb .LBB0_2
.LBB0_3:
ret
LLVM-MCA thinks that takes about 3.3 cycles per iteration; on my PC (a 4770K, tested with a uniform mix of 1, 2, 4 and 8 byte sized elements) it was a little slower, closer to 3.7 cycles per iteration, but that's still good: just under 1.2 cycles per element.