First of all, it might (or might not) be worth doing this on the fly while reading from input[] in some other function. If there's any kind of pattern to the shuffle, it might not be too bad. OTOH, it might defeat prefetching in whatever you feed this array to.
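For example, if the shuffle can be expressed as an index table, something like this folds it into whatever loop already reads input[] (a sketch only; the dest table, the float element type, and the function name are placeholders I made up, not from your code):

    #include <cstddef>

    // Hypothetical scatter table: input element i lands at out[dest[i]].
    void consume_and_scatter(const float* in, float* out,
                             const int* dest, std::size_t n_in)
    {
        for (std::size_t i = 0; i < n_in; ++i) {
            // ... whatever other per-element work already reads in[i] ...
            out[dest[i]] = in[i];   // apply the shuffle while we're here
        }
    }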
Have you tried using __restrict__ to tell the compiler that accesses through one pointer can't alias with accesses through the other pointer? If it can't prove that on its own, it's not allowed to combine loads or stores when the source interleaves them like that.

restrict is a C99 feature that's not included in ISO C++, but the common C++ compilers support __restrict__ or __restrict as an extension. Use a CPP macro to #define __restrict__ __restrict on MSVC, or to the empty string on compilers that don't support any equivalent.
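Something like this is the usual shape of that shim (a sketch; the reorder signature and float element type are just assumptions about your code):

    // Portability shim: __restrict__ is already a keyword on gcc/clang,
    // MSVC spells it __restrict, other compilers get an empty definition.
    #if defined(_MSC_VER)
    #  define __restrict__ __restrict
    #elif !defined(__GNUC__) && !defined(__clang__)
    #  define __restrict__
    #endif

    // Hypothetical signature; the point is just that `in` and `out` are
    // promised not to overlap, so the compiler may combine loads/stores.
    void reorder(const float* __restrict__ in, float* __restrict__ out);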
gcc sucks at load/store coalescing, but clang doesn't.
It's a long-standing gcc bug that it's bad at coalescing loads/stores (see the bugzilla link, but I think I remember seeing another report from even longer ago, maybe the gcc 4.0 era). It's most often encountered with struct copying (generating member-by-member loads/stores), but it's the same issue here.
With __restrict__, clang is able to coalesce most of the loads/stores in your example function into xmm or ymm vectors. It even generates a vector load and scalar vpextrd of the last three elements! See the code + asm output on the Godbolt compiler explorer, from clang++ 3.8 -O3 -march=haswell.

With the same source, g++ 6.1 still totally fails to coalesce anything, even the contiguous zeros. (Try flipping the compiler to gcc on Godbolt.) It even does a poor job with a small memcpy, not using SIMD even though we compile with -march=haswell, where unaligned vector loads/stores are very cheap. :/
If there's any kind of pattern, taking advantage of that in the reorder() function to save code-size will help significantly. Even if loads/stores coalesce into SIMD vectors, you'll still blow out the uop cache and L1 instruction cache. Code-fetch will be competing with data loads/stores for L2 bandwidth. Once your array indices get too big for an 8-bit displacement, the size of each instruction gets even bigger (opcode bytes + ModRM byte + disp32). If it's not going to coalesce, it's too bad that gcc doesn't optimize these moves into 32-bit integer mov instructions (1 opcode byte) instead of movss (3 opcode bytes).
So after this function returns, the rest of your program will run slower than normal for a short time, since the 32kiB L1 instruction cache and the even smaller uop cache will be cold (full of mov instructions from the bloated reorder function). Use perf counters to look at I-cache misses. See also the x86 tag wiki to learn more about x86 performance, especially Agner Fog's guides.
As you suggested in comments, calloc is a good way to avoid the zeroing parts when you need a fresh output buffer. It takes advantage of the fact that new pages from the OS start out zeroed anyway (to avoid information leaks). It's better to reuse an existing buffer than to free it and calloc a new one, though, because the old one will still be hot in cache and/or TLB. And at least the pages will still be wired, instead of having to be faulted in when you first touch them.
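A minimal sketch of that reuse pattern (all the names and the loop structure here are assumptions, not from your code):

    #include <cstddef>
    #include <cstdlib>

    void run(std::size_t out_len, std::size_t num_blocks)
    {
        // One calloc up front: fresh pages from the OS arrive zeroed,
        // so the zero runs in the first output cost nothing here.
        float* out = static_cast<float*>(std::calloc(out_len, sizeof(float)));
        for (std::size_t b = 0; b < num_blocks; ++b) {
            // Reuse `out` every iteration: still hot in cache/TLB,
            // pages already wired.
            // reorder(input_for(b), out);   // hypothetical
            // consume(out);                 // hypothetical
        }
        std::free(out);
    }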
Using memcpy and memset instead of per-element assignments might well help your compile times. If the source is very repetitive, you can probably write something in perl (or the text-manipulation language of your choice) to turn contiguous runs of copying into memcpy calls and contiguous runs of zeroing into memset calls.
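The generated code could then boil down to a handful of bulk calls, roughly like this (every offset and length below is invented purely to show the shape):

    #include <cstddef>
    #include <cstring>

    // Hedged sketch: a script turns the element-by-element assignments
    // into a few bulk memcpy/memset calls over contiguous runs.
    void reorder(const float* __restrict__ in, float* __restrict__ out)
    {
        std::memcpy(out +  0, in + 128, 64 * sizeof(float)); // contiguous copy run
        std::memset(out + 64, 0,        32 * sizeof(float)); // run of zeros
        std::memcpy(out + 96, in +   0, 96 * sizeof(float)); // another copy run
        // ... remaining runs, emitted by the generator ...
    }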
If there are any big runs (like 128 bytes or more), the ideal asm is rep movsd (or rep movsq) on many CPUs, especially recent Intel. gcc usually inlines memcpy as rep movs instead of calling library memcpy when the size is known at compile time, and you can even tune the strategy (SIMD vs. rep movs) with -mstringop-strategy. The savings in code-size will probably be a significant benefit for you, if there's no pattern that lets you code this as a loop.
If your pattern allows it, it's probably worth copying a larger contiguous block and then going back to zero or copy something else into a few elements, because rep movs has significant startup overhead but extremely good performance once it's up and running. (It avoids read-for-ownership overhead when it's storing entire cache lines, on Intel CPUs even before IvB's fast rep movsb feature, according to Andy Glew, who implemented it in P6.)
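For instance, something along these lines trades a few redundant stores for one long, rep-movs-friendly copy (all offsets are made up):

    #include <cstring>

    // Hedged sketch: one big contiguous copy that the compiler can inline
    // as rep movs, followed by a handful of fix-up stores.
    void copy_run_then_patch(const float* in, float* out)
    {
        std::memcpy(out, in, 256 * sizeof(float));  // one long run
        out[17]  = 0.0f;          // a few elements inside the run actually
        out[18]  = 0.0f;          // needed to be zero...
        out[200] = in[1000];      // ...or come from somewhere else
    }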
If you can't just compile this object file with clang instead of gcc, maybe you should look at generating the asm for it yourself. If this slows down your program significantly (and copying around that much memory + nuking the instruction cache might well do that), then some text-processing could convert a list of ranges into asm that sets up rsi, rdi, and ecx for rep movsd.
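Each generated block could be wrapped in a helper roughly like this GNU-style inline-asm sketch (untested; it relies on the direction flag being clear, which the ABI guarantees at function entry):

    #include <cstddef>

    // Hedged sketch of one generated range-copy: copies `dwords` 32-bit
    // elements from src to dst with rep movsd. The constraints pin the
    // operands to rdi / rsi / rcx, as the instruction requires.
    static inline void rep_movsd_copy(void* dst, const void* src, std::size_t dwords)
    {
        asm volatile("rep movsd"
                     : "+D"(dst), "+S"(src), "+c"(dwords) // all three are updated
                     :
                     : "memory");
    }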
Reading the input in order might give better performance than writing the output in order. Cache-miss stores usually have less impact on the pipeline than cache-miss loads. OTOH, doing all the stores for a single cache-line together is probably a good thing. Worth playing with if this is a significant bottleneck.
If you did use intrinsics, it might be worth using NT stores for the parts of a contiguous run that cover whole cache lines (64B size, 64B aligned), if your array is really big. Or maybe do all the stores in sequential order, using NT stores throughout?
NT loads might be a good idea, but IDK if the NT hint does anything at all on normal write-back memory. They're not weakly-ordered, but there are ways they could cause less cache pollution (see that link for my guess).
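A sketch of the NT-store idea with AVX intrinsics (assumes dst is 32-byte aligned, n is a multiple of 8, and you compile with -march=haswell or similar; all the names here are mine, not from your code):

    #include <immintrin.h>
    #include <cstddef>

    // Hedged sketch: stream a cache-line-aligned run of floats to the
    // output with NT stores, so the copy doesn't pollute the cache.
    void copy_run_nt(float* dst, const float* src, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 v = _mm256_loadu_ps(src + i);  // source may be unaligned
            _mm256_stream_ps(dst + i, v);         // NT store: dst must be 32B-aligned
        }
        _mm_sfence();  // order the NT stores before any later normal stores
    }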
Doing the shuffle in-place might be a good idea, if that's useful for your program. Since the output includes some runs of zeroes, I assume it's longer than the input. In that case, doing it in-place is probably easiest if you start from the end of the array. I suspect that an in-place shuffle isn't what you want, so I won't say too much more about it.