
I have implemented an SSE4.2 version of memcpy, but I cannot seem to beat _intel_fast_memcpy on a Xeon v3. I use my routine inside a gather routine in which the data varies between 4 and 15 bytes at each location. I've looked at many posts here and on Intel's website with no luck. What is a good source I should look at?
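
For reference, the hot loop looks roughly like this (a simplified sketch; the descriptor layout and names are made up for illustration, not my exact code):

#include <string.h>
#include <stddef.h>

// Each gathered element is 4..15 bytes at some scattered location.
struct elem {
    const char *src;
    size_t      len;   // 4 <= len <= 15
};

// The hot loop: copy each variable-sized chunk into one contiguous buffer.
static char *gather(char *dst, const struct elem *e, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        memcpy(dst, e[i].src, e[i].len);   // the call I'm trying to beat
        dst += e[i].len;
    }
    return dst;
}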

nineties
  • I'm not sure I understand *why* you believe you can beat an optimized routine written by engineers with access to the designers of the silicon. – Zan Lynx Aug 25 '16 at 20:12
  • And a suggestion: if you can just pad all your data in every location to 16 bytes your memcpy becomes a fixed 16 byte copy at every location. No need to check the length or use unaligned access. – Zan Lynx Aug 25 '16 at 20:14
  • And back to memcpy: even non-obvious things like the number of bytes between branch and branch targets, or the alignment of instructions can affect branch prediction and i-cache hit rates. Copy their code exactly, then start tweaking it, see where the performance drops off. – Zan Lynx Aug 25 '16 at 20:17
  • I did not believe I could beat it, but it was a surprise when my custom implementation was on par with Intel's. Padding would be very expensive since the data is big. Currently, I use maskload and maskstore with predefined masks. Since these only support 32-bit operations, I have to take care of leftovers, which is why my code is only on par... – nineties Aug 25 '16 at 20:28
  • So what's your question? Is _intel_fast_memcpy not fast enough? If not, why not? – Rob K Aug 25 '16 at 20:42
  • The gather operation takes a big chunk out of the runtime, so I was wondering whether I can use SIMD to speed it up, since I know my data size range. I'm asking if you know of any good source. – nineties Aug 25 '16 at 20:57
  • 1
    There are lots of performance links in the [x86 tag wiki](http://stackoverflow.com/tags/x86/info), especially [Agner Fog's stuff](http://agner.org/optimize). When you say maskload and maskstore, you mean [the AVX versions (`VPMASKMOV`)](http://www.felixcloutier.com/x86/VPMASKMOV.html), not the slow byte-granularity [SSE version (`MASKMOVDQU`)](http://www.felixcloutier.com/x86/MASKMOVDQU.html) with the NT hint, right? – Peter Cordes Aug 26 '16 at 00:00
  • @PeterCordes, yes, I was using the AVX instructions for maskload and maskstore. I somehow did not see the MASKMOVDQU you mentioned. It is exactly what I need. Is there something similar for loading from memory? I could not find any in Intel's intrinsics guide. – nineties Aug 26 '16 at 03:32
  • Don't use `maskmovdqu`. It's *very* slow, because it bypasses the cache. (it's an NT store). – Peter Cordes Aug 26 '16 at 03:38

1 Answer


Can you do your gathers with a 16B load and store, and then just overlap however many garbage bytes were at the end?

// Full 16B copies; the next copy overwrites whatever garbage bytes
// landed past the real data.
char *dst = something;
__m128 tmp = _mm_loadu_ps((const float*)src1);
_mm_storeu_ps((float*)dst, tmp);
dst += src1_size;               // advance only by the element's real size

tmp = _mm_loadu_ps((const float*)src2);
_mm_storeu_ps((float*)dst, tmp);
dst += src2_size;

...

Overlapping stores are efficient (the L1 cache soaks them up just fine), and modern CPUs handle them well. Unaligned loads/stores are cheap enough that I don't think you can beat this, assuming a typical amount of page-split loads; even a higher-than-average amount of cache-line-split loads probably won't be a problem.

This means no conditional branches inside the inner loop to decide on a copying strategy, and no mask generation. All you need is up to 12B of extra space at the end of your gather buffer, in case the last copy was only supposed to be 4B. (You also need the elements you're gathering not to sit within 16B of the end of a page whose following page is unmapped or not readable.)
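
Putting that together, the whole inner loop can be branchless on the element size. A minimal sketch, assuming a (pointer, size) descriptor per element (the struct and function names are illustrative):

#include <immintrin.h>
#include <stddef.h>

struct elem {
    const char *src;
    size_t      len;   // 4 <= len <= 15
};

// Assumes dst has ~12B of slack after the last element's real data,
// and every src is safe to read out to src + 16.
static char *gather16(char *dst, const struct elem *e, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        __m128 v = _mm_loadu_ps((const float *)e[i].src);
        _mm_storeu_ps((float *)dst, v);   // full 16B store; garbage tail gets overwritten
        dst += e[i].len;                  // advance by the real size only
    }
    return dst;
}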

If reading past the end of the elements you're gathering is a problem, then maybe vpmaskmov for the loads will actually be a good idea. If your elements are 4B-aligned, then it's always fine to read up to 3 bytes beyond the end. You can still use a normal 16B vector store into your dst buffer.
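
A sketch of that variant using the AVX `_mm_maskload_ps` (`vmaskmovps`) form, assuming 4B-aligned elements (the mask table and function name are mine, for illustration):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// load_mask[k] enables the first k dwords of a 128-bit masked load.
static const int32_t load_mask[5][4] = {
    {  0,  0,  0,  0 },
    { -1,  0,  0,  0 },
    { -1, -1,  0,  0 },
    { -1, -1, -1,  0 },
    { -1, -1, -1, -1 },
};

// Copy one 4..15-byte element without touching memory past its dword-rounded end.
static void copy_elem(char *dst, const char *src, size_t len)
{
    size_t dwords = (len + 3) / 4;   // over-reading up to 3B of a 4B-aligned element is fine
    __m128i m = _mm_loadu_si128((const __m128i *)load_mask[dwords]);
    __m128  v = _mm_maskload_ps((const float *)src, m);   // masked load of whole dwords
    _mm_storeu_ps((float *)dst, v);   // still a plain 16B store into dst
}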


I used _ps loads because movups is 1 byte shorter than movupd or movdqu, but performs the same (see Agner Fog's microarch pdf, and other links in the x86 tag wiki). (clang will even use movaps / movups for _mm_store_si128 sometimes.)


re: your comment: Don't use legacy SSE maskmovdqu. The biggest problem is that it only works as a store, so it can't help you avoid reading outside the objects you're gathering. It's slow, and it bypasses the cache (it's an NT store), making it extremely slow when you come to reload this data.

The AVX versions (vmaskmov and vpmaskmov) aren't like that, so converting your code to use maskmovdqu would probably be a big slowdown.


Related: I posted a Q&A about using vmaskmovps for the end of unaligned buffers a while ago. I got some interesting responses. Apparently it's not usually the best way to solve any problem, even though my (clever IMO) strategy for generating a mask was pretty efficient.
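
Roughly, that mask-generation idea is an unaligned load from a sliding window of constants (a minimal 128-bit sketch of the approach; the constant array and function name here are just for illustration):

#include <immintrin.h>
#include <stdint.h>

// A run of -1s followed by 0s: loading 4 dwords starting at index (4 - count)
// yields a mask whose first `count` dwords are all-ones.
static const int32_t mask_window[8] = { -1, -1, -1, -1, 0, 0, 0, 0 };

static inline __m128i first_n_dwords_mask(unsigned count)   // count in 0..4
{
    return _mm_loadu_si128((const __m128i *)(mask_window + 4 - count));
}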

MASKMOVPS is very much one of those "it seemed like a good idea at the time" things AFAICT. I've never used it. – Stephen Canon

Peter Cordes