First of all, it might (or might not) be worth doing this on the fly while reading from input[] in some other function. If there's any kind of pattern to the shuffle, it might not be too bad. OTOH, it might defeat prefetching in whatever you feed this array to.
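For example, if the shuffle can be expressed as an index table, something like this folds it into whatever loop already reads input[] (a sketch only; the dest table, the float element type, and the function name are placeholders I made up, not from your code):

    #include <cstddef>

    // Hypothetical scatter table: input element i lands at out[dest[i]].
    void consume_and_scatter(const float* in, float* out,
                             const int* dest, std::size_t n_in)
    {
        for (std::size_t i = 0; i < n_in; ++i) {
            // ... whatever other per-element work already reads in[i] ...
            out[dest[i]] = in[i];   // apply the shuffle while we're here
        }
    }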
Have you tried using __restrict__ to tell the compiler that accesses through one pointer can't alias with accesses through the other pointer? If it can't prove that on its own, it's not allowed to combine loads or stores when the source interleaves them like that.

restrict is a C99 feature that's not included in ISO C++, but the common C++ compilers support __restrict__ or __restrict as an extension. Use a CPP macro to #define __restrict__ __restrict on MSVC, or to the empty string on compilers that don't support any equivalent.
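Something like this is the usual shape of that shim (a sketch; the reorder signature and float element type are just assumptions about your code):

    // Portability shim: __restrict__ is already a keyword on gcc/clang,
    // MSVC spells it __restrict, other compilers get an empty definition.
    #if defined(_MSC_VER)
    #  define __restrict__ __restrict
    #elif !defined(__GNUC__) && !defined(__clang__)
    #  define __restrict__
    #endif

    // Hypothetical signature; the point is just that `in` and `out` are
    // promised not to overlap, so the compiler may combine loads/stores.
    void reorder(const float* __restrict__ in, float* __restrict__ out);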
gcc sucks at load/store coalescing, but clang doesn't.
It's a long-standing gcc bug that it's bad at coalescing loads/stores (see the bugzilla link, but I think I remember seeing another report from even longer ago, maybe the gcc 4.0 era). It's most often encountered with struct copying (generating member-by-member loads/stores), but it's the same issue here.
With __restrict__, clang is able to coalesce most of the loads/stores in your example function into xmm or ymm vectors. It even generates a vector load and scalar vpextrd of the last three elements! See the code + asm output on the Godbolt compiler explorer, from clang++ 3.8 -O3 -march=haswell.

With the same source, g++ 6.1 still totally fails to coalesce anything, even the contiguous zeros. (Try flipping the compiler to gcc on Godbolt.) It even does a poor job with a small memcpy, not using SIMD even though we compile with -march=haswell, where unaligned vector loads/stores are very cheap. :/
If there's any kind of pattern, taking advantage of that in the reorder() function to save code-size will help significantly. Even if loads/stores coalesce into SIMD vectors, you'll still blow out the uop cache and L1 instruction cache. Code-fetch will be competing with data loads/stores for L2 bandwidth. Once your array indices get too big for an 8-bit displacement, the size of each instruction gets even bigger (opcode bytes + ModRM byte + disp32). If it's not going to coalesce, it's too bad that gcc doesn't optimize these moves into 32-bit integer mov instructions (1 opcode byte) instead of movss (3 opcode bytes).
So after this function returns, the rest of your program will run slower than normal for a short time, since the 32kiB L1 instruction cache and the even smaller uop cache will be cold (full of mov instructions from the bloated reorder function). Use perf counters to look at I-cache misses. See also the x86 tag wiki to learn more about x86 performance, especially Agner Fog's guides.
As you suggested in comments, calloc is a good way to avoid the zeroing parts when you need a fresh output buffer. It takes advantage of the fact that new pages from the OS start out zeroed anyway (to avoid information leaks). It's better to reuse an existing buffer than to free it and calloc a new one, though, because the old one will still be hot in cache and/or TLB. And at least the pages will still be wired, instead of having to be faulted in when you first touch them.
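A minimal sketch of that reuse pattern (all the names and the loop structure here are assumptions, not from your code):

    #include <cstddef>
    #include <cstdlib>

    void run(std::size_t out_len, std::size_t num_blocks)
    {
        // One calloc up front: fresh pages from the OS arrive zeroed,
        // so the zero runs in the first output cost nothing here.
        float* out = static_cast<float*>(std::calloc(out_len, sizeof(float)));
        for (std::size_t b = 0; b < num_blocks; ++b) {
            // Reuse `out` every iteration: still hot in cache/TLB,
            // pages already wired.
            // reorder(input_for(b), out);   // hypothetical
            // consume(out);                 // hypothetical
        }
        std::free(out);
    }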
Using memcpy and memset instead of per-element assignments might well help your compile times. If the source is very repetitive, you can probably write something in perl (or the text-manipulation language of your choice) to turn contiguous runs of copying into memcpy calls and contiguous runs of zeroing into memset calls.
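The generated code could then boil down to a handful of bulk calls, roughly like this (every offset and length below is invented purely to show the shape):

    #include <cstddef>
    #include <cstring>

    // Hedged sketch: a script turns the element-by-element assignments
    // into a few bulk memcpy/memset calls over contiguous runs.
    void reorder(const float* __restrict__ in, float* __restrict__ out)
    {
        std::memcpy(out +  0, in + 128, 64 * sizeof(float)); // contiguous copy run
        std::memset(out + 64, 0,        32 * sizeof(float)); // run of zeros
        std::memcpy(out + 96, in +   0, 96 * sizeof(float)); // another copy run
        // ... remaining runs, emitted by the generator ...
    }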
If there are any big runs (like 128 bytes or more), the ideal asm is rep movsd (or rep movsq) on many CPUs, especially recent Intel. gcc usually inlines memcpy as rep movs instead of calling library memcpy when the size is known at compile time, and you can even tune the strategy (SIMD vs. rep movs) with -mstringop-strategy. The savings in code-size will probably be a significant benefit for you, if there's no pattern that lets you code this as a loop.
If your pattern allows it, it's probably worth copying a larger contiguous block and then going back to zero or copy something else into a few elements, because rep movs has significant startup overhead but extremely good performance once it's up and running. (It avoids read-for-ownership overhead when it's storing entire cache lines, on Intel CPUs even before IvB's fast rep movsb feature, according to Andy Glew, who implemented it in P6.)
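For instance, something along these lines trades a few redundant stores for one long, rep-movs-friendly copy (all offsets are made up):

    #include <cstring>

    // Hedged sketch: one big contiguous copy that the compiler can inline
    // as rep movs, followed by a handful of fix-up stores.
    void copy_run_then_patch(const float* in, float* out)
    {
        std::memcpy(out, in, 256 * sizeof(float));  // one long run
        out[17]  = 0.0f;          // a few elements inside the run actually
        out[18]  = 0.0f;          // needed to be zero...
        out[200] = in[1000];      // ...or come from somewhere else
    }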
If you can't just compile this object file with clang instead of gcc, maybe you should look at generating the asm for it yourself. If this slows down your program significantly (and copying around that much memory + nuking the instruction cache might well do that), then some text-processing could convert a list of ranges into asm that sets up rsi, rdi, and ecx for rep movsd.
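Each generated block could be wrapped in a helper roughly like this GNU-style inline-asm sketch (untested; it relies on the direction flag being clear, which the ABI guarantees at function entry):

    #include <cstddef>

    // Hedged sketch of one generated range-copy: copies `dwords` 32-bit
    // elements from src to dst with rep movsd. The constraints pin the
    // operands to rdi / rsi / rcx, as the instruction requires.
    static inline void rep_movsd_copy(void* dst, const void* src, std::size_t dwords)
    {
        asm volatile("rep movsd"
                     : "+D"(dst), "+S"(src), "+c"(dwords) // all three are updated
                     :
                     : "memory");
    }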
Reading the input in order might give better performance than writing the output in order. Cache-miss stores usually have less impact on the pipeline than cache-miss loads. OTOH, doing all the stores for a single cache-line together is probably a good thing. Worth playing with if this is a significant bottleneck.
If you did use intrinsics, it might be worth using NT stores for the parts of a contiguous run that cover whole cache lines (64B size, 64B aligned), if your array is really big. Or maybe do all the stores in sequential order, using NT stores throughout?
NT loads might be a good idea, but IDK if the NT hint does anything at all on normal write-back memory. They're not weakly-ordered, but there are ways they could cause less cache pollution (see that link for my guess).
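A sketch of the NT-store idea with AVX intrinsics (assumes dst is 32-byte aligned, n is a multiple of 8, and you compile with -march=haswell or similar; all the names here are mine, not from your code):

    #include <immintrin.h>
    #include <cstddef>

    // Hedged sketch: stream a cache-line-aligned run of floats to the
    // output with NT stores, so the copy doesn't pollute the cache.
    void copy_run_nt(float* dst, const float* src, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 v = _mm256_loadu_ps(src + i);  // source may be unaligned
            _mm256_stream_ps(dst + i, v);         // NT store: dst must be 32B-aligned
        }
        _mm_sfence();  // order the NT stores before any later normal stores
    }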
Doing the shuffle in-place might be a good idea, if that's useful for your program. Since the output includes some runs of zeroes, I assume it's longer than the input. In that case, doing it in-place is probably easiest if you start from the end of the array. I suspect that an in-place shuffle isn't what you want, so I won't say too much more about it.