5

Let's say I have the following main loop

.L2:                                        ; hot inner loop: one 8-float (32-byte) triad per iteration
    vmulps          ymm1, ymm2, [rdi+rax]   ; ymm1 = ymm2 * x[i..i+7]  (ymm2 = broadcast constant, set up outside the loop)
    vaddps          ymm1, ymm1, [rsi+rax]   ; ymm1 += y[i..i+7]
    vmovaps         [rdx+rax], ymm1         ; z[i..i+7] = ymm1 (aligned store; pointers pre-biased past the end)
    add             rax, 32                 ; rax is a negative byte offset counting up; sets ZF when it reaches 0
    jne             .L2                     ; loop until the whole array has been processed

The way I would time this is to put it in another long loop like this

;align 32                                  ; (optional) alignment of the outer-loop entry
.L1:                                        ; outer repeat loop: re-runs the inner loop r8 times for timing
    mov             rax, rcx                ; rcx = total byte count of one pass (n*sizeof(float))
    neg             rax                     ; rax = -byte_count; [base+rax] walks upward toward offset 0
align 32                                    ; align the hot inner loop to a 32-byte boundary
.L2:                                        ; inner loop: one 8-float triad per iteration
    vmulps          ymm1, ymm2, [rdi+rax]   ; ymm1 = const * x[i..i+7]
    vaddps          ymm1, ymm1, [rsi+rax]   ; ymm1 += y[i..i+7]
    vmovaps         [rdx+rax], ymm1         ; z[i..i+7] = ymm1 (aligned store)
    add             rax, 32                 ; advance one YMM width; sets ZF when offset hits 0
    jne             .L2                     ; stay in inner loop until the array is done
    sub             r8d, 1                 ; r8 contains a large integer: the repeat count
    jnz             .L1                     ; run the whole pass again

What I'm finding is that the alignment I choose can have a significant effect on the timing (up to +-10%). It's not clear to me how to choose the code alignment. There are three places I can think of where I might want to align the code

  1. At the entry to the function (see e.g. triad_fma_asm_repeat in the code below)
  2. At the start of the outer loop (.L1 above) which repeats my main loop
  3. At the start of my main loop (.L2 above).

Another thing I have found is that if I put another routine in my source file, then changing one instruction in it (e.g. removing an instruction) can have a significant effect on the timing of the next function, even when the two functions are independent. In the past I have even seen this affect a routine in another object file.

I have read section 11.5 "Alignment of code" in Agner Fog's optimizing assembly manual, but it's still not clear to me what the best way is to align my code for performance testing. He gives an example (11.5) of timing an inner loop, which I don't really follow.

Currently getting the highest performance from my code is a game of guessing different values and locations of alignment.

I would like to know if there is an intelligent method for choosing the alignment. Should I align both the inner and the outer loop? Just the inner loop? The entry to the function as well? Does using short or long NOPs matter?

I'm mostly interested in Haswell, followed by SNB/IVB, and then Core2.


I have tried both NASM and YASM and have discovered that this is one area where they differ significantly: NASM only inserts one-byte NOP instructions, whereas YASM inserts multi-byte NOPs. For example, by aligning both the inner and outer loop above to 32 bytes, NASM inserted 20 NOP (0x90) instructions, whereas YASM inserted the following (from objdump)

  2c:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 nopw  %cs:0x0(%rax,%rax,1)
  33:   0f 1f 84 00 00 00 00 
  3a:   00 
  3b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

So far I have not observed a significant difference in performance with this. It appears that it's alignment that matters not the instruction length. But Agner writes in the aligning code section:

It is more efficient to use longer instructions that do nothing than to use a lot of single-byte NOP's.


If you want to play with the alignment and see the effects yourself, below you can find both the assembly and C code I use. Replace double frequency = 3.6 with the effective frequency of your CPU. You may want to disable turbo.

;nasm/yasm -f elf64 align_asm.asm
;-----------------------------------------------------------------------
; float triad_fma_asm_repeat(float *x, float *y, float *z, int n, int repeat)
; ABI:   SysV AMD64 — RDI=x, RSI=y, RDX=z, RCX=n, R8=repeat
; Does:  "repeat" times: z[i] = y[i] + 3.14159f*x[i] for i in [0, n)
;        (despite the name, this uses separate vmulps/vaddps, not FMA)
; Assumes: n > 0 and a multiple of 8; x, y, z 32-byte aligned
;          (vmovaps faults on an unaligned store)
; Out:   no meaningful return value is produced
; Clobb: rax, rcx, rdx, rdi, rsi, r8d, ymm1, ymm2, flags
;-----------------------------------------------------------------------
global triad_fma_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;z[i] = y[i] + 3.14159*x[i]
pi: dd 3.14159                             ; NOTE(review): emitted into the default section, before the explicit "section .text" below — the constant ends up in the code section

section .text
align 16
triad_fma_asm_repeat:

    shl             rcx, 2                 ; rcx = n*4 = byte count of each array
    add             rdi, rcx               ; point x, y, z one-past-the-end so that
    add             rsi, rcx               ; a negative index rax can count up to 0
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]         ; ymm2 = {3.14159f x8}
    ;neg                rcx

;align 32
.L1:                                       ; outer repeat loop (runs r8 times)
    mov             rax, rcx
    neg             rax                    ; rax = -n*4: negative byte offset into the arrays
align 32
.L2:                                       ; hot inner loop: 8 floats per iteration
    vmulps          ymm1, ymm2, [rdi+rax]  ; ymm1 = pi * x[i..i+7]
    vaddps          ymm1, ymm1, [rsi+rax]  ; ymm1 += y[i..i+7]
    vmovaps         [rdx+rax], ymm1        ; z[i..i+7] = ymm1 (aligned store)
    add             rax, 32                ; advance one YMM width; sets ZF at offset 0
    jne             .L2
    sub             r8d, 1                 ; count down the timed repeats
    jnz             .L1
    vzeroupper                             ; avoid AVX->SSE transition penalty in the C caller
    ret

global triad_fma_store_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;z[i] = y[i] + 3.14159*x[i]
; Variant that indexes by the absolute store address: r9 walks from z up
; to z + n*4, and x/y are pre-biased by -z so [rdi+r9]/[rsi+r9] still hit
; the right elements. Uses an explicit cmp against an end pointer instead
; of the negative-index add/jne trick above.
; Clobbers: rcx, rdi, rsi, r8d, r9, ymm1, ymm2, flags.

align 16
    triad_fma_store_asm_repeat:
    shl             rcx, 2                 ; rcx = n*4 (byte count)
    add             rcx, rdx               ; rcx = z + n*4 = one-past-the-end of z
    sub             rdi, rdx               ; bias x by -z so x element = [rdi + store_ptr]
    sub             rsi, rdx               ; bias y by -z likewise
    vbroadcastss    ymm2, [rel pi]         ; ymm2 = {3.14159f x8}

;align 32
.L1:                                       ; outer repeat loop (runs r8 times)
    mov             r9, rdx                ; r9 = current store pointer, starts at z
align 32
.L2:                                       ; hot inner loop: 8 floats per iteration
    vmulps          ymm1, ymm2, [rdi+r9]   ; ymm1 = pi * x[i..i+7]
    vaddps          ymm1, ymm1, [rsi+r9]   ; ymm1 += y[i..i+7]
    vmovaps         [r9], ymm1             ; z[i..i+7] = ymm1 (requires 32-byte-aligned z)
    add             r9, 32                 ; advance one YMM width
    cmp             r9, rcx                ; reached the end of z?
    jne             .L2
    sub             r8d, 1                 ; count down the timed repeats
    jnz             .L1
    vzeroupper                             ; avoid AVX->SSE transition penalty in the C caller
    ret

Here is the C code I use to call the assembly routines and time them

//gcc -std=gnu99 -O3        -mavx align.c -lgomp align_asm.o -o align_avx
//gcc -std=gnu99 -O3 -mfma -mavx2 align.c -lgomp align_asm.o -o align_fma
#include <stdio.h>
#include <string.h>

#include <immintrin.h>  // __m256, _mm256_* intrinsics, _mm_malloc — used below but was never included
#include <omp.h>

float triad_fma_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
float triad_fma_store_asm_repeat(float *x, float *y, float *z, const int n, int repeat);

float triad_fma_repeat(float *x, float *y, float *z, const int n, int repeat)
{
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        }
    }
}

// Benchmark driver: fills x and y, computes a reference result z2 once
// with the intrinsics version, then loops forever timing the two asm
// kernels, printing achieved bandwidth, % of theoretical peak, and a
// memcmp() against the reference (0 => results match).
int main (void )
{
    int bytes_per_cycle = 0;
    double frequency = 3.6;                 // effective core clock in GHz — edit for your CPU (see text above)
    // Peak bytes/cycle for the 2-load + 1-store triad.
    // NOTE(review): the three values correspond to FMA (HSW), AVX
    // (SNB/IVB) and pre-AVX targets — confirm for your CPU.
    #if (defined(__FMA__))
    bytes_per_cycle = 96;
    #elif (defined(__AVX__))
    bytes_per_cycle = 48;
    #else
    bytes_per_cycle = 24;
    #endif
    double peak = frequency*bytes_per_cycle;    // theoretical GB/s at "frequency" GHz

    const int n =2048;                          // 2048 floats = 8 KiB per array

    // z2: reference output; mem: one 4096-byte-aligned slab holding
    // x, y and z back to back, so their relative offsets are fixed.
    float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
    char *mem = (char*)_mm_malloc(1<<18,4096);
    char *a = mem;
    char *b = a+n*sizeof(float);
    char *c = b+n*sizeof(float);

    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;

    // x[i] == y[i], so x + k*y and y + k*x produce identical results and
    // the memcmp check is insensitive to the operand order of the triad.
    for(int i=0; i<n; i++) {
        x[i] = 1.0f*i;
        y[i] = 1.0f*i;
        z[i] = 0;
    }
    int repeat = 1000000;    
    triad_fma_repeat(x,y,z2,n,repeat);          // build the reference result once

    while(1) {                                  // runs forever; watch the printed rates
        double dtime, rate;

        memset(z, 0, n*sizeof(float));
        dtime = -omp_get_wtime();
        triad_fma_asm_repeat(x,y,z,n,repeat);   // timed kernel #1 (negative-index form)
        dtime += omp_get_wtime();
        // 3 streams (x, y, z) of n floats moved per repeat
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("t1     rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));

        memset(z, 0, n*sizeof(float));
        dtime = -omp_get_wtime();
        triad_fma_store_asm_repeat(x,y,z,n,repeat); // timed kernel #2 (store-pointer form)
        dtime += omp_get_wtime();
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("t2     rate %6.2f GB/s, efficency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));

        puts("");
    }
}

I'm bothered by the following statement in the NASM manual

A final caveat: ALIGN and ALIGNB work relative to the beginning of the section, not the beginning of the address space in the final executable. Aligning to a 16-byte boundary when the section you're in is only guaranteed to be aligned to a 4-byte boundary, for example, is a waste of effort. Again, NASM does not check that the section's alignment characteristics are sensible for the use of ALIGN or ALIGNB.

I'm not sure the code segment is getting an absolute 32-byte aligned address or only a relative one.

Community
  • 1
  • 1
Z boson
  • 32,619
  • 11
  • 123
  • 226
  • 1
    I'm curious at why code-alignment even matters for small loops like this. I'd assume that after one iteration, it would be decoded and stored in the uop cache or loop-back buffer. – Mysticial Oct 30 '15 at 15:21
  • @Mysticial, yes, those were words I was looking for I should have had in my question: uop cache or loop-back buffer. I wonder the same thing. But it's what I observe. It's difficult to make very precise statements about timing right now when alignment can have so much an effect. – Z boson Oct 30 '15 at 15:24
  • It's possible that they need to handle the case of self-modifying code. So the processor still needs to probe the addresses of the original instructions. – Mysticial Oct 30 '15 at 15:25
  • @Mysticial, have you observed this before? I had seen it on IVB but did not notice on HSW until yesterday. I was looking into [/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62](http://stackoverflow.com/questions/25899395/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62) again yesterday after [Peter Cordes proved that micro-fusion does not work with multi-register addressing mode](http://stackoverflow.com/a/31027695/2542702) and then I found that by removing one instruction in another routine it effect the performance in another function I was testing. – Z boson Oct 30 '15 at 15:29
  • @Mysticial, so now I don't have much confidence in the efficiencies I quote/measure, at least not more than say +-10%. – Z boson Oct 30 '15 at 15:31
  • 1
    I don't write much assembly anymore. I just rely on the compiler to deal with the alignment stuff. – Mysticial Oct 30 '15 at 15:32
  • @Mysticial, I'm sure I could figure it out by dissecting Agner Fog's [testp](http://www.agner.org/optimize/testp.zip) code. But that takes time which maybe someone on SO could save me from wasting. Also I think the question may be interesting to others. – Z boson Oct 30 '15 at 15:35
  • 1
    I have read something from Agner (don't remember where) that mentioned op-fusion as being affected by alignment. It mentioned something like fusion would fail if the two instructions are decoded in different cycles - which is definitely affected by the alignment of the instructions. – Mysticial Oct 30 '15 at 15:39
  • Seems like memory was only partially correct. I found what I was referring to: "The branch instruction should not start at a 16-bytes boundary or cross a 16-bytes boundary." - [page 106](http://www.agner.org/optimize/microarchitecture.pdf). – Mysticial Oct 30 '15 at 15:50
  • @Mysticial, well that could explain some cases if I'm unlucky. What's strange to me is I think my inner and outer loop are 31 bytes they should fit in the uop cache. Don't know about the loop back buffer. How does alignment matter fro the uop cache? I guess I have to start reading the microarch manual again. I had not written assembly in nearly a year. – Z boson Oct 30 '15 at 16:02
  • @Mysticial, p121 "The μop cache is organized as 32 sets x 8 ways x 6 μops, totaling a maximum capacity of 1536 μops. It can allocate a maximum of 3 lines of 6 μops each for each aligned and contiguous 32-bytes block of code." That sounds like alignment could make a big difference. – Z boson Oct 30 '15 at 19:46
  • @Mysticial: On more recent CPUs, the decoders hold onto the last uop if it's one that can micro-fuse, in case the next instruction is a jcc. This is worse for the not-in-uop-cache case, but better for the already-cached case. I'll have another look at this question in a few hours, and see if I have any ideas. gtg for now. – Peter Cordes Oct 31 '15 at 22:43
  • 2
    NASM has the ability to do smarter alignment as well, it just isn't the default behaviour. Use the smartalign directive to enable it - it will be more strategic with NOPs, and insert jmps if the alignment is large enough for it to be beneficial. http://www.nasm.us/doc/nasmdoc5.html I suspect yasm just includes this behaviour by default. – Matt O Nov 02 '15 at 06:00
  • @MattO, thank you, I just tested it and it creates multi-byte NOPs as you say. Now I just need to figure out the ideal alignment. Probably I should be using performance counters but I do things by experiment. – Z boson Nov 02 '15 at 20:29
  • @PeterCordes, if you come up with anything please let me know. Do you think my question is clear enough or is it somewhat ambiguous? Do you think it's worth making a bounty out of or have I over looked something silly? I don't use assembly every day (except for the last week). – Z boson Nov 02 '15 at 20:52
  • It's still on my to-do list. I think it's clear enough. I'm surprised alignment makes any diff, though, since the 28uop loop buffer doesn't have any of the complications of 6-uop cache lines. If you're using short NOPs, the loop buffer *may* suffer from the same limitation as the uop cache: not being able to cache more than 18 uops for 32B of code (forcing the loop to swap between the decoders and the loop buffer for every outer-loop iteration). – Peter Cordes Nov 03 '15 at 01:04
  • Oh, I have an idea: if two macro-fusable pairs hit the decoders in the same cycle, Agner Fog's docs say only one pair will fuse. Alignment could cause that pair of alu/jcc insns to all hit the decoders in the same cycle. I forget if he specifically re-tested this on SnB, but probably, because he did find what I said before, that macro-fusable instructions decode more slowly (if they're the last in a group of 4), to enable finding more macro-fusion opportunities for the benefit of the uop cache. – Peter Cordes Nov 03 '15 at 01:10
  • You can use perf counters to see how many uops per insn you're getting. (`ocperf.py` on Linux has symbolic names for more hardware-specific counters, like uops, compared to `perf`). When I get around to testing this hypothesis about macro-fusion not happening, or if you do, I'll post it as an answer. – Peter Cordes Nov 03 '15 at 01:15
  • It's hard to imagine this lack of macro-fusion for the 2nd pair (outer loop) making a 10% difference, though. It would bring the outer loop from 8 to 9 uops (assuming no micro-fusion, so all the insns with memory operands are 2 uops each). Maybe there's a minor frontend bottleneck then, so the scheduler can't see as far ahead to optimally schedule instructions and keep everything busy? – Peter Cordes Nov 03 '15 at 01:52
  • 1
    @PeterCordes, it turns out I can get the effect even without the reads and writes. With simply the loop counters and branches. So it probably has nothing to do with micro-op fusion. That's a red herring. I'll post code shortly. It's quite absurd now but I get the best result by putting in 1024 nop operation at the start of the function and then aligning both the inner and outer loop by 32 bytes. I get 92% efficiency this way (it's only an increment and branch so it should be one per cycle). Without alignment I can get it down to 55% efficiency. So that' more than +-10%. – Z boson Nov 03 '15 at 08:20
  • @Zboson: I was speculating about **macro**-fusion, *not* **micro**-fusion. But it really sounds like a 32B boundary or 6-uop cache-line effects somehow matter, even though the loop buffer should give this loop full 4-uop per clock throughput out of the frontend. – Peter Cordes Nov 04 '15 at 00:29

2 Answers2

2

Regarding your last question about relative (within-section) alignment and absolute (in memory at runtime) - you don't have to worry too much. Just below the section of the manual you quoted which warns about ALIGN not checking the section alignment, you have this:

Both ALIGN and ALIGNB do call SECTALIGN macro implicitly. See section 4.11.13 for details.

So basically ALIGN doesn't check that the alignment is sensible, but it does call the SECTALIGN macro so that the alignment will be sensible. In particular, all the implicit SECTALIGN calls should ensure that the section is aligned to the largest alignment specified by any align call.

The warning about ALIGN not checking then probably only applies to more obscure cases, e.g., when assembling into formats that don't support section alignment, when specifying an alignment larger than that supported by a section, or when SECTALIGN OFF has been called to disable SECTALIGN.

BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • I think you may have explained a problem I had a while ago that I never solved. Sadly, I'm a little rusty with this stuff. Maybe I will find some time to look into it again soon. Interesting enough I actually though about this problem recently and how it annoyed me that my timing results seemed to have an uncertainty of +-10% and I could not explain why. – Z boson Oct 09 '16 at 14:13
0

Your loop should ideally (just about) execute in one iteration per clock cycle, since it has four µops (the fused add/jne pair counting as one). A critical question is the predictability of the inner-loop branch. Up to 16 iterations it should be predicted in the timing code, being always the same, but after that you might be struggling. Firstly, to answer your question, the key alignments for timing are to ensure that neither the code after the jne .L2, nor the first instruction after .L2, crosses a 32-byte boundary. I presume that the real question is how to make it run faster, and if my guess of > 16 iterations is correct, the key objective is to make the branch prediction work. To make your timing runs shorter should be easy - it is sufficient to have several branches that are all predictable. To make the final code run faster, however, depends on how the real-world values of rax vary, and this will depend also on the routine that calls the loop.