
The CPU in my testbed is "Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz", so it is Ivy Bridge, a third-generation Intel Core processor.

I use the test program (testp) provided by Agner Fog to do the analysis.

According to the test results below, if I align the asm code, the front-end stall cycles become much smaller, but the number of instructions and uops increases, and overall the core cycles are similar.

So what should I do to optimize the asm code? Or where am I wrong?

Please pay attention to the last column. As I understand it, it means a front-end stall, so I try to reduce it by using ".align 16" to get better performance. You can see I use three .align directives, and RESOURCE_STALLS.ANY drops to almost zero.

If I don't add .align 16, the test result is:

  Clock    BrTaken   Core cyc   Instruct    Uops        Uops F.D.    rs any
60763677   49076140   66968169  229533042  164009900   164144552   17506254 
59512474   49076140   66856747  229533041  164003819   164133385   17481613

If I align the asm code, the test result is:

  Clock    BrTaken   Core cyc   Instruct     Uops       Uops F.D.     rs any 
73537615   49076140   82188535  311530045  246006910   246064159        3 
74063027   49076140   82217717  311530045  246002615   246067520        4 

If I remove the 2nd .align and move the 3rd before the label, the result is:

     Clock    BrTaken   Core cyc   Instruct       Uops     rs any 
  59930621   49076358   66898617  245995143  180472213   17440098 
  59596305   49076140   66886858  245993040  180461441   17487217

The information about the counters:

               event       mask   comment
    BrTaken    0xa2        0x01   branches taken
    Uops       0xc2        0x01   uops retired, unfused domain
    Uops F.D.  0x0e        0x01   uops, fused domain (Sandy Bridge)
    rs any     0xa2        0x01   cycles in which allocation is stalled for a resource-related reason
    Instruct   0x40000000         instructions (reference counter)

The asm code, which I got from gcc -O1:

.section .text
.globl sum_sort_O1
.type sum_sort_O1, @function
sum_sort_O1:
        movl    $100000, %esi
        movl    $0, %eax
        jmp     .L2 
#.align 16 --- 1
.L6:
        movl    (%rdi,%rdx), %ecx
        cmpl    $127, %ecx
        jle     .L3 
#.align 16 --- 2
        movslq  %ecx, %rcx
        addq    %rcx, %rax
.L3:
#.align 16 --- 3
        addq    $4, %rdx
        cmpq    $131072, %rdx
        jne     .L6 
        subl    $1, %esi
        je      .L5 
.L2:
        movl    $0, %edx
        jmp     .L6 
.L5:
        rep ret 
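
For reference, this is roughly how the inner-loop region looks with the alignment used for the third set of results above (2nd .align removed, 3rd moved in front of the .L3 label). It is a reconstruction, not the exact listing I assembled:

        jmp     .L2
.align 16              # 1st align: its padding sits between the jmp and the loop top, so it is never executed
.L6:
        movl    (%rdi,%rdx), %ecx
        cmpl    $127, %ecx
        jle     .L3
        movslq  %ecx, %rcx
        addq    %rcx, %rax
.align 16              # 3rd align, now before the label: the jle target .L3 is aligned,
.L3:                   # and the padding NOPs run only on the fall-through (>127) path
        addq    $4, %rdx
        cmpq    $131072, %rdx
        jne     .L6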

The C code:

#define  arraySize  32768

long long sort(int* data)
{
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}

============================================================== Intel Performance-Monitor Events

(screenshot of the Intel performance-monitor event definitions omitted)

========================================================================= Optimization

I'm reading the document "64-ia-32-architectures-optimization-manual.pdf" and I found the "Top-Down Analysis Method" in Appendix B.1. I am trying to optimize this section of code with that method.

The first problem I met is branch misprediction, and we can sort the data to resolve it.

The second problem is this one: the front-end stall and resource_stall counts are very big. If I add .align, the stalls are much reduced, but the total time is not. So I think my direction is right but my method is not good; that is why I am posting the problem here to ask for ideas.

I know that optimizing this section of code is a big job; I am trying to optimize it step by step with the Top-Down Analysis Method.
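
For readers who have not seen it: at level 1 the method splits the pipeline slots into four categories using a small set of counters. The formulas below are my own summary of that level-1 breakdown; please check Appendix B of the manual for the authoritative event names:

    Slots           = 4 * CPU_CLK_UNHALTED.THREAD
    Frontend Bound  = IDQ_UOPS_NOT_DELIVERED.CORE / Slots
    Retiring        = UOPS_RETIRED.RETIRE_SLOTS / Slots
    Bad Speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / Slots
    Backend Bound   = 1 - (Frontend Bound + Bad Speculation + Retiring)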

========================================================================== Background

I'm interested in the question "Why is it faster to process a sorted array than an unsorted array?"

Yes, the problem that the author of that question met is branch prediction.

And I want to optimize the code further, once the branch prediction problem has been resolved.

Forward
    You have three instances of `.align 16` in your code. The second one seems completely out of place to me and I think you should remove it. The third one needs to go before the label instead of after the label. It is counter-productive where it is. I'd be very interested to know how these suggestions affect your results. – prl Mar 09 '18 at 03:27
  • I'm not sure if you are trying to improve the performance of this code, or simply want to know why it behaves as it does. If the former, you can remove the jle and replace it with cmovle. Also you can remove the movslq. ecx is known to be greater than 0 and the upper half of rcx is known to be 0. – prl Mar 09 '18 at 03:33
  • @prl, I'm trying to optimize it. In another test case I used cmovl to reduce the branching, but the result was similar to this one; from my point of view, the reason is that the data has been sorted. – Forward Mar 09 '18 at 05:28
  • @prl, I don't understand what you mean by "It is counter-productive where it is. I'd be very interested to know how these suggestions affect your results" — do you mean why I add .align 16? Or...? – Forward Mar 09 '18 at 05:29
  • The purpose of the align directive is to align the target of the jump. But since the label is before the align, the target of the jump is *unaligned*, plus it then has to execute all the NOPs that were inserted by the align. – prl Mar 09 '18 at 05:31
  • If you remove the second of the three align directives, and move the third one before the label, what results do you get, and how do they compare with the results when no align directives are present? – prl Mar 09 '18 at 05:32
  • I added the test result to the question. There is a small increase in instructions as well as uops, and the core cycles and rs.any are similar to the code with no align. – Forward Mar 09 '18 at 05:46
  • If you want to optimize this in general, not just understand the alignment effects on what you see with perf counters, there are a few obvious things: Use `movslq` for the load from memory, instead of separately loading and then using an ALU uop for `movslq`. Intel CPUs (unlike AMD) handle `movslq` as a pure load uop, not micro-fused ALU+load, because the load port itself can sign extend. Some unrolling to amortize the loop overhead would help, too. – Peter Cordes Mar 09 '18 at 06:06
  • Indexing from the end of the array with a negative index counting up towards zero would save the `cmp` instruction in the loop branch. Indexed addressing modes aren't great, but [for a pure load (not micro-fused)](https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes) there's no penalty (unlamination) even on IvyBridge. If you're intentionally optimizing for the sorted case, you could make two loops, and switch to the other loop after seeing a value with the other `if()` condition, so there are fewer taken branches in the loop. Helps more on Haswell than IvB though – Peter Cordes Mar 09 '18 at 06:10 (a sketch combining these scalar suggestions follows the comments)
  • Obviously you're making this semi-artificial by not just using SSE4 / AVX1 to check the condition and mask, [then `vpmovsxdq`](https://github.com/HJLebbink/asm-dude/wiki/PMOVSX). Although that is a little clunky, taking 3 shuffles to get 2 sign-extended vectors from one vector of 4 `int` elements. – Peter Cordes Mar 09 '18 at 06:13 (a rough SSE4 sketch also follows the comments)
  • @PeterCordes, I know that there is lots of work needed to optimize this code; I am trying to follow the Top-Down Analysis Method to identify the problems and resolve them one by one. – Forward Mar 12 '18 at 02:31
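
Putting the scalar suggestions from prl and Peter Cordes together, the inner loop could look roughly like the sketch below. This is untested and the register choices are mine (%r8 as a spare zero register is my addition); it assumes %rdi is advanced to one past the end of the array once, before the loops:

        leaq    131072(%rdi), %rdi      # rdi = &data[arraySize], done once outside the loops
        xorl    %r8d, %r8d              # r8 = 0: the value to add when data[c] < 128
        movl    $100000, %esi
        movl    $0, %eax
.Louter:
        movq    $-131072, %rdx          # negative byte offset, counts up toward zero
.Linner:
        movslq  (%rdi,%rdx), %rcx       # sign-extending load: a single load uop, no ALU uop
        cmpl    $128, %ecx
        cmovl   %r8, %rcx               # rcx = 0 when data[c] < 128, else data[c]
        addq    %rcx, %rax              # sum += (data[c] >= 128 ? data[c] : 0)
        addq    $4, %rdx                # this add sets ZF when the offset reaches zero...
        jne     .Linner                 # ...so the separate cmpq disappears
        subl    $1, %esi
        jne     .Louter
        rep ret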
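And a rough sketch of the SSE4.1 approach from the last comment (also untested and mine, not from the post). It assumes %xmm5 holds four copies of 127 and %xmm6/%xmm7 are zeroed accumulators, all set up before the loop; the loop control and the final horizontal sum of %xmm6 + %xmm7 are omitted:

        movdqu  (%rdi,%rdx), %xmm0      # load 4 ints
        movdqa  %xmm0, %xmm1
        pcmpgtd %xmm5, %xmm1            # per-element mask: data > 127 ? -1 : 0
        pand    %xmm1, %xmm0            # zero the elements that fail the condition
        pmovsxdq %xmm0, %xmm2           # sign-extend the low two ints to qwords
        pshufd  $0xEE, %xmm0, %xmm0     # bring the high two ints down
        pmovsxdq %xmm0, %xmm3           # sign-extend them too (three shuffle uops in total)
        paddq   %xmm2, %xmm6            # accumulate into two running qword sums
        paddq   %xmm3, %xmm7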

1 Answer


Alignment of loops[0] may help performance on some architectures in some scenarios, but it may not. In particular, when you omit an explicit directive it may be that the natural alignment is just fine[1].

How likely this is depends on why alignment matters. Let's say you suffer a penalty if your loop body crosses a 16-byte boundary within the first 5 bytes of instructions from the top: even if you don't force alignment, this only has a ~27% chance of happening if the entry point is randomly distributed, so in this case alignment usually makes no difference.

Even if alignment removes a penalty in one part of the pipeline (usually from the front-end), it may very well be that that component isn't the overall bottleneck, so avoiding the penalty doesn't improve the final performance.

Newer x86 CPUs have also in some ways reduced the impact of code alignment by introducing things like the uop cache, and having overall better front-end performance (e.g., decoding 5 instructions/cycle on Skylake) which makes it less likely that the front-end is a bottleneck.

More to the point, in your particular example, the last two of the three .align locations are internal to the main loop! This means that the padding instructions inserted for alignment may be executed frequently, so alignment can have a large downside that may more than cancel out any gains. You want alignment only in cases where the alignment padding is rarely executed, such as before a hot loop - but not inside it! This problem is why your instruction count jumps by a large amount in the "aligned" case.
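
To make that concrete, here is a small illustration (not taken from the question's code) of the two placements discussed in the comments above:

# .align after the label: the jump target .L3 stays unaligned, and the NOP
# padding the assembler inserts is executed on every iteration that reaches .L3.
.L3:
        .align 16
        addq    $4, %rdx

# .align before the label: the target itself is aligned, and the padding runs
# only when the code above falls through into it.
        .align 16
.L3:
        addq    $4, %rdx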

Anyways, alignment will have little effect on an algorithm like this one with dense non-loop branching, unless all those branches are well predicted: the branch misprediction costs will dominate instead.


[0] Generally speaking this refers to aligning the first instruction in a loop, aka the "loop top", but as it turns out other types of alignment may also be useful (these other types are never implemented in compilers, to my knowledge).

[1] This is in contrast to, say, architectures which suffer a penalty on misaligned memory access. There it is often the case that only "exactly aligned" accesses run at full speed, but anything else is misaligned and suffers a penalty (and there may be even larger penalties for special misaligned cases, e.g., cache-line or page-crossing accesses). Alignment for code generally doesn't work like that: alignment is only a heuristic that generally helps avoid some possible penalty, but there may be many other "not aligned" configurations that also avoid the penalty.

BeeOnRope
  • Hello BeeOnRope, in this case I observed that rs.any is very high; as I understand it, that means a front-end stall, so I tried to reduce it by adding align. rs.any is reduced to almost 0, but the instructions and uops increase. So is there a method which could reduce rs.any without increasing the instruction count, so that the core cycles could be reduced? – Forward Mar 09 '18 at 05:55
  • Well the `rs` events have subcategories you can check which tell you the components that make up the `all` number, so that would be something to check. The "resource stall" counters can be very misleading though: just reducing them doesn't necessarily improve performance: you can have a large number of resource stalls, but the actual bottleneck may be elsewhere, so reducing the resource stalls to zero may have next to no impact at all (or it may have a very large one). You are really asking a more general question about optimization now though - why not create a new question? – BeeOnRope Mar 09 '18 at 06:05
  • @Forward All that said, alignment of loops will most certainly not be the low-hanging fruit here - there will be many things to do first. Make your other question include an MCVE, and in particular describe the distribution of the input data you are interested in. – BeeOnRope Mar 09 '18 at 18:35
  • I agree with your idea partially: alignment of loops may not reduce the core cycles, and optimizing this section of code is a big job. But from my point of view my direction is right; I follow the Top-Down Analysis Method in the document "64-ia-32-architectures-optimization-manual". I am trying to optimize the code step by step; now I have met this issue and am trying to resolve it. – Forward Mar 12 '18 at 02:20
  • @forward - as I mentioned, the advice to align loops doesn't usually apply when the alignment instructions end up _inside_ a hot loop. Loop alignment will rarely help a resource stall: these are things further down the pipeline like allocating RS entries or load buffers or whatever. See [here](https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/pmw_sp/events/resource_stalls.html). – BeeOnRope Mar 12 '18 at 02:46
  • So, you mean I am using the wrong method to resolve this front-end stall issue, or, from your point of view, that the stall is not the bottleneck? Do you agree with the method I used to do the analysis? – Forward Mar 12 '18 at 03:07
  • @Forward - potentially both. You are definitely using the wrong method to resolve this "issue" since code alignment isn't going to help any of the possible RS stall sources (those are all things that happen later in the pipeline), and also it might be the case that an RS "stall" may not be an "issue to be resolved" at all. The latter is hard to say without more details. I recommend using `toplev.py` rather than trying to implement the topdown approach by hand: this tool already includes all the logic and interpretation. – BeeOnRope Mar 12 '18 at 18:57
  • All that said, comments aren't the right place to get into a detailed discussion of something different than the specific question you asked originally. If you have a new query about how to optimize this code, ask it and link it here. – BeeOnRope Mar 12 '18 at 18:57
  • Thanks BeeOnRope, I get your idea; you have given me a direction and I'll read more about toplev.py. Maybe I'll use it to do more analysis, and create a new question if I meet a new issue. – Forward Mar 13 '18 at 05:12