The CPU in my testbed is "Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz", so it is Ivy Bridge, a third-generation Intel Core processor.
I use testp, provided by Agner Fog, to do the analysis.
According to the test results below, aligning the asm code makes the front-end stall cycles much smaller, but the number of instructions and uops increases, and overall the core cycles do not improve.
So what should I do to optimize the asm code? Or where am I wrong?
Please pay attention to the last column. As I understand it, it measures the front-end stall, so I tried to reduce it with ".align 16" to get better performance. As you can see, I use three .align directives, and with them RESOURCE_STALLS.ANY is almost zero.
If I don't add .align 16, the test result:
Clock     BrTaken   Core cyc  Instruct   Uops       Uops F.D.  rs any
60763677  49076140  66968169  229533042  164009900  164144552  17506254
59512474  49076140  66856747  229533041  164003819  164133385  17481613
If I align the asm code, the test result:
Clock     BrTaken   Core cyc  Instruct   Uops       Uops F.D.  rs any
73537615  49076140  82188535  311530045  246006910  246064159  3
74063027  49076140  82217717  311530045  246002615  246067520  4
If I remove the 2nd .align and move the 3rd before the label, the result:
Clock     BrTaken   Core cyc  Instruct   Uops       rs any
59930621  49076358  66898617  245995143  180472213  17440098
59596305  49076140  66886858  245993040  180461441  17487217
The information of the counter:
name       event       mask  comment
BrTaken    0xa2        0x01  branches taken
Uops       0xc2        0x01  uops retired, unfused domain
Uops F.D.  0x0E        0x01  uops, fused domain, Sandy Bridge
rs any     0xa2        0x01  cycles in which allocation is stalled for a resource-related reason
Instruct   0x40000000        instructions (reference counter)
The asm code, which I got from gcc -O1:
    .section .text
    .globl sum_sort_O1
    .type sum_sort_O1, @function
sum_sort_O1:
    movl $100000, %esi
    movl $0, %eax
    jmp .L2
    #.align 16 --- 1
.L6:
    movl (%rdi,%rdx), %ecx
    cmpl $127, %ecx
    jle .L3
    #.align 16 --- 2
    movslq %ecx, %rcx
    addq %rcx, %rax
.L3:
    #.align 16 --- 3
    addq $4, %rdx
    cmpq $131072, %rdx
    jne .L6
    subl $1, %esi
    je .L5
.L2:
    movl $0, %edx
    jmp .L6
.L5:
    rep ret
The C code:
#define arraySize 32768
long long sort(int* data)
{
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum; /* the asm above returns the sum in %rax */
}
============================================================== Intel Performance-Monitor Event
========================================================================= Optimization
I'm reading the document "64-ia-32-architectures-optimization-manual.pdf", and in Appendix B.1 I found the "TOP-DOWN ANALYSIS METHOD". I am trying to optimize this section of code with that method.
The first problem I met is branch misprediction, which can be resolved by sorting the data.
The second problem is this one: the front-end stall and RESOURCE_STALLS.ANY are very large. If I add .align, the stalls are much reduced, but the total time is not. So I think my direction is right but my method is not good, which is why I am posting the problem here to ask for ideas.
I know that optimizing this section of code is a lot of work, so I am trying to optimize it step by step with the "TOP-DOWN ANALYSIS METHOD".
========================================================================== Background
I'm interested in the question "Why is it faster to process a sorted array than an unsorted array?"
Yes, the problem the author of that question met is branch prediction.
I want to optimize the code further once the branch prediction problem has been resolved.