
One of my data structures and algorithms questions is to analyse assembly code and figure out its time complexity. I just don't understand the example my professor gave me.

It is to turn C code into assembly instructions using http://gcc.godbolt.org/

The C code is:

int main(void)
{
    int a[] = {11, -22, 33}, length = 3, count = 0;

    for (int i = 0; i < length; i++)
        if (a[i] > 5)
            count++;
}

The assembly code is:

main:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $11, -32(%rbp)
    movl    $-22, -28(%rbp)
    movl    $33, -24(%rbp)
    movl    $3, -12(%rbp)
    movl    $0, -4(%rbp)
    movl    $0, -8(%rbp)
.L4:
    movl    -8(%rbp), %eax
    cmpl    -12(%rbp), %eax
    jge .L2
    movl    -8(%rbp), %eax
    cltq
    movl    -32(%rbp,%rax,4), %eax
    cmpl    $5, %eax
    jle .L3
    addl    $1, -4(%rbp)
.L3:
    addl    $1, -8(%rbp)
    jmp .L4
.L2:
    movl    $0, %eax
    popq    %rbp
    ret

And his explanation is:

Best case: Number of Instructions = 2 + n + 3 + 3(n+1) + 5n + 0 + 2n + 3 = 11n + 11

Worst case: Replace 0 by n to get Number of Instructions = 12n + 11

His rough explanation is:

Because we both approximate and also ignore actual constants (eg 11*t0), it is easier to reason as follows. The loop is executed n times, and the loop contents takes constant time, so total time is O(n). Most of the time we only use a rough analysis, which can be justified by the detailed analysis.

I do not understand the explanation. Can someone explain this to me better?

albert
  • If you're looking at asm on godbolt, don't hurt your brain wading through all the loads / stores that you get with no optimization. Between `-Og` (optimize for debugging), `-O2`, and `-O3` for this code, I found `-O2` by far the easiest to follow. The only conditional branch is the loop. It uses `setcc` to do count += (a[i]>5) without branches. You have to put your code in a separate function so it doesn't compile away to nothing, of course, like when it inlines. See http://goo.gl/m2TBnn. `-O3` autovectorizes it into a wild mess. :P – Peter Cordes Sep 17 '15 at 05:03
  • I know having `-O2` optimize it into branchless code is not at all what you're asking, defeating the purpose of this exercise. It's still true that it's usually easier to follow branchless code, esp. with meaningless-number labels and basic blocks in weird orders. For exercises like this, I'd recommend `-Og` for looking at asm output. `-Og` keeps the asm matching the source. It means "optimize for debugging". It keeps the C loop structure, with a compare at the top and a `jmp` to the bottom, which sucks for asm readability if you're expecting a usual asm loop. – Peter Cordes Sep 17 '15 at 05:09
  • Forgot to mention: `-fverbose-asm` appears to work very well with `-Og`. With higher optimization levels, most registers are holding some `tmp.1234`. – Peter Cordes Sep 17 '15 at 05:24

2 Answers


If I understand your version of your professor's question correctly, he's trying to teach you the difference between algorithmic order of magnitude and exact execution time.

The exact execution time takes a lot of work to compute.

  • There is a loop that will have either 10 or 11 instructions, depending on the value of a test. Your best case is that no element of the array a[] is greater than 5 (so nothing has to be incremented); your worst case is that all the elements are greater than 5 (so they all have to be incremented).
  • There is the pre-loop setup, which takes 1 instruction per element in the array to initialize plus 5 more instructions. Including the extra instruction per element of the array takes the number of instructions per "loop" (really, instructions per element in the array) from [10 or 11] to [11 or 12].
  • There are the first three instructions of the loop, which perform the loop test and will be executed one time more than the rest of the loop; this makes the extra execution of the instructions overhead, even though they're included in the count of instructions in the loop.
  • There are the post-loop instructions, 3 more overhead instructions.

When you add them all up, you get your best and worst case equations: 11n + 11 is 11 instructions per array element + 11 overhead; 12n + 11 is 12 instructions per array element + 11 overhead. (For the three-element array in the question, n = 3, so that's 44 instructions in the best case and 47 in the worst.)
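
To connect those equations back to the professor's sum, 2 + n + 3 + 3(n+1) + 5n + 0 + 2n + 3, here is one way to line the terms up against the assembly listing. The grouping is my own reading of the listing, so treat it as a sketch rather than the professor's official accounting:

    2        function prologue (pushq, movq)
    n        array initialization (one movl per element; n = 3 here)
    3        initialization of length, count, and i (three movl)
    3(n+1)   loop test (movl, cmpl, jge), executed n+1 times
    5n       load a[i] and compare it to 5 (movl, cltq, movl, cmpl, jle)
    0        count++ (addl): 0 times in the best case, n in the worst
    2n       i++ and jump back (addl, jmp)
    3        function epilogue (movl, popq, ret)

The constants sum to 11 and the n-terms sum to 11n, giving 11n + 11 in the best case; the worst case adds one more n for the count++ instruction, giving 12n + 11.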

As for the algorithmic order of magnitude, the code is extremely predictable, so it's easy to determine. Because there are no early exits or multiple passes per input datum, it takes one loop iteration per array element: O(n). If (say) it compared every element to every other element, and therefore needed an inner loop of n elements inside an outer loop of n elements, that would be O(n^2). Simple sort routines tend to be like that.
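
To make that contrast concrete, here's a minimal C sketch of the all-pairs pattern just described (the function name and task are invented for illustration):

int count_duplicate_pairs(const int *a, int n)
{
    int pairs = 0;
    for (int i = 0; i < n; i++)          /* outer loop: n iterations */
        for (int j = i + 1; j < n; j++)  /* inner loop: up to n-1 iterations */
            if (a[i] == a[j])
                pairs++;                 /* each unordered pair checked once */
    return pairs;
}

The single loop in the question touches each element once, O(n); this one performs roughly n*n/2 comparisons, O(n^2).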

What I believe is the core of the lesson is the realization that computing the exact time of execution is HARD. This little example didn't take into account things like how fast your memory bus is, whether your test data fit into the processor's cache, and a dozen other things that could affect the time of execution. Understanding the general order of magnitude is easy by comparison (though on complex algorithms, it can still be hard). And usually the order is all you need. On the rare occasions that you need the detailed analysis, you now understand how to do it... but the rough analysis is good enough for almost anything you'll need to do.

In the normal world of professional development, only specialists generally need to care about instruction-by-instruction timing. Understanding the order of algorithms will let you compare your ideas on a higher level, and get to the right answer in a much more reasonable time.

Hope this helps!

Erik Johnson
  • "There is a loop that will have either 10 or 11 instructions" – I don't understand this. In the loop I count 4 or 5 (the for loop, then the if and count++); what makes it 10 or 11? – albert Sep 17 '15 at 02:04
  • @albert: "Instructions" is an asm term. You're counting C statements. The C loop compiles to 11 instructions, from `.L4: ` to `jmp .L4`. The `addl $1, -4(%rbp)` can be skipped by a conditional jump: `cmpl $5, %eax / jle .L3`. Ironically, what will make this loop slow or fast is not how many elements are above or below 5. What matters is how randomly the branch flips between taken and not taken on successive iterations. If there's a pattern (e.g. always taken, or alternating), it'll be fast. If random, slow. Branch prediction is important http://stackoverflow.com/a/11227902/224132 – Peter Cordes Sep 17 '15 at 05:28
  • That's another of the key concepts that make computing *true* execution time really, REALLY hard, but I didn't want to swamp him with details. :) In grad school, back in the day, we did an analysis of the pipeline depth of a certain mainframe processor. Depending on the instruction, it could take anywhere from 15 to 23 clocks, and that was before branch prediction & potential misses. Actual computing time is easier to measure than determine theoretically. – Erik Johnson Sep 17 '15 at 18:15

Let me rephrase:

Because we both approximate and also ignore actual constants (eg 11*t0), it is easier to reason as follows.

When talking about complexity, we don't care about constants. In other words, a program with 11*n instructions has the same complexity as a program with 12*n instructions, O(n).

The loop is executed n times, and the loop contents takes constant time, so total time is O(n).

The loop is executed n times; in this case, n = length. In each iteration we evaluate a[i] > 5; if it's true we do an addition, and if it's false we don't. Either way, the body of the loop executes a bounded number of instructions, i.e. it takes O(1) time. If you do an O(1) task n times, you get O(n) complexity.
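
One way to convince yourself that the body really is constant time is to rewrite it without the branch. In C, a[i] > 5 evaluates to 0 or 1, so the conditional increment collapses into a single addition. (Per the comments on the question, this is reportedly what gcc -O2 emits via setcc; the function wrapper here is just for illustration.)

int count_above_5(const int *a, int length)
{
    int count = 0;
    for (int i = 0; i < length; i++)
        count += (a[i] > 5);  /* adds 0 or 1: a fixed amount of work per element */
    return count;
}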

Most of the time we only use a rough analysis, which can be justified by the detailed analysis.

The rough analysis is O(1)*n = O(n). The detailed analysis is counting instructions.

The detailed analysis is when you count instructions in both the best case and the worst case. If you get the same complexity, in our case O(n), then you've shown that your rough analysis is correct.

This program executes the fewest instructions when a[i] > 5 is always false (count is never incremented). That gives the 11n + 11 instruction count.

It executes the most instructions when a[i] > 5 is always true (count is incremented every iteration). That gives the 12n + 11 instruction count.

When you drop the constants, both counts grow linearly in n, so our estimate of O(n) time is correct.
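
If you want to watch the constants wash out numerically, here's a small self-contained sketch (my own, not part of the original analysis) that divides both instruction counts by n as n grows:

#include <stdio.h>

int main(void)
{
    for (long n = 10; n <= 1000000; n *= 10) {
        long best  = 11 * n + 11;  /* a[i] > 5 never true  */
        long worst = 12 * n + 11;  /* a[i] > 5 always true */
        printf("n=%8ld  best/n=%.3f  worst/n=%.3f\n",
               n, (double)best / n, (double)worst / n);
    }
    return 0;
}

Each ratio settles at its constant (11 and 12): linear growth in both cases, differing only by the constant factor that big-O discards.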

Adam