1

I'm learning C. Consider the following code snippet:

#include <stdio.h>

int main(void) {
  int fahr;
  float calc;

  for (fahr = 300; fahr >= 0; fahr = fahr - 20) {
    calc = (5.0 / 9.0) * (fahr - 32);
    printf("%3d %6.1f\n", fahr, calc);
  }

  return 0;
}

This prints a Fahrenheit-to-Celsius conversion table from 300 down to 0. I compile it with:

$ clang -std=c11 -Wall -g -O3 -march=native main.c -o main

I also generate assembly code with this command:

$ clang -std=c11 -Wall -S -masm=intel -O3 -march=native main.c -o main

This generates a 1.26 kB file of 71 lines.

I then edited the code slightly and moved the logic into a separate function that is called from main():

#include <stdio.h>

void foo(void) {
  int fahr;
  float calc;

  for (fahr = 300; fahr >= 0; fahr = fahr - 20) {
    calc = (5.0 / 9.0) * (fahr - 32);
    printf("%3d %6.1f\n", fahr, calc);
  }
}

int main(void) {
  foo();
  return 0;
}

This generates a 2.33 kB assembly file of 128 lines.

Running both programs with time ./main, I see no difference in execution speed.

My question is: does it make any sense to try to optimize your C programs by the length of the generated assembly code?

Lanti
  • 2,299
  • 2
  • 36
  • 69
  • 2
    You don't see any difference because the only difference between the two programs is **one** call to `foo`. Look at the generated assembly code and don't look only at the size of the executable, which is not relevant. BTW most of the CPU time is used by `printf` anyway; rough estimation: 95% – Jabberwocky Jun 13 '16 at 10:04
  • 4
    If you are learning C, just focus on the language itself; don't worry about optimizations. You can optimize later, when you are more proficient with the language. – Giacomo Degli Esposti Jun 13 '16 at 10:05
  • 1
    @GiacomoDegliEsposti: if you already know how computers work, watching how C compiles into asm will give you a deeper understanding of C, since it's close enough to asm to understand how this happens for simple functions. Most of the time, though, you should force yourself not to think too much about the asm, and just write code that is human-readable and works reliably. – Peter Cordes Jun 13 '16 at 12:17
  • 1
    Fewer instructions don't necessarily indicate speed. There are many easy-to-create situations where more instructions are faster than fewer... "it depends". The only way to test for speed is to measure the speed. With experience you may be able to figure out what things take longer than others and be able to tune your high-level code for speed. But you first have to start by accurately timing the code, then adjusting things and understanding why those adjustments helped or hurt. For an x86 system with so many variants and variables, it's hardly worth the time, as tuning for one makes another slower. – old_timer Jun 13 '16 at 16:52

3 Answers

12

It seems that you are comparing the sizes of the .s files generated by the compiler. Since that by itself makes no sense, I'll assume you were really comparing the binary size of the code generated for the two snippets.

While, all other conditions being equal, a smaller code size may give an increase in speed (due to higher code density), in general x86 CPUs are complex enough to require decoupling optimizations for code size from optimizations for code speed.

Specifically, if you aim for code speed you should optimize for... code speed. Sometimes this requires choosing the shortest snippet, sometimes it doesn't.

Consider the classic example of compiler optimization, multiplication by powers of two:

int i = 4;
i = i * 8;

This might be translated, rather naively, as:

;NO optimizations at all

mov eax, 4        ;i = 4        B804000000       0-1 clocks
imul eax, 8       ;i = i * 8    6BC009           3 clocks
                  ;eax = i      8 bytes total    3-4 clocks total

;Slightly optimized
;4*8 gives no sign issue, we can use shl

mov eax, 4        ;i = 4        B804000000       0-1 clocks
shl eax, 3        ;i = i * 8    C1E003           1 clock
                  ;eax = i      8 bytes total    1-2 clocks total

Both snippets have the same code length, but the second performs nearly twice as fast.

This is a very basic example1, where there is not even much need to take the micro-architecture into account.
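
To see it in practice, a minimal sketch like the following, compiled with -O2, is generally turned into a shift or an lea rather than an imul (the exact choice depends on the compiler and target):

int mul8(int x) {
    /* Typically becomes a single shift (shl eax, 3) or an lea, not imul. */
    return x * 8;
}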

Another, more subtle, example is the following, taken from Agner Fog's discussion of partial register stalls2:

;Version A                        Version B

mov al, byte ptr [mem8]           movzx ebx, byte ptr [mem8]
mov ebx, eax                      and eax, 0ffffff00h
                                  or ebx, eax

;7 bytes                           14 bytes

Both versions give the same result, but Version B is 5-6 clocks faster than Version A despite being twice its size.


The answer is then no, code size is not enough; it may be a tie-breaker though.

If you are really interested in optimizing assembly, you will enjoy these two readings:

The first link also has a manual on optimizing C and C++ code.


If you write in C, remember that the optimizations with the biggest impact are 1) how data is represented/stored, i.e. data structures, and 2) how data is processed, i.e. algorithms.
These are the macro-optimizations.

Taking the generated assembly into account means shifting into micro-optimization, and there the most useful tools are 1) a smart compiler and 2) a good set of intrinsics3.
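
As a small sketch of the latter (assuming GCC or Clang), builtins such as __builtin_popcount map to a single instruction when the target supports it:

#include <stdint.h>

/* With -mpopcnt (or -march=native on a CPU that has it), GCC and Clang
   compile this builtin to a single popcnt instruction; otherwise they
   fall back to a short bit-twiddling sequence or a library helper. */
unsigned count_bits(uint32_t x) {
    return (unsigned)__builtin_popcount(x);
}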


1 So simple that it would be optimized away entirely in practice.
2 Maybe a little obsolete now but it serves the purpose.
3 Built-in, non standard, functions that translate into specific assembly instructions.

Margaret Bloom
  • 41,768
  • 5
  • 78
  • 124
3

As always, the answer is "it depends". Sometimes making code longer makes it more efficient: for example, the CPU doesn't have to waste instructions jumping back at the end of every loop iteration. A classic example (literally 'classic': 1983!) is "Duff's Device". The following code

register short *to, *from;
register count;
{
    do {                          /* count > 0 assumed */
        *to = *from++;
    } while(--count > 0);
}

was made much faster by using this much larger, and more complicated, code:

register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}

But that can be taken to extremes: making code too large increases cache misses and causes all sorts of other issues. In short: "premature optimisation is evil" - you need to measure before and after, and often on multiple platforms, before deciding that an optimization is a good idea.

And I'll ask you: is the second version of the above code "better" than the first version? It's less readable, less maintainable, and far more complex than the code it replaces.
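
For comparison, a plain unrolled loop with a separate cleanup loop gets most of the same benefit while staying readable. A sketch (assuming, as in Duff's original, that to is a fixed memory-mapped output location that is not incremented):

/* Copy 'count' shorts to the fixed output location 'to', unrolled by 4,
   with a cleanup loop for the remaining 0-3 elements. */
void copy_unrolled(volatile short *to, const short *from, int count) {
    int i = 0;
    for (; i + 4 <= count; i += 4) {
        *to = from[i];
        *to = from[i + 1];
        *to = from[i + 2];
        *to = from[i + 3];
    }
    for (; i < count; i++)
        *to = from[i];
}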

John Burger
  • 3,662
  • 1
  • 13
  • 23
  • IDK why people love to trot out Duff's Device instead of a simple example of loop unrolling with separate prologue/epilogue loops. (Or `gcc -funroll-loops` to have the compiler do it for you). The straight-forward way to compile that loop with 8 entry points loses a lot of the benefit of unrolling on an architecture like x86 without auto-increment addressing modes. (i.e. separate `inc` instructions for each step, instead of one increment and then displacements in the addressing mode.) – Peter Cordes Jun 13 '16 at 11:54
  • @Peter I trot out Duff's Device as a counter-example. It's (at first) horrendously complicated to newbies; it's a massively egregious use of C; it's ultimately understandable to most programmers; and it compiles into something not much more than a simple loop-unroll despite its complexity. I deliberately choose it to show why you should **NOT** optimise - and yet I still love Michael Abrash's book "Zen of Code Optimization" ISBN 1-883577-03-9 – John Burger Jun 13 '16 at 12:08
  • My point is that loop unrolling is a very good optimization for tiny loops, and can easily be done without Duff's Device. So it's a straw-man argument against optimization. But yeah, having a simple implementation as a baseline for perf improvements is essential. You can't know if your "clever" idea is a win or not without something to compare against. – Peter Cordes Jun 13 '16 at 12:13
2

The code that actually runs is the same in both cases, after inlining. The second version is bigger because the compiler also has to emit a stand-alone definition of the function, in addition to the copy inlined into main.

You would have avoided this if you'd used static on the function: the compiler would then know that nothing could call it from outside the compilation unit, so a stand-alone definition wasn't needed once it was inlined into its only caller.
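
A sketch of that change, applied to the second version of the code:

#include <stdio.h>

/* 'static' gives foo internal linkage: nothing outside this translation unit
   can call it, so once it is inlined into main the compiler does not need to
   emit a stand-alone copy of it. */
static void foo(void) {
  int fahr;
  float calc;

  for (fahr = 300; fahr >= 0; fahr = fahr - 20) {
    calc = (5.0 / 9.0) * (fahr - 32);
    printf("%3d %6.1f\n", fahr, calc);
  }
}

int main(void) {
  foo();
  return 0;
}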

Also, most of the .s lines in compiler output are comments or assembler directives, not instructions. So you're not even counting instructions.

The Godbolt compiler explorer is a good way to look at compiler asm output, with just the instructions and actually-used labels. Have a look at your code there.


Counting the total number of instructions in the executable is totally bogus if there are loops or branches. Or especially function calls inside loops, like in this case. Dynamic instruction count (how many instructions actually ran, i.e. counting each time through loops and so on) is very roughly correlated with performance, but some code runs at 4 instructions per cycle, while some runs at well below 1 (e.g. lots of div or sqrt, cache misses, and/or branch mispredicts).

To learn more about what makes code run slow or fast, see the tag wiki, especially Agner Fog's stuff.

I also recently wrote an answer to Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. Thinking of diabolical ways to make a program run slower is a fun exercise.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847