So I recently learned about the `perf` command in Linux. I decided to run some experiments, so I created an empty C program and measured how many instructions it took to run:

echo 'int main(){}'>emptyprogram.c && gcc -O3 emptyprogram.c -o empty
perf stat ./empty

This was the output:

 Performance counter stats for './empty':

      0.341833      task-clock (msec)         #    0.678 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           112      page-faults               #    0.328 M/sec                  
     1,187,561      cycles                    #    3.474 GHz                    
     1,550,924      instructions              #    1.31  insn per cycle         
       293,281      branches                  #  857.966 M/sec                  
         4,942      branch-misses             #    1.69% of all branches        

   0.000504121 seconds time elapsed

Why is it using so many instructions to run a program that does literally nothing? I thought that maybe this was some baseline number of instructions necessary to load a program into the OS, so I looked for a minimal executable written in assembly. I found a 142-byte executable that outputs "Hi World" here (http://timelessname.com/elfbin/).

Running `perf stat` on the 142-byte hello executable, I get:

Hi World

 Performance counter stats for './hello':

      0.069185      task-clock (msec)         #    0.203 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
             3      page-faults               #    0.043 M/sec                  
       126,942      cycles                    #    1.835 GHz                    
       116,492      instructions              #    0.92  insn per cycle         
        15,585      branches                  #  225.266 M/sec                  
         1,008      branch-misses             #    6.47% of all branches        

   0.000340627 seconds time elapsed

This still seems a lot higher than I'd expect, but we can accept it as a baseline. In that case, why did running `empty` take roughly 13x as many instructions (1,550,924 vs. 116,492)? What did those instructions do? And if they're some sort of overhead, why is there so much variation in overhead between a C program and the hello-world assembly program?

Alecto Irene Perez
    Your "empty" program does dynamically load shared libraries (at least the clib I guess - I guess you have 64b linux, where by default the binary is PIE and using shared libs), and then initializes the C-runtime environment, then finally it gives control to your empty `main(...)` which just returns. And then it has to release + clean-up everything from the C-runtime, flush the buffers, etc... – Ped7g Dec 09 '17 at 01:25
  • Can you add the results of `gcc -O3 -static emptyprogram.c -o empty1; perf stat ./empty1`? – Mark Plotnick Dec 09 '17 at 01:45
  • So by using C instead of assembly you waste 0.00016 seconds each time you run the program? – Bo Persson Dec 09 '17 at 06:20
  • There have been a few previous questions about `perf` instruction counts. [This one](https://stackoverflow.com/questions/26312127/perf-stat-gives-different-number-of-instruction-for-every-run?rq=1) suggests that user vs. kernel+user counts might explain the high counts for a bare static executable. Did you try a static executable that just exits without calling `sys_write`, too? – Peter Cordes Dec 09 '17 at 07:24
  • Thank you, I will try those suggestions – Alecto Irene Perez Dec 09 '17 at 08:38
  • 3
  • When I looked at empty programs written in C, I found that the dynamic linker `ld.so` accounted for most of the instruction count, as @Ped7g suggests. You could try static linking to see the difference while leaving the C-runtime init in place. – BeeOnRope Dec 09 '17 at 16:23
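
Putting the commenters' suggestions together: a quick way to separate the dynamic-linker and C-runtime cost from the kernel's bare exec overhead is a static binary with no CRT at all, just a raw exit syscall. A minimal sketch, assuming x86-64 Linux and gcc (the file name exitonly.c is mine, not from the thread):

/* exitonly.c - static, no C runtime, no dynamic linker; just exit(0).
 * Build and measure:
 *   gcc -O3 -nostdlib -static exitonly.c -o exitonly
 *   perf stat ./exitonly
 */
void _start(void) {
    __asm__ volatile(
        "mov $60, %%eax\n\t"   /* 60 = SYS_exit on x86-64 */
        "xor %%edi, %%edi\n\t" /* exit status 0 */
        "syscall"
        : : : "rax", "rdi");
}

Comparing its `perf stat` numbers against the dynamically linked `empty` binary should show roughly how much of the ~1.5M instructions belongs to ld.so and CRT init rather than to the kernel.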

1 Answer


It's hardly fair to claim that it "does literally nothing". Yes, at the app level you chose to make the whole thing a giant no-op for your microbenchmark, and that's fine. But no, down beneath the covers at the system level it's hardly "nothing". You asked Linux to fork off a brand-new execution environment, initialize it, and connect it to the environment. You called very few glibc functions, but dynamic linking is non-trivial, and after a million instructions your process was ready to demand-fault in printf() and friends, and to efficiently bring in libs you might have linked against or dlopen()'ed.
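
If you want to see that startup work directly, glibc's dynamic linker has a built-in statistics hook, and perf's :u/:k event modifiers split user-space from kernel counts (assuming glibc and a perf built with those modifiers):

LD_DEBUG=statistics ./empty
perf stat -e instructions:u -e instructions:k ./empty

The first prints how much of startup ld.so spent on relocations and symbol lookups; the second shows how many of the retired instructions ran in user space (mostly ld.so and CRT init) versus in the kernel.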

This is not the sort of microbenchmark that implementors are likely to optimize for. What would be of interest is identifying "expensive" aspects of fork/exec that some use cases never need, and so might be #ifdef'd out (or have their execution short-circuited) in very specific situations. Lazy evaluation of resolv.conf is one example: a process never pays that overhead if it never interacts with IP servers.
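
One practical way to hunt for those expensive aspects is to count what the process asks the kernel to do during startup, e.g. with strace's syscall summary (assuming strace is installed):

strace -c ./empty

For the empty dynamically linked binary, the summary is typically dominated by the mmap/openat/read/mprotect calls the dynamic linker makes before main() ever runs.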

J_H
  • Related, near duplicate: [Number of executed Instructions different for Hello World program Nasm Assembly and C](https://stackoverflow.com/a/35210404). Also [How do I determine the number of x86 machine instructions executed in a C program?](https://stackoverflow.com/q/54355631) has some explanation of CRT startup and dynamic linking, vs. pure asm that just makes an _exit syscall, with `perf stat` counting 3 instructions for it (instead of the actual 2). – Peter Cordes Feb 08 '21 at 21:22