Why is C/C++ slower than assembly and other low-level languages?

Question

I wrote a program that does nothing in C++

void main(void){

}

and in assembly.

.global _start
.text

_start:
    mov $60, %rax
    xor %rdi, %rdi 
    syscall

I compiled the C code, and assembled and linked the assembly code. Then I compared the two executables with the time command.

Assembly

time ./Assembly

real    0m0.001s
user    0m0.000s
sys     0m0.000s

C

time ./C

real    0m0.002s
user    0m0.000s
sys     0m0.000s

Assembly is two times faster than C. I disassembled both executables. The assembly one contained only the same four lines I wrote. The C one contained tons of unnecessary code written to link main to _start. In main itself there were four instructions, three of which exist to make it impossible to access 'local' variables (like a function's variables) from outside the 'block' (like a function block).

push %rbp ; push the base pointer.
mov  %rsp, %rbp ; copy the stack pointer to the base pointer; the stack pointer is used for saving variables.
pop  %rbp ; 'local' variables are removed, because we pop the base pointer
retq ; ?

What is the reason for that?

Before your C program even gets to `main` it needs to load and initialize the C library and do a bunch of other stuff. You could change your assembly version to use `main` too, and link it with the C library, so you get a more fair comparison. Alternatively, you could compile the C version as standalone code. — Jester, Jun 20 '16 at 21:43
Because usually it is not critical to be as fast as possible when doing nothing. — EOF, Jun 20 '16 at 21:44
Your C++ example is not even legal in the first place. Also what EOF said, no one cares how long nonsense takes. — Baum mit Augen, Jun 20 '16 at 21:45
Theoretically It could be just as fast. File a bug report to your C/C++ compiler vendor. (Please don't, I'm being sarcastic). — Karoly Horvath, Jun 20 '16 at 21:49
I dunno...the longer I can spend doing nothing the better IMO. — Edward Strange, Jun 20 '16 at 21:49
That extra `0.001s` of time that it takes to start my C++ application is going to be a killer. Gotta start writing in assembly language. — PaulMcKenzie, Jun 20 '16 at 21:54
@JoseManuelAbarcaRodríguez: The perverse thing is how upvoted the trivial answers to the most downvoted questions tend to be. — EOF, Jun 20 '16 at 22:04
The question is a bit like asking why it takes longer to get your car out of the garage and drive to the corner shop, than it does to walk. — Weather Vane, Jun 20 '16 at 22:10
So how do you compile this nonsense? C or C++? Please first read about proper benchmarking, there is more to doing it correctly than taking the time to load/execute/terminate it. — too honest for this site, Jun 20 '16 at 22:10
@EOF: I'm surprised this got so heavily downvoted. Even though the OP's conclusions and methodology are completely bogus, there's an actual question here with an interesting answer: dynamic linking, and what the CRT boilerplate code is for. — Peter Cordes, Jun 21 '16 at 00:00
@PeterCordes: I find this question quite terrible, and the fact that you can write a nice bit of text doesn't redeem the question. If you set me off on the right topic I might not need much prodding to rave on about it, but that doesn't make the question any less lazy and misguided. — EOF, Jun 21 '16 at 00:07
@EOF: I guess I don't see it as much worse than some of the other dumb questions that get upvotes, and involve serious misconceptions. I wish this had never been asked at all, instead of being asked and then getting a not-very-accurate answer that misses the main reason for the observed behaviour. (Although the accepted answer does at least correct the misunderstanding that C is 2x slower than asm). Inaccurate answers make me crazy. — Peter Cordes, Jun 21 '16 at 00:12

templatetypedef · Accepted Answer · 2020-01-24T15:17:13.743

The amount of time required to execute the core of the program you've written is incredibly small. Figure that it consists of three or four assembly instructions, and at several gigahertz that will only require a couple of nanoseconds to run. That's such a small amount of time that it's vastly below the detection threshold for the time program, whose resolution is measured in milliseconds (remember that a millisecond is a million times longer than a nanosecond!). So in that sense, I would be very careful about making judgments about the runtime of one program as being "twice as fast" as the other; the resolution of your timer isn't high enough to say that for certain. You might just be seeing noise.

Your question, though, was why there is all this automatically generated code if nothing is going to happen. The answer is "it depends." With no optimization turned on, most compilers generate assembly code that faithfully simulates the program you wrote, possibly doing more work than is necessary. Since most C and C++ functions actually contain code that does something, need local variables, etc., a compiler wouldn't be too wrong in emitting code at the start and end of a function to set up the stack and frame pointer properly to support those variables. With optimization turned up to the max, an optimizing compiler might be smart enough to notice that this isn't necessary and to remove that code, but it's not required.

In principle, a perfect compiler would always emit the fastest code possible, but it turns out that it's impossible to build a compiler that will always do this (this has to do with things like the undecidability of the halting problem). Therefore, it's somewhat assumed that the code generated will be good - even great - but not optimal. However, it's a tradeoff. Yes, the code might not be as fast as it could possibly be, but by working in languages like C and C++ it's possible to write large and complex programs in a way that's (compared to assembly) easy to read, easy to write, and easy to maintain. We're okay with the slight performance hit because in practice it's not too bad and most optimizing compilers are good enough to make the price negligible (or even negative, if the optimizing compiler finds a better approach to solving a problem than the human!)

To summarize:

Your timing mechanism is probably not sufficient to make the conclusions that you're making. You'll need a higher-precision timer than that.
Compilers often generate unnecessary code in the interest of simplicity. Optimizing compilers often remove that code, but can't always.
We're okay paying the cost of using higher-level languages in terms of raw runtime because of the ease of development. In fact, it might actually be a net win to use a high-level language with a good optimizing compiler, since it offloads the optimization complexity.

Most of the difference is dynamic vs. static linking, and no CRT overhead in the asm version. The non-optimizing compile options aren't a factor here (maybe ~1 extra cycle to issue those `push`/`mov`/`pop` instructions, and they're not on the critical path.) — Peter Cordes, Jun 20 '16 at 23:39
@PeterCordes That's a great point and I completely agree. I'm a bit of a theoretician and so I was folding that under the umbrella of "things an optimizing compiler should get rid of if it were smart enough," though you're correct that the loading cost is a big one. — templatetypedef, Jun 21 '16 at 00:55
That's true, an optimizing compiler *could* check that the program does nothing except exit (potentially with a status based on its args), and omit libc in that case. This is basically a useless optimization that just saves some additive overhead, not multiplicative, so it's not worth adding code to gcc to detect that. — Peter Cordes, Jun 21 '16 at 01:30

score 4 · Answer 2 · edited May 23 '17 at 11:47

All the extra time from C is dynamic linker and CRT overhead. The asm program is statically linked, and just calls exit(2) (the syscall directly, not the glibc wrapper). Of course it's faster, but it's just startup overhead and doesn't tell you anything about how fast compiler-emitted code that actually does anything will run.

i.e. if you wrote some code to actually do something in C, and compiled it with gcc -O3 -march=native, you'd expect it to be ~0.001 seconds slower than a statically linked binary with no CRT overhead. (If your hand-written asm and the compiler output were both near-optimal, e.g. if you used the compiler output as a starting point for a hand-optimized version, but didn't find anything major. It's usually possible to make some improvements to compiler output, but often just to code size, probably without much effect on speed.)

If you want to call malloc or printf, then the startup overhead is not useless; it's actually necessary to initialize glibc internal data structures so that library functions don't have any overhead of checking that stuff is initialized every time they're called.

From a statically linked hand-written asm program that links glibc, you need to call __libc_init_first, __dl_tls_setup, and __libc_csu_init, in that order, before you can safely use all libc functions.

Anyway, ideally you can expect a constant time difference from the startup overhead, not a factor of 2 difference.

If you're good at writing optimal asm, you can usually do a better job than the compiler on a local scale, but compilers are really good at global optimizations. Moreover, they do it in seconds of CPU time (very cheap) instead of weeks of human effort (very precious).

It can make sense to hand-craft a critical loop, e.g. as part of a video encoder, but even video encoders (like x264, x265, and vpx) have most of the logic written in C or C++, and just call asm functions.

The extra push/mov/pop instructions are because you compiled with optimization disabled, where -fno-omit-frame-pointer is the default, and makes a stack frame even for leaf functions. gcc defaults to -fomit-frame-pointer at -O1 and higher on x86 and x86-64 (since modern debug metadata formats mean it's not needed for debugging or exception-handling stack unwinding).

If you'd told your C compiler to make fast code (-O3), instead of to compile quickly and make dumb code that works well in a debugger (-O0), you would have gotten code like this for main (from the Godbolt compiler explorer):

// this is valid C++ and C99, but C89 doesn't have an implicit return 0 in main.  
int main(void) {}

    xor     eax, eax
    ret

To learn more about assembly and how everything works, have a look at some of the links in the x86 tag wiki. Perhaps Programming From the Ground Up would be a good start; it probably explains compilers and dynamic linking.

A much shorter article is A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux, which starts with what you did, and then gets down to having _start overlap with some other ELF headers so the file can be even smaller.

Well, if you're determined to take this question seriously, I'd recommend posting the output of `strace` for the null-program compiled in gcc and linked with glibc. That'll show just *how much* extra work the sane version is doing. — EOF, Jun 21 '16 at 00:17
@EOF: Great suggestion, but I'll leave that for the OP to do himself. `strace /bin/true` is basically the same thing, since even GNU true doesn't use any system calls in parsing its args (yes, it really supports `--help` and `--version`). — Peter Cordes, Jun 21 '16 at 00:23

score 0 · Answer 3 · answered Jun 20 '16 at 21:58

Did you compile with optimizations enabled? If not, then this is invalid.
Did you consider that this is a completely trivial example that will have no real-life performance implications worth writing even a postcard about?

Please write clear maintainable code and (in 99% of cases) leave the optimization to the compiler. Please.

Jesper Juhl


Obviously he didn't have optimizations enabled, but that's not why. Much bigger is statically linked with no CRT vs. dynamically linked + CRT startup. – Peter Cordes Jun 20 '16 at 23:38
it's like comparing the time to jump onto the bike to the time you need to walk to the car's parking place + opening it + taking a seat in it, and then concluding that "driving by bike is faster than by car" which is nonsense – Tommylee2k Jun 21 '16 at 09:03
