
When we compile code and execute it, the assembly our code gets converted into stores functions in a non-sequential manner. So every time a function is called, the processor needs to throw away the instructions in the pipeline. Doesn't this affect the performance of the program?

PS: I'm not considering the time invested in developing such programs without functions; I'm asking purely about performance. Are there any ways in which compilers deal with this to reduce it?

ayushgp
  • usually jumping to functions and back isn't the problem, while jumping on conditions is ("branches") ... like each time you use if/while/for/etc... here's a pretty good posting about this: http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array/11227902#11227902 – Tommylee2k May 18 '16 at 08:11
  • In modern processors, branch misprediction is often cheaper than cache misses. While a compiler can prevent branches (and thus mispredictions) by inlining functions and unrolling loops, this greatly increases code size. While there used to be ways to help the branch predictor, [they no longer work](http://stackoverflow.com/a/1851905). –  May 18 '16 at 08:29
  • The front end should be redirected a while before the back end notices anything, but of course if there is a cache miss due to the non-locality that mechanism is defeated a bit. On the other hand, inlining a lot causes capacity misses, so there's a trade-off there. – harold May 18 '16 at 09:57
  • What kind of level of detail are you looking for in an explanation of how instruction caches work? A link to a good wikipedia page for background? I don't want to bloat my answer with a lot of generic how-caches-work stuff; it already covers a lot of ground. Do you also require an explanation of how queues between pipeline stages hide bubbles? – Peter Cordes May 21 '16 at 13:19
  • Hey @PeterCordes, it'd be great if you could point me to some links that would let me understand your answer better. Right now your answer is overwhelming for me, since I asked this question after studying the fundamentals of the Intel 8086 and ARM processors, and pipelining in them, in my class. – ayushgp May 22 '16 at 05:00
  • @ayushgp: getting back to this answer and making some edits was on my to-do list. You awarded the bounty, though, so does that mean you managed to research stuff on your own (e.g. from Agner Fog's excellent guides)? If you found any other resources really helpful, feel free to suggest an edit to add links or extra text to my answer, or just leave a comment and I'll think about how to incorporate it. I have a hard time knowing exactly what's confusing for a beginner, since I've understood caches and pipelines for a while now. Your perspective could help make this more useful for others. – Peter Cordes May 23 '16 at 17:55
  • @PeterCordes seeing that many other people agree with you, I awarded you the bounty. I have started reading Agner Fog's guides, but they too are somewhat difficult for a beginner. I'm slowly starting to get the hang of these topics, but finding good material to read about these areas, beyond what you mentioned, is a gruelling task. If you could provide some more links that I could understand, it'd be quite helpful - especially for caches, maybe a book or something similar to Fog's guides. Thanks! – ayushgp May 24 '16 at 10:17
  • @ayushgp: A bounty is your rep to award as you see fit. The criteria for awarding don't have to be the same as for an accepting answer. If I were you, I'd have held off on awarding it until you got me to add some beginner links to the answer. :P Luckily, some searching found that Wikipedia's article on the classic RISC pipeline explains this in detail, with diagrams, so it didn't take too much extra work to expand my answer to deserve the bounty you awarded. I also found a couple other things to link. – Peter Cordes May 24 '16 at 11:03

1 Answer


So every time a function is called, the processor needs to throw away the instructions in the pipeline.

No, everything after the decode stage is still good. The CPU knows not to keep decoding after an unconditional branch (like a jmp, call, or ret). Only the instructions that have been fetched but not yet decoded are the ones that shouldn't run. Until the target address is decoded from the instruction, there's nothing useful for the beginning of the pipeline to do, so you get bubbles in the pipeline until the target address is known. Decoding branch instructions as early as possible thus minimizes the penalty for taken branches.

In the classic RISC pipeline, the stages are IF ID EX MEM WB (instruction fetch, decode, execute, memory access, write-back of results to registers). So when ID decodes a branch instruction, the pipeline throws away the instruction currently being fetched in IF, and the instruction currently being decoded in ID (because it's the instruction after the branch).
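
Here's a simplified timing sketch of an unconditional jmp going through that pipeline, assuming the target becomes known at the end of ID (a simplification; real designs vary in where branches are resolved):

    cycle:          1    2    3    4    5    6
    jmp target      IF   ID   EX   MEM  WB         ; target known after ID
    (insn after)         IF   --                   ; fetched, then squashed
    target insn               IF   ID   EX   MEM   ; one-cycle bubble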

"Hazard" is the term for things that prevent a steady stream of instructions from going through the pipeline at one per clock. Branches are a Control Hazard. (Control as in flow-control, as opposed to data.)

If the branch target isn't in L1 I-cache, the pipeline will have to wait for instructions to start streaming in from memory before the IF pipeline stage can produce a fetched instruction. I-cache misses always create a pipeline bubble. Prefetching usually avoids this for non-branching code.


More complex CPUs decode far enough ahead to detect branches and re-steer fetch soon enough to hide this bubble. A queue of already-decoded instructions between the front end and the rest of the pipeline can also absorb short fetch bubbles.

Also, instead of actually decoding to detect branch instructions, the CPU can check every instruction address against a "Branch Target Buffer" cache. If you get a hit, you know the instruction is a branch even though you haven't decoded it yet. The BTB also holds the target address, so you can start fetching from there right away (if it's an unconditional branch or your CPU supports speculative execution based on branch prediction).
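
Conceptually, a BTB is just a small tag-checked cache indexed by the fetch address. Here's a toy C sketch of a direct-mapped lookup; the sizes, indexing scheme, and names are all made up for illustration and don't correspond to any real CPU's design:

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy direct-mapped BTB; everything here is illustrative. */
    #define BTB_ENTRIES 512

    struct btb_entry {
        uint64_t tag;      /* address of a branch seen earlier */
        uint64_t target;   /* where that branch went */
        bool     valid;
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Checked in parallel with instruction fetch: on a hit, the front
       end can redirect to the cached target before decode even sees
       the instruction. */
    static bool btb_lookup(uint64_t fetch_pc, uint64_t *next_fetch)
    {
        struct btb_entry *e = &btb[(fetch_pc / 4) % BTB_ENTRIES];
        if (e->valid && e->tag == fetch_pc) {
            *next_fetch = e->target;  /* known branch: fetch its target */
            return true;
        }
        return false;                 /* not a known branch: fall through */
    }

Real BTBs are typically set-associative and use partial tags, and for conditional branches the cached target is combined with a separate direction predictor.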


ret is actually the harder case: the return address is in a register or on the stack, not encoded directly into the instruction. It's an unconditional indirect branch. Modern x86 CPUs maintain an internal return-address predictor stack, and perform very badly when you mismatch call/ret instructions. E.g. call label / label: pop ebx (used by position-independent 32-bit code to get EIP into EBX) is terrible: it will cause a mispredict for the next 15 or so rets up the call tree.
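
For illustration, here's that unbalanced idiom next to the balanced thunk that compilers emit instead (the labels here are hypothetical; gcc's real helper for this is named __x86.get_pc_thunk.bx):

    ; Bad: a call with no matching ret. The return-address predictor
    ; stack now holds an entry that no ret will ever pop, so the next
    ; ~15 rets up the call tree mispredict.
            call    .Lnext          ; predictor pushes return address
    .Lnext: pop     ebx             ; ebx = EIP, but nothing "returned"

    ; Good: a balanced call/ret pair keeps the predictor in sync.
            call    get_pc_thunk_bx ; predictor pushes return address
            ; ... ebx now holds the address of this line ...

    get_pc_thunk_bx:
            mov     ebx, [esp]      ; load our own return address
            ret                     ; matching ret pops the predictor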

I think I've read that a return-address predictor stack is used by some other non-x86 microarchitectures.

See Agner Fog's microarchitecture pdf to learn more about how x86 CPUs behave (also see the x86 tag wiki), or read a computer architecture textbook to learn about simple RISC pipelines.

For more about caches and memory (mostly focused on data caching / prefetching), see Ulrich Drepper's What Every Programmer Should Know About Memory.


An unconditional branch is quite cheap: usually just a couple of cycles at worst (not including I-cache misses).

The big cost of a function call is when the compiler can't see the definition of the target function and has to assume it clobbers all the call-clobbered registers in the calling convention. (In x86-64 System V, all the float/vector registers, and about 8 integer registers.) This requires either spilling live data to memory or keeping it in call-preserved registers; but that means the called function has to save/restore those registers so it doesn't break the caller.
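
A minimal C sketch of that cost (ext is a hypothetical function defined in another translation unit):

    /* ext() is opaque to the compiler, so it must be assumed to
       clobber every call-clobbered register in the convention. */
    int ext(int);

    int caller(int a, int b)
    {
        int t = a * 7;   /* t is live across the call: it must either be
                            spilled to the stack around the call, or kept
                            in a call-preserved register like rbx (which
                            ext() would then save/restore if it uses it). */
        return ext(b) + t;
    }

Either way there's extra work on one side of the call: a spill/reload in the caller, or a save/restore in the callee.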

Inter-procedural optimization, letting functions take advantage of knowing which registers other functions actually clobber and which they don't, is something compilers can do within the same compilation unit, or even across compilation units with link-time whole-program optimization (e.g. gcc's -flto). But it can't extend across dynamic-linking boundaries, because the compiler isn't allowed to emit code that would break with a differently-compiled version of the same shared library.


Are there any ways in which compilers deal with this to reduce it?

They inline small functions, or even large static functions that are only called once.

e.g.

int foo(void) { return 1; }

    mov     eax, 1          # foo: just put the return value in eax
    ret

int bar(int x) { return foo() + x; }

    lea     eax, [rdi+1]    # bar: foo() is inlined, so there's no call
    ret                     # at all; x+1 is computed with a single lea

As @harold points out, overdoing it with inlining can cause cache misses, too, because it inflates your code size so much that not all of your hot code fits in cache.

Intel SnB-family designs have a small but very fast uop cache that caches decoded instructions. It only holds at most 1536 uops IIRC, in lines of 6 uops each. Executing from the uop cache instead of the decoders shortens the branch-mispredict penalty from something like 19 to 15 cycles (those exact numbers probably aren't right for any specific uarch). There's also a significant frontend throughput boost compared to the decoders, especially for long instructions, which are common in vector code.

Peter Cordes