
Question:

Is accessing the stack the same speed as accessing memory?

For example, I could choose to do some work within the stack, or I could do work directly with a labelled location in memory.

So, specifically: is `push ax` the same speed as `mov [bx], ax`? Likewise, is `pop ax` the same speed as `mov ax, [bx]`? (Assume `bx` holds a location in near memory.)

Motivation for Question:

It is common in C to discourage trivial functions that take parameters.

I've always thought that this is not only because the parameters must be pushed onto the stack and then popped off once the function returns, but also because the function call itself must preserve the CPU's context, which means still more stack usage.

But assuming one knows the answer to the headlined question, it should be possible to quantify the overhead that a function call incurs to set itself up (push / pop / preserve context, etc.) in terms of an equivalent number of direct memory accesses. Hence the headlined question.
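For concreteness, here is a rough sketch of what a 16-bit compiler might emit for a trivial two-argument function call (the exact sequence varies by compiler and calling convention; `add_two`, `a`, and `b` are made-up names for illustration):

```
; Caller: C-style calling convention, arguments pushed right to left
push word [b]
push word [a]
call add_two        ; also pushes the return address
add  sp, 4          ; caller removes the two word-sized arguments

; Callee
add_two:
    push bp         ; preserve the caller's frame pointer (context)
    mov  bp, sp     ; establish this call's stack frame
    mov  ax, [bp+4] ; first argument
    add  ax, [bp+6] ; second argument; result is returned in ax
    pop  bp         ; restore the caller's frame pointer
    ret
```

Tallying the stack traffic here (two argument pushes, the return address pushed by `call` and popped by `ret`, the push/pop of `bp`, and the two `[bp+...]` reads) is exactly the kind of count the headlined question would let you convert into direct-memory-access equivalents.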


(Edit: Clarification: `near` above is as opposed to `far` in the segmented memory model of the 16-bit x86 architecture.)
Assad Ebrahim
  • Wow. I'm an explorer. I just found a good, non-n00b question on StackOverflow. Celebrating my exploration with champagne and an upvote! – Oct 07 '12 at 06:11
  • I always considered push/pop's decrement/increment operations on ESP as overhead compared to mov... but I guess there's a lot more to it. – loxxy Oct 07 '12 at 06:30

2 Answers


Nowadays your C compiler can outsmart you. It may inline simple functions, and if it does that, there will be no function call or return. Nor, perhaps, will there be any additional stack manipulation related to passing and accessing formal function parameters (or an equivalent operation when the function is inlined but the available registers are exhausted), if everything can be done in registers. Better yet, if the result is a constant value, the compiler can see that and take advantage of it.
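As an illustration (a sketch, not the output of any particular compiler), suppose the C source is `int sq(int x) { return x * x; }` followed by `y = sq(5);`. The same statement could compile either way:

```
; Without inlining: a real call, with argument, return, and result traffic
mov  ax, 5
push ax
call sq
add  sp, 2
mov  [y], ax

; With inlining and constant folding: the call disappears entirely
mov  word [y], 25
```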

Function calls themselves can be relatively cheap (but not necessarily zero-cost) on modern CPUs, if they're repeated and if there's a separate instruction cache and various prediction mechanisms helping with efficient code execution.

Other than that, I'd expect the performance implications of the choice "local var vs global var" to depend on the memory usage patterns. If there's a memory cache in the CPU, the stack is likely to be in that cache, unless you allocate and deallocate large arrays or structures on it or have deep function calls or deep recursion, causing cache misses. If the global variable of interest is accessed often, or if its neighbors are accessed often, I'd expect that variable to be in the cache most of the time as well. Again, if you're accessing large spans of memory that can't fit into the cache, you'll have cache misses and possibly reduced performance (and there may or may not be a better, cache-friendly way of doing what you want to do).

If the hardware is pretty dumb (no or small caches, no prediction, no instruction reordering, no speculative execution, nothing), clearly you want to reduce the memory pressure and the number of function calls, because each and every one will count.

Yet another factor is instruction length and decoding. Instructions to access an on-stack location (relative to the stack pointer) can be shorter than instructions to access an arbitrary memory location at a given address. Shorter instructions may be decoded and executed faster.
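For example, in 16-bit x86 code (the `mov [bx], ax` and `mov [0x1234], ax` encodings below match the `debug` output quoted in the comments; the others are standard 8086 encodings):

```
push ax            ; 50           -- 1 byte
mov  [bx], ax      ; 89 07        -- 2 bytes, register-indirect
mov  [bp-2], ax    ; 89 46 FE     -- 3 bytes, on-stack local via the frame pointer
mov  [0x1234], ax  ; A3 34 12     -- 3 bytes, direct address (short ax-only form)
mov  [0x1234], cx  ; 89 0E 34 12  -- 4 bytes, direct address, general form
```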

I'd say there's no definitive answer for all cases because performance depends on:

  • your hardware
  • your compiler
  • your program and its memory accessing patterns
Alexey Frunze
  • Thanks Alexey - good point about local var (stack, correct?) vs. global var (memory, correct?) -- hadn't thought of it that way. – Assad Ebrahim Oct 07 '12 at 07:12
  • Re: arbitrary memory location -- that's why I'm restricting the consideration to `near` memory. Does this make a difference? – Assad Ebrahim Oct 07 '12 at 07:14
  • Re: your point about varying instruction length & decoding time -- do you mean a difference between for example `mov [bx], ax` vs. `mov [loc], ax`, assuming `loc equ 0xfffd` (or some near offset)? (Thanks, as always, for your really great answers!!) – Assad Ebrahim Oct 07 '12 at 07:19
  • Taking x86 as an example: the 16-bit compilers typically pushed all parameters onto the stack, called, created a stack frame, executed, removed the stack frame, and returned. For the 32- and 64-bit compilers, the first few parameters are passed in registers, the call is made, and a stack frame may or may not be created/removed. The only thing the compiler needs to conform to is the run-time environment: what happens inside is up to the compiler's magic to handle, and some compilers are really fantastic. – Olof Forshell Oct 07 '12 at 07:35
  • @AKE: Yes, `mov [bx], ax` will be shorter than `mov [0x1234], ax`. – Alexey Frunze Oct 07 '12 at 07:48
  • @AlexeyFrunze: I believe `near` memory is just an offset on the same page of memory. (As opposed to `far` memory) So in principle it should be reachable without having to change any of the segment pointers. – Assad Ebrahim Oct 07 '12 at 07:54
  • @AKE What `same` page? The same as what? Segment registers are unrelated to page translation. – Alexey Frunze Oct 07 '12 at 07:54
  • @AlexeyFrunze: Agree `mov [bx], ax` is a shorter instruction (by 1 byte). But would this instruction execute faster than `mov [0x1234], ax` given that the offset for both instructions is the same, and the only difference is whether it is stored in register or given as immediate? (For reference, out of `debug`: the machine code for the first instruction is `8907h` whereas for `mov [0x1234], ax` it is `A3 3412h`.) – Assad Ebrahim Oct 07 '12 at 08:02
  • @AlexeyFrunze: re: `near` and `far` -- I'm referring to segmented memory model. See here for description of `far` pointer needing segment selector. http://en.wikipedia.org/wiki/Far_pointer – Assad Ebrahim Oct 07 '12 at 08:05
  • It may or it may not be faster. It may be the same when executed alone (which is quite hard to do, right? :), but it may be faster depending on the surrounding instructions. What if that single byte is enough to cause a cache miss? – Alexey Frunze Oct 07 '12 at 08:05
  • Ah! -- so these slight differences become immaterial if your hardware is not "dumb" (to your point in the answer), and most modern hardware "is smart" and has all of the non-deterministic optimisation fanciness. Ok, I see. Thanks! – Assad Ebrahim Oct 07 '12 at 08:10
  • @AKE You should have clarified that. Some never programmed for 16-bit DOS and wouldn't have any idea what near or far could possibly mean. I decided not to make an assumption here. There may or may not be a penalty for using a segment override prefix. The CPU caches inside itself enough information about the segment descriptors for the segments to which cs, ds, es, fs, gs, ss point, so if the segment register is already loaded with the right value, there may be no difference. – Alexey Frunze Oct 07 '12 at 08:10
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/17667/discussion-between-ake-and-alexey-frunze) – Assad Ebrahim Oct 07 '12 at 08:10
  • On the x86, near memory is memory accessed through the default segment registers: cs for execution, ds for data, and ss for the stack. Some instructions also use es as a default register. Any time code accesses memory with an explicit segment register override (fetching data from a non-default segment) or an implicit non-default segment register (far calls), you are working far. To sum it up: near is intra-segment and far is inter-segment. – Olof Forshell Oct 07 '12 at 08:13
  • @OlofForshell I know how that works, I just wanted to clarify it's that thing or something else. – Alexey Frunze Oct 07 '12 at 08:14

For the clock-cycle-curious...

For those who would like to see specific clock cycles, instruction / latency tables for a variety of modern x86 and x86-64 CPUs are available here (thanks to hirschhornsalz for pointing these out).

You then get, on a Pentium 4 chip:

  • `push ax` and `mov [bx], ax` (red boxed) are virtually identical in their efficiency, with identical latencies and throughputs.
  • `pop ax` and `mov ax, [bx]` (blue boxed) are similarly efficient, with identical throughputs, despite `mov ax, [bx]` having twice the latency of `pop ax`.

(Image: Pentium 4 instruction timing table, with the rows for the instructions above boxed in red and blue.)

As for the follow-on question in the comments (the 3rd comment):

  • indirect addressing (i.e. `mov [bx], ax`) is not materially different from direct addressing (i.e. `mov [loc], ax`), where `loc` is an assemble-time constant address, e.g. `loc equ 0xfffd`.

Conclusion: Combine this with Alexey's thorough answer, and there's a pretty solid case for the efficiency of using the stack and letting the compiler decide when a function should be inlined.

(Side note: In fact, even as far back as the 8086 from 1978, using the stack was no less efficient than the corresponding `mov`s to memory, as can be seen from these old 8086 instruction timing tables.)


Understanding Latency & Throughput

A bit more background may be needed to understand the timing tables for modern CPUs: in particular, the distinction between an instruction's latency and its throughput.
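In short: latency is how many cycles a dependent instruction must wait for a result, while (reciprocal) throughput is how often independent instances of an instruction can be started. A small illustration (my own sketch; no specific cycle counts claimed):

```
; Dependency chain: each add needs the previous result,
; so this sequence is paced by the instruction's LATENCY.
add ax, ax
add ax, ax
add ax, ax

; Independent instructions: a pipelined, superscalar CPU can
; overlap these, so this sequence is paced by THROUGHPUT.
add ax, 1
add bx, 1
add cx, 1
```

This distinction is why `mov ax, [bx]` can match `pop ax` in throughput despite having twice the latency, as noted above.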

Assad Ebrahim