to how many subroutines can x86 processors call?

Question

i'm writing a small program to print a polygon with printf("\219") to just see if whatever i'm doing is right for my kernel. but it needs to call many functions and i don't know whether x86 processors can accept that many subroutines and i can't find results in google. so my question is will it accept so many function calls and what is the maximum. (i mean something like this:-)

function a() {b();}
function b() {c();}
function c() {d();}
...

i've used 5 such levels (you know what i mean, right?)

code you write is not instructions for your cpu. Though this also isnt c++ code. You may want to read about function inlining — 463035818_is_not_an_ai, Sep 14 '21 at 08:27
https://isocpp.org/wiki/faq/inline-functions, but I am not sure if you are really asking about c++.... (and understanding `inline` in c++ requires to go through some history stuff) — 463035818_is_not_an_ai, Sep 14 '21 at 08:31
Inlining a function in C++ does not guarantee that it will be inlined in assembly. — Daniel Kleinstein, Sep 14 '21 at 08:35
I think a modern x86 (not the old 8086) can handle more functions then you can type in a few hours. Should be fine :) — Pepijn Kramer, Sep 14 '21 at 08:35
If you use a "bad" recursion then you can cause a stack overflow and see how many function call frames are on the stack. I've seen +1000, but this limit depends on so many factors that it can vary from 10 to ~50000 even on same machine. — Marek R, Sep 14 '21 at 08:37
It varies by platform. I expect it to be in the 1000+ range, but I have worked on a platform that was limited to 127. So the answer is "it depends". Programs that get in trouble usually have a bug that causes a calling sequence that repeats indefinitely or single function recursion that exhausts the available stack. As per the namesake of this site: a **stack overflow**. — Eljay, Sep 14 '21 at 15:12
On a typical x86 machine with a modern operating system, a reasonable rule of thumb is "thousands yes, hundreds of thousands no". — Nate Eldredge, Sep 14 '21 at 19:41

Daniel Kleinstein · Accepted Answer · 2021-09-14T08:45:17.320

6

Your function depth is not limited by the processor, but by the size of your stack. Behind the scenes, calls to C++ functions usually translate to call instructions in x86, which push four (or eight, for x64 programs) bytes onto your program's stack for the return pointer. Sometimes calls are optimized and don't touch the stack at all. Functions might also push additional bytes (e.g. local function state) onto the stack.

To get the exact number of functions you can call, you need to disassemble your code to calculate the number of bytes each function pushes to the stack (minimum four/eight because of the return address, but likely many more), then find the maximum stack size and divide it by the function frame size.

edited Sep 14 '21 at 08:45

answered Sep 14 '21 at 08:31

Daniel Kleinstein

5,262
1
22
39

He doesn't have a recursion :P – Marek R Sep 14 '21 at 08:33
@MarekR Woops, fixed – Daniel Kleinstein Sep 14 '21 at 08:34
1

fwiw, there is not necessarily a single `call` instruction for OPs example: https://godbolt.org/z/xjabadoKr. Not *every* call to a function in C++ code translates to a `call` instruction – 463035818_is_not_an_ai Sep 14 '21 at 08:35
@463035818_is_not_a_number You're right of course, I changed the wording a bit – Daniel Kleinstein Sep 14 '21 at 08:42
@463035818_is_not_a_number: Right, to stop it from inlining or optimizing them into `jmp` tailcalls, you can use `-O1` to compile, and `__attribute__((noipa))` on each function to stop the optimizer from looking inside it while optimizing other functions. See my answer. (I started working on it before this answer appeared; maybe wouldn't have started if this was already up, but oh well.) – Peter Cordes Sep 14 '21 at 08:56
@PeterCordes Maybe I wouldn't have bothered writing an answer if I knew that Peter Cordes was working on one at the same time :) (though it's a fair guess for any x86-tagged question) – Daniel Kleinstein Sep 14 '21 at 09:01
@DanielKleinstein: I skip a lot of super-beginner ones these days; I could easily have gone either way on deciding to answer this. Although I do still tend to look at and answer questions about concepts and fundamentals, much moreso than debugging questions. What caught my interest here was seeing if I could get GCC to actually emit the asm I wanted for the examples I had in mind. I like your answer, gets the key points across concisely. (And I think my answer is a nice supplement, for folks who want to see more nitty-gritty detail.) Since yours exists, I don't have to polish mine as much – Peter Cordes Sep 14 '21 at 09:25

Peter Cordes · Answer 2 · 2021-09-14T09:21:37.060

call and ret instructions aren't special; the CPU doesn't know how deeply nested it is. (And doesn't know the difference between calling another function or being recursive.) As described in my answer here, functions are a high-level concept that asm gives you the tools to implement.

All the CPU knows is whether pushing a return address causes a page fault or not, if you run out of stack space. (A "stack overflow"). Usually the result of recursion going too deep, or a huge array in automatic storage (a C++ local var) or alloca.

(As mentioned in comments, not every C function call results in a call instruction; inlining can fully optimize away the fact that it's a separate function. You want small functions to inline, and the design of the template classes in the C++ standard library depends on this for efficiency. Also, a tailcall can be just a jmp, having the next function take over this stack space instead of getting new space.)

Linux typically uses 8 MiB stacks for user-space processes. (ulimit -s). Windows is typically 1 MiB. Inside a kernel, you often use smaller stacks. e.g. Linux kernel thread-stacks are currently 16 kiB for x86-64. Previously as small as 4 kiB (one page) for 32-bit x86, but some code has more local vars that take up stack space.

Related: How does the stack work in assembly language?. When ret executes, it just pops into the program counter. It's up to the programmer (or compiler) to create asm that runs ret when the stack pointer is pointing at somewhere you want to jump. (Typically a return address.)

In modern Linux systems, the smallest stack frame is 16 bytes, because the ABI specifies maintaining 16-byte stack alignment before a call. So best case you can have a call depth of 512k before you overflow the stack. (Unless you're in a thread that was started with an extra large thread-stack).

If you're in 32-bit mode with an old version of the i386 System V ABI that only required 4-byte stack alignment (like gcc -mpreferred-stack-boundary=2 instead of the default 4), with functions that just called without using any other stack space, 8 MiB of stack would give you 8 MiB / 4B = 2 Mi call depth.

In real life, some of that 8MiB space on the main thread's stack is used up by env vars and argv[] already on the stack at the process entry point (copied there by the kernel), and then a bit more as _start calls a function that calls main.

To make this able to actually return at the end, instead of just eventually faulting, you'd need either a huge chain, or some recursion with a termination condition, like

void recurse(int n) {
   if (n == 1)
       return;
   recurse(n - 1);
}

and compile with some optimization but not enough to get the compiler to turn it into a do{}while(--n); loop or optimize away.

If you wanted all different functions, that's ok, the code-size limit is at least 2GiB, and call rel32 / ret takes a total of 6 bytes. (Un-optimized code would default to push ebp or push rbp as well, so you'd have to avoid that for 32-bit code if you wanted to meet that 4-byte stack-frame goal of having all your stack space full with just return addresses).

For example with GCC (see How to remove "noise" from GCC/clang assembly output? for the __attribute__((noipa)) option)

__attribute__((noipa))  // don't let other functions even notice that this is empty, let alone inline it
void foo(void){}

__attribute__((noipa))
void bar(){
    foo();
}

void baz(){
    bar();
}

compiles with GCC (Godbolt compiler explorer) to this 32-bit asm which just calls without using any stack space for anything else:

## GCC11.2  -O1 -m32 -mpreferred-stack-boundary=2
foo():
        ret
bar():
        call    foo()
        ret
baz():
        call    bar()
        ret

g++ -O2 would optimize these call/ret functions into a tailcall like jmp foo, which doesn't push a return address. That's why I only used -O1 or -Og. Of course in real life you do want things to inline and optimize away; defeating that is just for this silly computer trick to achieve the longest finite call depth that actually would crash if you made it longer.

You can repeat this pattern indefinitely; GCC allows long symbol names, so it's not a problem to have many different unique function names. You can split across multiple files, with just a prototype for one of the functions in the other file.

If you reduce the -falign-functions tuning setting, you can probably get it down to either 6 bytes per function (no padding) or 8 (align by 8 thus 2 bytes of padding), down from the default of aligning each function label by 16, wasting 10 bytes per function.

Recursive:

I got GCC to make asm that recurses with no gaps between return address:

void recurse(int n){
    if (!--n)
        return;
    recurse(n);
}

## G++11.2  -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
        sub     eax, 1       # --n
        jne     .L6          # skip over the ret if the result is non-zero
.L4:
        ret
.L6:
        call    recurse(int)   # push a 4-byte ret addr and jump to top
        jmp     .L4          # silly compiler, should put another ret here.  But we told it not to optimize too much

Note the -mregparm=3 option, so the first up-to-3 args are passed in registers, instead of on the stack in the inefficient i386 System V calling convention. (It was designed a long time ago; the x86-64 SysV calling convention is much better.)

With -O2, this function optimizes away to just a ret. (It turns the call into a tailcall which becomes a loop, and then it can see it's a non-infinite loop with no side effects so it just removes it.)

Of course in real life you want optimizations like this. Recursion in asm sucks compared to loops. If you're worried about robust code and call depth, don't write recursive functions in the first place, unless you know recursion depth will be shallow. If a debug build doesn't convert your recursion to iteration, you don't want your kernel to crash.

Just for fun, I got GCC to tighten up the asm even at -O1 by convincing it to do the conditional branch over the call, to only have one ret instruction in the function so tail-duplication wouldn't be relevant anyway. And it means the fast path (recursion) involves a not-taken macro-fused conditional branch plus a call.

void recurse(int n){
    if (__builtin_expect(!!--n, 1))
        recurse(n);
}

Godbolt

## GCC 11.2  -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
        sub     eax, 1
        je      .L1
        call    recurse(int)
.L1:
        ret

to how many subroutines can x86 processors call?

2 Answers2

Recursive: