Make a variable argument function callee cleanup

Question

Suppose I have a function:

#include <stdarg.h>

int sumN(int n, ...)
{
    int sum = 0;
    va_list vl;
    va_start(vl, n);
    for (int i = 0; i < n; i++)
        sum += va_arg(vl, int);

    va_end(vl);
    return sum;
}

It is called as sumN(3, 10, 20, 30). The function is cdecl, which means caller cleanup, so the call site ends up as something like:

; Push arguments right-to-left
push 30
push 20
push 10
push 3
call sumN
add esp, 16 ; Remove arguments from stack (equivalent to 4 pops)

For regular functions that take a fixed number of arguments, the callee can perform the cleanup, as part of the ret instruction (e.g. ret 16). That doesn't work here because the callee can't know how many arguments were pushed - I could call it as sumN(1, 10, 20, 30, 40, 50); and cause a stack corruption.

Now, I want to do it anyway. Maybe I have a tool that parses the source code before the build and makes sure all calls are legitimate. And I'm calling sumN() 50k times in my codebase, so the extra size from the last instruction adds up.

For the above implementation, it's easily done in assembly, but if it were a printf function or something where the logic to figure out the size is a bit more complex, that's no longer an option. Still, I could do some inline assembly or something and fix the implementation of sumN to pop the stack. But if anyone has a better solution, that's very welcome.

The big question, however, is how to tell the compiler that the function is callee cleanup when it has ... in its declaration? How to prevent the compiler from generating the add esp, 16 instruction?

Ideally I need this for msvc, gcc and clang, but msvc is a priority.

Related: Can stdcall have a variable arguments?

You didn't say which compiler, but I doubt any of them can be told to omit the caller cleanup. The most you can do is specify calling conventions and unfortunately for you, that forces varargs to be caller cleanup. AFAIK. — Jester, Feb 25 '16 at 12:09
@Jester most important for msvc, but ideally also for clang and gcc. If I specify for example stdcall at prototype, the compiler ignores it (see answer to linked question) — mtijanic, Feb 25 '16 at 12:11
I'd say it doesn't ignore it. It's just that stdcall convention for varargs is also caller-cleanup. — Jester, Feb 25 '16 at 12:12
This may help you: http://stackoverflow.com/questions/21501222/finding-stack-frame-size — Jean-Baptiste Yunès, Feb 25 '16 at 12:20
This will break the ABI; implications are worse than only disallowing using the libc. XY-problem? What do you actually want to accomplish? — too honest for this site, Feb 25 '16 at 12:29
@Olaf I'm in kernelmode, no external dependencies. And this is an internal function, never exported anywhere else. I'm trying to log various events throughout the code, but it inflates the code size significantly - For kernelmode nonpaged code, that's a problem. MSFT especially imposes some very strict limits here. — mtijanic, Feb 25 '16 at 12:37
The compiler does not know a "kernel mode". And a kernel also has internal dependencies. I strongly recommend researching the implications more before fiddling with the ABI. Also note that arguments are not necessarily passed on the stack. Modern ABIs are much more complicated, even for variadic functions, and maintenance also becomes a problem. You still did not answer why you want that. — too honest for this site, Feb 25 '16 at 12:48
@Olaf I did: "I'm trying to log various events throughout the code". And other than being harder for engineers to understand (least astonishment principle and all), I don't see a functional problem with it. There is one function that has a weird ABI. It is always called in the same fashion - `LOG(...)`, where LOG is actually a macro that does some magic to minimize code size per call. I agree that an ABI tends to be more complicated, but a disassembly of real code shows that on build configuration X, it's not. So, on X, I want to do it differently. Can you give an example of what could break? — mtijanic, Feb 25 '16 at 12:53
@MargaretBloom It would, yes, in terms of instructions executed/speed. But the function is called in many, MANY places, which means many, MANY copies of that single cleanup instruction in the code, compared to even 100 extra instructions for callee cleanup. That inflates the code size, which has all sorts of other consequences. I care about final binary code size more than the actual instructions executed. — mtijanic, Feb 25 '16 at 12:55
You claim that "it's easily done in assembly", but I don't see how. As long as the return address is below the arguments on the stack, the only way the callee can adjust the stack pointer to pop the arguments is by `ret N`, which, as you say, doesn't work for a variable number of arguments. — EOF, Feb 25 '16 at 13:32
@EOF On the callee side, you can't use `ret N` because the `ret` instruction requires an immediate operand, true. But, you can do something like `pop eax; add esp, N; push eax; ret;` - basically take the return address in a register, change the stack size as needed, then push the return address back and do a normal ret. It's a bit tricky, but it is doable. — mtijanic, Feb 25 '16 at 13:36
@mtijanic: Have you tested whether this interferes with the hardware return stack / return address prediction in modern x86 microarchitectures? If it does, this trick will make the code drastically *slower*. — EOF, Feb 25 '16 at 13:39
I was looking for the code for `dev_notice` in linux source code but did not find any. It'll be interesting to see how they handled it. — user3528438, Feb 25 '16 at 13:56
@EOF I had done something like this before, and it worked correctly. I didn't profile it, so it is quite likely it was slower. However, I'm willing to make the speed tradeoff if it helps cut down on the code size. I have a strict limit on maximum code size, while the speed is much more flexible. — mtijanic, Feb 25 '16 at 13:58
@EOF: I wondered the same thing about the return-address predictor. I think it's actually ok: the return-address predictor only looks for `call` / `ret` pairs, and uses a small internal stack to hold return addresses. It doesn't care about `rsp`, or the data that's actually read from memory by `ret`. As long as it eventually matches the predicted RIP (which it will if you copied it correctly), the correct path matches the predicted path, so no stall. — Peter Cordes, Feb 26 '16 at 04:25

score 2 · Accepted Answer · answered Feb 25 '16 at 14:01

What you can do is make a number of helper functions. Each helper function would take a fixed number of elements, and picking which helper function to call would be done at compile time. Then, each helper function would call your vararg function.

You will save one instruction per call, at the cost of n helper functions, where n is the maximum number of arguments you support.

Sample code:

#include <stdio.h>
#include <stdarg.h>
#include <stdint.h>

#define GET_MACRO(_1,_2,_3,NAME,...) NAME
#define func(...) GET_MACRO(__VA_ARGS__, helper3, helper2, helper1)(__VA_ARGS__)

void varargFn(int n, ...)
{
        int sum = 0;
        va_list vl;
        va_start(vl, n);
        for (int i = 0; i < n; i++)
                /* The helpers pass void *, so read the same type back. */
                sum += (int)(intptr_t)va_arg(vl, void *);

        va_end(vl);
        printf("%d\n", sum);
}

void helper1(void *v1)
{
        varargFn(1, v1);
}

void helper2(void *v1, void *v2)
{
        varargFn(2, v1, v2);
}

void helper3(void *v1, void *v2, void *v3)
{
        varargFn(3, v1, v2, v3);
}

int main()
{
        func((void *) 5);
        func((void *) 5, (void *) 5);
        func((void *) 5, (void *) 5, (void *) 5);

        return 0;
}

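And a short snippet of the x86-64 assembly generated by running gcc -S -Os -std=c99 (note that on this target the arguments travel in registers, not on the stack):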

helper3:
.LFB14:
        .cfi_startproc
        movq    %rdx, %rcx
        xorl    %eax, %eax
        movq    %rsi, %rdx
        movq    %rdi, %rsi
        movl    $3, %edi
        jmp     varargFn
        .cfi_endproc
.LFE14:
        .size   helper3, .-helper3
        .section        .text.startup,"ax",@progbits
        .globl  main
        .type   main, @function
main:
.LFB15:
        .cfi_startproc
        pushq   %rax
        .cfi_def_cfa_offset 16
        movl    $5, %edi
        call    helper1
        movl    $5, %esi
        movl    $5, %edi
        call    helper2
        movl    $5, %edx
        movl    $5, %esi
        movl    $5, %edi
        call    helper3
        xorl    %eax, %eax
        popq    %rdx
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE15:
        .size   main, .-main
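
One caveat about the GET_MACRO dispatch in the sample code, since MSVC is the priority: as far as I know, MSVC's traditional preprocessor forwards __VA_ARGS__ to another macro as a single argument, so the selection would always land on helper1 unless the build uses /Zc:preprocessor. A common workaround is an extra expansion step. The sketch below is only an illustration under that assumption; EXPAND and varargSum are names made up here, and the worker returns its sum (instead of printing) purely so the dispatch can be checked.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>

/* Hypothetical stand-in for the variadic worker; returns the sum. */
int varargSum(int n, ...)
{
    int sum = 0;
    va_list vl;

    assert(n >= 0);
    va_start(vl, n);
    for (int i = 0; i < n; i++)
        sum += (int)(intptr_t)va_arg(vl, void *);
    va_end(vl);
    return sum;
}

/* Fixed-arity wrappers, one per supported argument count. */
int helper1(void *v1)                     { return varargSum(1, v1); }
int helper2(void *v1, void *v2)           { return varargSum(2, v1, v2); }
int helper3(void *v1, void *v2, void *v3) { return varargSum(3, v1, v2, v3); }

/* EXPAND forces one more rescan, so the arguments are split before
   GET_MACRO picks its fourth parameter. The trailing 0 keeps the "..."
   of GET_MACRO non-empty, which strict C99 requires. */
#define EXPAND(x) x
#define GET_MACRO(_1, _2, _3, NAME, ...) NAME
#define func(...) \
    EXPAND(GET_MACRO(__VA_ARGS__, helper3, helper2, helper1, 0))(__VA_ARGS__)
```

With this in place, func((void *) 1, (void *) 2) expands to helper2((void *) 1, (void *) 2) on a conforming preprocessor; whether a given MSVC version needs the EXPAND step is worth confirming by looking at the preprocessed output.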

You could probably squeeze a couple more bytes out of the helper functions if you manage to avoid this nasty shift of n elements across registers. One idea that comes to mind is to rewrite helper3 as:

void helper3(void *v1, void *v2, void *v3)
{
    varargFn(3, v2, v3, v1);
}

but then you would have to modify your varargFn, which might not be worth the trouble.
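
To give an idea of what that modification might look like: with the rotated call, the worker receives (n, a2, ..., an, a1), so it has to put the last variadic argument back in front. This is a rough sketch, not a drop-in replacement; varargFnRotated, MAX_ARGS and the helperNr names are hypothetical, and the return value is an order-sensitive digit string chosen only so the reordering can be observed.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>

#define MAX_ARGS 3 /* hypothetical limit, matching three helpers */

/* Receives its variadic arguments rotated left by one position:
   (n, a2, ..., an, a1). Restores the order a1..an, then folds the
   values into a decimal number so the order is visible. */
int varargFnRotated(int n, ...)
{
    intptr_t args[MAX_ARGS];
    int result = 0;
    va_list vl;

    assert(n >= 1 && n <= MAX_ARGS);
    va_start(vl, n);
    /* a2..an arrive first and belong in slots 1..n-1. */
    for (int i = 1; i < n; i++)
        args[i] = (intptr_t)va_arg(vl, void *);
    /* The original first argument arrives last. */
    args[0] = (intptr_t)va_arg(vl, void *);
    va_end(vl);

    for (int i = 0; i < n; i++)
        result = result * 10 + (int)args[i];
    return result;
}

/* Helpers pass everything after v1 in place and append v1 at the end. */
int helper1r(void *v1)                     { return varargFnRotated(1, v1); }
int helper2r(void *v1, void *v2)           { return varargFnRotated(2, v2, v1); }
int helper3r(void *v1, void *v2, void *v3) { return varargFnRotated(3, v2, v3, v1); }
```

For a function like the sum above the order does not matter, so no change would be needed there; for a logger that formats its arguments in sequence, some un-rotation like this would be.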

Using `void*` is not a universally good idea. In 32 bit mode you might want to print a 64 bit value which does not fit into a void*. In 64 bit mode you might want to print a float which is not passed in the same register. — Jester, Feb 25 '16 at 16:40
Good point; you might have to provide helper functions for every combination of types that you use. Just writing them in asm doesn't help. — Peter Cordes, Feb 26 '16 at 04:30
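
A minimal sketch of what such typed helpers could look like, assuming only 64-bit integers and doubles are ever logged. Everything here is hypothetical: varargTotal, the one-letter type string ('q' for int64_t, 'd' for double) and the helper names are inventions for illustration, and the worker returns a total only so the result can be checked.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>

/* Hypothetical worker: 'types' holds one letter per variadic argument,
   'q' for int64_t and 'd' for double. Returns the total as a double. */
double varargTotal(const char *types, ...)
{
    double total = 0.0;
    va_list vl;

    assert(types != 0);
    va_start(vl, types);
    for (const char *t = types; *t; t++) {
        if (*t == 'q')
            total += (double)va_arg(vl, int64_t);
        else if (*t == 'd')
            total += va_arg(vl, double);
        else
            break; /* unknown letter: stop rather than misread the list */
    }
    va_end(vl);
    return total;
}

/* One fixed-signature helper per combination of types actually used. */
double helper_q(int64_t a)            { return varargTotal("q", a); }
double helper_qd(int64_t a, double b) { return varargTotal("qd", a, b); }
double helper_dd(double a, double b)  { return varargTotal("dd", a, b); }
```

Because each helper has a real prototype, the compiler passes every argument in whatever register or stack slot its type calls for, at the price of one helper per combination.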
