
For example, would the functions:

void foo(float*,float*,int,float);
void foo(float*,float,float*,int);

have the same or different overhead?

Edit: I'm not asking about how the compiler will optimize things. I'm specifically asking how, under the cdecl calling convention, the overhead differs across various ABIs.

Mike Izbicki
  • Obviously different. – Mr. Perfectionist Jul 03 '15 at 21:52
  • Related: I recall something about gcc's counterintuitive-but-faster right-to-left argument passing vs clang's intuitive-but-slower left-to-right argument evaluation. – o11c Jul 03 '15 at 21:52
  • I would imagine that it would be the same unless there were so many arguments that some had to be put on the stack. Then the order may matter as it might determine which types are placed on the stack, no? – asimes Jul 03 '15 at 21:55
  • There is no overloading in C. – Iharob Al Asimi Jul 03 '15 at 22:08
  • It's a dupe. I remember an exact question on SO with the C++ tag but failed to find it. – haccks Jul 03 '15 at 22:09
  • Note: The PCS and possibly the ABI massively depend on the target platform. Not only the CPU but also the OS plays a major role. The question as asked cannot be answered right now. – too honest for this site Jul 03 '15 at 22:16
  • Your last edit does not make the question less broad. I still tend to vote to close - yet did not. Hmm, missing research effort might be another option. Did you actually do some research yourself? – too honest for this site Jul 03 '15 at 22:18
  • My machine is x86-64, so that's what I'm most interested in. But I'm really interested in a broader understanding so I could apply this knowledge to any ABI. – Mike Izbicki Jul 03 '15 at 22:18
  • Mike: you **cannot** apply any answer to a different ABI. Leaving aside different register sets and stack implementations, there are various ABIs even for C. I strongly recommend reading some PCS and ABI documents. From a personal view, I can recommend the related documents from ARM: the AAPCS (start here) and the various ABIs for C, C++, exception handling, etc. These are freely available for download or online reading from ARM. AFAIK, Intel provides similar documents, but as I never needed them, I cannot say anything about their quality. – too honest for this site Jul 03 '15 at 22:22
  • @iharob no overload here, just two signature suggestions for the same function. – Quentin Jul 03 '15 at 22:23
  • How can you ask about the performance impact of something when you aren't asking about how the compiler will optimize it? In a completely naive view, the compiler could refuse to use registers and always put all the arguments on the stack, in which case it almost certainly doesn't matter (unless you get cache performance issues). I don't think this question has a reasonable answer. – Jeremy West Jul 03 '15 at 22:24
  • @JeremyWest The functions are exposed at a library interface using the cdecl convention. Internally the compiler's doing all sorts of amazing optimizations, but it can't touch the interface. – Mike Izbicki Jul 03 '15 at 22:25
  • @MikeIzbicki: Ever heard about link time optimization (LTO)? That will inline functions even from other compilation units. That is exactly my point: do some more research! – too honest for this site Jul 03 '15 at 22:27
  • Oh, so the functions are compiled into a shared library? If it is a static library, it is still possible for the compiler to do whatever it wants (depending on how it was compiled). – Jeremy West Jul 03 '15 at 22:28
  • I'm certain I want an answer to the question I asked :) The functions are in a static library, but I'm accessing the library through another language that requires the cdecl calling convention and does no transformations. – Mike Izbicki Jul 03 '15 at 22:30
  • Well, you just provided more information than the question you asked, which helps to make the question answerable :). According to the definition, unless I'm missing something, all arguments are passed on the stack in a specific order. The only effect on performance I could see from ordering would be cache effects depending on how the arguments are accessed in the function. See https://en.wikipedia.org/wiki/X86_calling_conventions#cdecl – Jeremy West Jul 03 '15 at 22:33
  • @Olaf Maybe I'm misunderstanding the line between a calling convention and an ABI. Take the example of struct padding. The result will obviously be different depending on the ABI, but if you know how struct padding works and you know the ABI, then you can figure out the details. I'm wondering if the cdecl convention has anything similar going on. – Mike Izbicki Jul 03 '15 at 22:33
  • Mike: I fully appreciate your curiosity. Read the documents I recommended (or search for the Intel equivalents). First have a look at Wikipedia. These will answer a lot of your questions. If you have a **specific** question then, feel free to ask. Anything else is beyond the scope of this site; it is not a tutorial or teaching site - please understand. – too honest for this site Jul 03 '15 at 22:37
  • Ok, just a very brief info: The procedure call standard (PCS) defines for a platform (CPU and - possibly - OS) how to pass certain types (8/16/32-bit integers, floats, compound types, etc.) between functions. The ABI specifies for each language how the specific types are converted to the base types defined by the PCS. Sometimes both are combined into a single document, but still for each language. I'm afraid there are still things you will have to read and not just watch on YouTube - luckily. – too honest for this site Jul 03 '15 at 22:41
  • Even if there is a difference, it's probably too little to matter. If you're programming an application, and you want maximum performance, don't laser-focus on tiny stuff. There's big stuff hovering outside the range of what you can guess. [*Find out what it is.*](http://stackoverflow.com/a/378024/23771) – Mike Dunlavey Jul 05 '15 at 19:53

2 Answers


Of course this kind of detail depends on the platform/ABI.

For example with x86-64 there should be no difference, as these few parameters would simply be passed in registers, and register use is almost symmetrical (so it doesn't really matter which registers end up being used).
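Under the System V AMD64 ABI (the usual x86-64 convention on Linux), for instance, both signatures from the question end up using the same set of registers, just assigned in a different order. A rough sketch with hypothetical names (the precise assignment is of course dictated by the ABI in use):

/* System V AMD64: pointer/integer arguments go in rdi, rsi, rdx, ...;
   float arguments go in xmm0, xmm1, ... */
void foo1(float *p, float *q, int n, float x);
/* p -> rdi, q -> rsi, n -> edx, x -> xmm0 */

void foo2(float *p, float x, float *q, int n);
/* p -> rdi, x -> xmm0, q -> rsi, n -> edx */

Either way, three integer-class registers and one SSE register are used and nothing touches the stack.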

With more parameters there would be some stack spilling, and in that case the order could make a difference depending on how the spilled parameters are used in the body of the function.

For example, if a spilled parameter is needed as a count for a loop it can be used directly from its stack slot; if instead it's a pointer, it must first be moved into a register before the pointed-to value can be dereferenced.
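As a hedged sketch (hypothetical function, assuming x86-64 System V where only the first six integer/pointer arguments travel in registers):

/* a..f fill the six integer registers, so count and out spill to the stack. */
void scale(int a, int b, int c, int d, int e, int f,
           int count, float *out)
{
    /* count can typically be compared straight from its stack slot;
       out has to be loaded into a register before out[i] can be
       dereferenced. */
    for (int i = 0; i < count; i++)
        out[i] *= 2.0f;
}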

Note that of course what exactly happens is up to the compiler (even with just two parameters)... so it's not impossible that there are differences; what is impossible is using a specific order to get better results in general (i.e. independently from the compiler).

6502

Traditional calling conventions will almost always allocate parameter space on the stack, and there is always overhead associated with copying arguments into this space.

Assuming a strictly volatile environment, the only additional overhead that can arise comes from memory alignment issues. In your given example, the parameters will be in contiguous memory, so there won't be any padding needed for alignment.

In the case of parameters with types of varying sizes, the parameters in the following declaration:

int func(int a, char c, int b);

will have padding between them, whereas those in this declaration:

int func(int a, int b, char c);

will not.

The stack frame for the former might look like:

| local vars... |                  low memory
+---------------+ - frame pointer
| a | a | a | a |
| c | X | X | X |
| b | b | b | b |
+---------------+                  high memory

And for the latter:

| local vars... |                  low memory
+---------------+ - frame pointer
| a | a | a | a |
| b | b | b | b |
| c | X | X | X |
+---------------+                  high memory

When the function gets called, the arguments will be written into the stack memory in the order they appear, so for the former you'll write the 4 bytes of int a, then the 1 byte of char c, then skip 3 padding bytes before writing the 4 bytes of int b.

In the latter, you'll be writing into contiguous memory locations, and won't need to account for skips due to padding.
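One way to visualize the difference (as an analogy only; the actual parameter area is laid out by the ABI, not by struct rules) is to mirror the two parameter orders with structs and print the member offsets:

#include <stdio.h>
#include <stddef.h>

struct padded   { int a; char c; int b; };  /* mirrors func(int, char, int) */
struct unpadded { int a; int b; char c; };  /* mirrors func(int, int, char) */

int main(void)
{
    /* In the first layout b lands at offset 8, leaving a 3-byte hole
       after c; in the second the members are contiguous. */
    printf("padded:   a=%zu c=%zu b=%zu\n",
           offsetof(struct padded, a),
           offsetof(struct padded, c),
           offsetof(struct padded, b));
    printf("unpadded: a=%zu b=%zu c=%zu\n",
           offsetof(struct unpadded, a),
           offsetof(struct unpadded, b),
           offsetof(struct unpadded, c));
    return 0;
}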

In a volatile environment, we're talking about a difference in performance on the order of several nanoseconds for the skips. The performance hit may be detectable but is almost negligible.

(By the way, how skipping is done is entirely architecture-dependent...but I'd bet in general it is just a higher offset for the next address to fill. I'm not completely sure how this might be done differently in different architectures).

Of course, in a non-volatile environment, when we utilize CPU caching, the performance hit goes down to fractions of a nanosecond. We'd be venturing into the undetectable, and so the difference is effectively nonexistent.

Data padding is really only a space cost. When you're working in embedded systems, you'll want to order your parameters from largest to smallest to reduce (and sometimes eliminate) padding.
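The same principle is easiest to demonstrate with struct members; on a typical machine with 4-byte int alignment, sorting members from largest to smallest shrinks the struct (sizes are implementation-defined, so treat the numbers as illustrative):

#include <stdio.h>

struct scattered { char c; int a; char d; int b; char e; };  /* padding after each char */
struct sorted    { int a; int b; char c; char d; char e; };  /* chars packed together */

int main(void)
{
    printf("scattered: %zu bytes\n", sizeof(struct scattered));  /* often 20 */
    printf("sorted:    %zu bytes\n", sizeof(struct sorted));     /* often 12 */
    return 0;
}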

So, as far as I can tell (without further information like the exact data transfer rates between memory on a particular machine or architecture), there shouldn't be a performance hit for different parameter orders.

Purag
  • Will the function `void foo(char,char,char,char,int)` use 8 bytes of stack or 5*4=20 bytes? Where can I find a reference that discusses this? – Mike Izbicki Jul 04 '15 at 20:44
  • 1
    Well, the parameter space will be 8 bytes. There is other stuff on the stack, like local variables, return addresses, possibly return *values*, etc. It depends very heavily on the architecture. But yes, you have four one-byte characters and one four-byte int; there's no padding, so it takes up 8 bytes. – Purag Jul 04 '15 at 20:49