How much would performance differ between these two situations?

int func(int a, int b) { return a + b; }

And

void func(int a, int b, int * c) { *c = a + b; }

Now, what if it's a struct?

typedef struct { int a; int b; char c; } my;

my func(int a, int b, char c) { my x; x.a = a; x.b = b; x.c = c; return x; }

And

void func(int a, int b, int c, my * x) { x->a = a; x->b = b; x->c = c; }

One thing I can think of is that a register cannot be used for this purpose, correct? Other than that, I am unaware of how this function would turn out after going through a compiler.

Which would be more efficient and speedy?

Carol Victor
  • Check the generated assembler in all cases (with compiler optimisations turned on), and profile the performance. – Bathsheba Nov 08 '20 at 15:48
  • where is the correlation in the last 2 cases with the first cases. Those last 2 cases are not even remotely similar – Chase Nov 08 '20 at 15:50
  • I'm not knowledgeable enough to determine how good the produced assembly code is (i.e. could be few instructions with greater latency) nor am I knowledgeable enough to benchmark them properly, so I am looking for answers from people who have questioned this before themselves and have been able to come to a conclusion on how the compiler behaves. – Carol Victor Nov 08 '20 at 15:50
  • @Chase It is correlated to my title, comparing returning data vs. using a pointer. – Carol Victor Nov 08 '20 at 15:51
  • Does this answer your question? [Return a `struct` from a function in C](https://stackoverflow.com/questions/9653072/return-a-struct-from-a-function-in-c) or [How does function ACTUALLY return struct variable in C?](https://stackoverflow.com/q/22957175/69809) – vgru Nov 08 '20 at 15:54
  • You can return a struct from a function, but, depending on its size and compiler implementation/optimization settings, you could end up with lots of copying. Gcc even has a `-Waggregate-return` setting to warn you if you are returning structs. But generally, I find returning small structs to be more readable, and they are handled well at higher optimization settings. – vgru Nov 08 '20 at 15:57
  • @Groo It does tell me how Example 3 would be dealt with, but I am interested in performance, some benchmarks perhaps. – Carol Victor Nov 08 '20 at 15:58
  • @Carol: it depends on the exact case, compiler and its optimization settings. Generally, returning the struct means *copying* the returned data (additional work compared to the parameter reference). Unless the struct is tiny (fits in a register) and the compiler can optimize the call, *or* the function is static and inlined, and the compiler knows where to put the results at the end of the call. Also, the parameter-based variant allows you to change specific fields only, while the return-based must ensure that the struct is fully overwritten (at least functionally). – vgru Nov 08 '20 at 16:02
  • *"I'm not knowledgeable enough...[so I'm seeking advice from other people]"*. That's perfectly fine. But the advice is to learn how to profile your code. – JohnFilleau Nov 08 '20 at 16:04
  • If it is a tight loop with millions of iterations, the extra pointer indirection can be a performance killer. But that can vary by platform. The answer is: profile, profile, profile. Preferably with a real application rather than a small toy program that tries to exacerbate the situation because that could easily encode a bias that will skew the results. – Eljay Nov 08 '20 at 16:27
  • Even with pointer parameter, it is often better to fill local struct and write it to the pointer only at the end of the function body. That's because of aliasing problem: when you write to local struct, compiler is sure it won't affect other variables; but every time you write to external pointer, compiler assumes everything else nonlocal may change. – stgatilov Nov 20 '20 at 18:01

1 Answer

If the function can be inlined, there is often no difference between the first two.

Otherwise (no inlining, for example because link-time optimization is not enabled), returning an int by value is more efficient because it's just a value in a register that can be used right away. Also, the caller doesn't have to pass as many args, or find/make space to point at. With the output-pointer version, if the caller does want to use the output value, it has to reload it from memory, adding latency to the total dependency chain from inputs ready to output ready. (Store-forwarding latency is ~5 cycles on modern x86 CPUs, vs. 1 cycle latency for the lea eax, [rdi + rsi] that would implement that function for x86-64 System V.)

The exception is maybe for rare cases where the caller isn't going to use the value, just wants it in memory at some address. Passing that address to the callee (in a register) so it can be used there means the caller doesn't have to keep that address anywhere that will survive across the function call.
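
As a rough illustration of the two call patterns compared above, here is a minimal sketch. The names add_val, add_out, scaled_val, scaled_out and check_same_result are hypothetical (the question calls both versions func), and whether the reload actually shows up depends on the compiler, the flags, and whether the call gets inlined.

```c
#include <assert.h>

/* Hypothetical names: the question calls both versions func. */
int add_val(int a, int b) { return a + b; }
void add_out(int a, int b, int *c) { *c = a + b; }

/* By value: the sum can feed the multiply straight from the return register. */
int scaled_val(int a, int b) { return add_val(a, b) * 10; }

/* Out-parameter: the caller provides storage and, if add_out is not
   inlined, reloads the stored result before it can use it. */
int scaled_out(int a, int b) {
    int tmp;
    add_out(a, b, &tmp);
    return tmp * 10;
}

/* Both forms compute the same value; only the calling pattern differs. */
void check_same_result(void) {
    assert(scaled_val(2, 3) == scaled_out(2, 3));
}
```

Compiling this with optimization on and comparing scaled_val against scaled_out in the assembly output (with inlining disabled for the two small functions) is one way to see the store and reload in the out-parameter version.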


For the struct version:

"a register cannot be used for this purpose, correct?"

No, for some calling conventions, small structs can be returned in registers.

x86-64 System V will return your my struct by value in the RDX:RAX register pair because it's no larger than 16 bytes (12 bytes here, including padding) and all integer. (And trivially copyable.) Try it on https://godbolt.org/z/x73cEh:

# clang11.0 -O3 for x86-64 SysV
func_val:
        shl     rsi, 32
        mov     eax, edi
        or      rax, rsi             # (uint64_t)b<<32 | a;  the low 64 bits of the struct
    # c was already in EDX, the low half of RDX; clang leaves it there.
        ret
func_out:
        mov     dword ptr [rcx], edi
        mov     dword ptr [rcx + 4], esi        # just store the struct members 
        mov     byte ptr [rcx + 8], dl          # to memory pointed-to by 4th arg
        ret
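
The size condition and the packing clang does can also be written out in C. This is only a sketch: func_val and func_out follow the asm labels above, low_half is a hypothetical helper that mirrors the shl / mov / or sequence, and the 12-byte size assumes a typical x86-64 target with 4-byte int.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int a; int b; char c; } my;

/* Same shape as the question's two struct functions, named after the asm labels. */
my func_val(int a, int b, char c) { my x; x.a = a; x.b = b; x.c = c; return x; }
void func_out(int a, int b, int c, my *x) { x->a = a; x->b = b; x->c = (char)c; }

/* Hypothetical helper: the 64-bit value clang builds in RAX,
   with a in the low 32 bits and b in the high 32 bits. */
uint64_t low_half(int a, int b) {
    return ((uint64_t)(uint32_t)b << 32) | (uint32_t)a;
}

/* Assumes 4-byte int: 4 + 4 + 1 bytes plus 3 bytes of tail padding. */
static_assert(sizeof(my) == 12, "expected a 12-byte struct");
static_assert(sizeof(my) <= 16, "small enough for a register pair");
```

On a target with a different int size the first static_assert fails at compile time, which is the intended signal that the size reasoning above no longer applies as written.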

GCC doesn't assume that char c is correctly sign-extended to EDX the way clang does (unofficial ABI feature). GCC does a really dumb byte store / dword reload that creates a store-forwarding stall, to get uninitialized garbage from memory instead of from the high bytes of EDX. Purely a missed optimization, but see it at https://godbolt.org/z/WGcqKc. It also insanely uses SSE2 to merge the two integers into a 64-bit value before doing a movq rax, xmm0, or a store to memory for the output-arg version.

You definitely want the struct version to inline if the caller uses the values, so this packing into return-value registers can be optimized away.

"How does function ACTUALLY return struct variable in C?" has an ARM example for a larger struct: for return by value, the caller passes a hidden pointer to its return-value object. From there, the result may need to be copied by the caller if it is assigned to something that escape analysis can't prove is private (e.g. through some pointer). See also "What prevents the usage of a function argument as hidden pointer?"
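
To make the hidden-pointer idea concrete, here is a hand-written sketch of roughly what a calling convention does for a struct too large for registers. Everything in it is hypothetical (the big type and the make_big, make_big_hidden and store_result names), and real compilers do this lowering internally rather than through a visible second function.

```c
#include <assert.h>

/* Hypothetical struct: 32 bytes with 4-byte int, so too large
   for the RDX:RAX pair on x86-64 System V. */
typedef struct { int v[8]; } big;

/* What you write: return by value. */
big make_big(int seed) {
    big r;
    for (int i = 0; i < 8; i++) r.v[i] = seed + i;
    return r;
}

/* Roughly what it becomes: the caller passes a pointer to the
   return-value object and the callee fills it in. */
void make_big_hidden(big *ret, int seed) {
    for (int i = 0; i < 8; i++) ret->v[i] = seed + i;
}

/* Assigning through a pointer the compiler cannot prove private
   may go through a temporary that is then copied into *dst. */
void store_result(big *dst, int seed) {
    *dst = make_big(seed);
}
```

All three produce the same contents; any difference is in how many copies the compiler has to emit, which is best checked in the generated asm for the target you care about.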

Also related: "Why is tailcall optimization not performed for types of class MEMORY?"

"How do C compilers implement functions that return large structures?" points out that code-gen may differ between C and C++.

I don't know how to explain any general rule of thumb that one could apply without understanding asm and the calling convention you care about. Usually pass/return large structs by reference, but for small structs it's very much "it depends".

Peter Cordes