4

I have read some questions about returning more than one value, such as "What is the reason behind having only one return value in C++ and Java?", "Returning multiple values from a C++ function", and "Why do most programming languages only support returning a single value from a function?".

I agree with most of the arguments showing that more than one return value is not strictly necessary, and I understand why such a feature hasn't been implemented, but I still don't understand why we can't use multiple caller-saved registers such as ECX and EDX to return such values.

Wouldn't it be faster to use registers instead of creating a class/struct to hold those values or passing arguments by reference/pointer, both of which use memory to store them? If this is possible, does any C/C++ compiler use it to speed up the code?

Edit:

Ideally, the code would look like this:

(int, int) getTwoValues(void) { return 1, 2; }

int main(int argc, char** argv)
{
    // a and b are actually returned in registers
    // so future operations with a and b are faster
    (int a, int b) = getTwoValues();
    // do something with a and b
    
    return 0;
}
Nighteen
  • 4
    You can logically `return` more than one value, if your `return` a `struct`. – Fiddling Bits Jul 19 '15 at 01:34
  • 2
    But if I return a struct, wouldn't I be returning a pointer to that struct in memory that would be used to get the values after the function returns? – Nighteen Jul 19 '15 at 01:35
  • 1
    Well, fastcall passes the first 2 params in ecx and edx and returns in eax, and it's not much faster than the default cdecl calling convention. Some benchmarks even show that fastcall is slower. –  Jul 19 '15 at 01:37
  • 1
    Do you have example code that generates worse machine code (in optimized builds) than you like? – Kerrek SB Jul 19 '15 at 01:37
  • 2
    @user2565020 that's an implementation detail that is up to the compiler – M.M Jul 19 '15 at 01:37
  • 1
    in C++ you can return `pair` or `tuple` to return multiple values – M.M Jul 19 '15 at 01:38
  • Are you asking why C++ was designed this way? Or are you asking if modern hardware would support a language that had multiple return values? This starts as a design question, but then jumps to hardware implementation. – Drew Dormann Jul 19 '15 at 01:39
  • @CamelToe But the arguments passed do not interfere with return values; the registers can simply be overwritten with no loss. – Nighteen Jul 19 '15 at 01:40
  • 1
    @DrewDormann I am asking both of those questions :) – Nighteen Jul 19 '15 at 01:41
  • Well, you only have so many registers to work with in x86, so what should the maximum number of return values be? Also, my point was that you rarely see any performance gain even with fastcall, which uses registers, so why bother doing it with multiple values, which is limited as well? –  Jul 19 '15 at 01:46
  • @CamelToe Ok I understand the speed gain is not much, but it should be theoretically faster. – Nighteen Jul 19 '15 at 01:48
  • In x86 long return values are passed in edx:eax or rdx:rax. In MIPS there are also 2 registers for returning value: v0 and v1. – phuclv Jul 19 '15 at 02:20
  • In ARM it's R0 and R1 – phuclv Jul 19 '15 at 02:31
  • 1
    It is certainly possible to return multiple values, it's simply a language design choice that was made years and years ago. If you'd like to return multiple values the solution is to use call by reference rather than call by value. It's not precisely "returning" but it's certainly allowing modification. Certainly multiple returns could be done in registers or, even better, using the stack. It's simply not done in the languages that you list. – David Hoelzer Jul 19 '15 at 11:29
  • Be careful about answers that attack or beg the question. It's pretty common to find "why can't we have X" questions answered with "because having X is bad, actually" when that isn't the case even remotely. More often than not it turns out to be a historical reason with tons of outliers that invalidate the whole "because having X is bad, actually" class of arguments. – raianmr Jan 22 '23 at 16:19

3 Answers

8

Yes, this is sometimes done. The Wikipedia page on x86 calling conventions says, under cdecl:

There are some variations in the interpretation of cdecl, particularly in how to return values. As a result, x86 programs compiled for different operating system platforms and/or by different compilers can be incompatible, even if they both use the "cdecl" convention and do not call out to the underlying environment. Some compilers return simple data structures with a length of 2 registers or less in the register pair EAX:EDX, and larger structures and class objects requiring special treatment by the exception handler (e.g., a defined constructor, destructor, or assignment) are returned in memory. To pass "in memory", the caller allocates memory and passes a pointer to it as a hidden first parameter; the callee populates the memory and returns the pointer, popping the hidden pointer when returning.

(emphasis mine)

Ultimately, it comes down to the calling convention. Your compiler can optimize your code to use whatever registers it wants, but when your code interacts with other code (like the operating system), it needs to follow the standard calling conventions, which typically use one register for returning values.
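The "returned in memory" case in the quoted passage can be pictured in plain C++. This is only a conceptual sketch: `Big`, `makeBig`, `makeBigLowered`, and `sameResult` are hypothetical names, and a real compiler performs this rewrite internally according to its ABI rather than at source level.

```cpp
// Hypothetical struct, too large for a register pair on typical ABIs.
struct Big {
    long values[8];
};

// What the programmer writes: return by value.
Big makeBig() {
    Big b{};
    for (int i = 0; i < 8; ++i) b.values[i] = i;
    return b;
}

// Roughly what an "in memory" return amounts to: the caller allocates
// the storage and passes its address as a hidden first argument; the
// callee fills it in and hands the same pointer back.
Big* makeBigLowered(Big* hiddenResult) {
    for (int i = 0; i < 8; ++i) hiddenResult->values[i] = i;
    return hiddenResult;
}

// Both forms produce the same contents; only the mechanism differs.
bool sameResult() {
    Big byValue = makeBig();
    Big storage{};                       // caller-allocated space
    Big* p = makeBigLowered(&storage);
    for (int i = 0; i < 8; ++i) {
        if (byValue.values[i] != p->values[i]) return false;
    }
    return p == &storage;
}
```

A struct small enough for the register pair skips this hidden pointer entirely, which is the case the question is asking about.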

Cornstalks
  • If I understood correctly, are the structures converted to fit 2 registers? If so, is it possible to use more than 2 registers? – Nighteen Jul 19 '15 at 01:46
  • 2
    The data structure needs to fit in 2 registers. If 2 registers don't provide enough bits to store the whole structure, then it needs to use memory to pass the structure. But this is for this particular calling convention. It's possible for the compiler to use however many registers it wants (within physical limits) for returning values, so it's possible for another compiler to use 3+ registers for returning values. – Cornstalks Jul 19 '15 at 01:48
  • Does any compiler do such a thing (use more than 2 registers), especially MSVC and GCC? – Nighteen Jul 19 '15 at 01:49
  • 1
    I'm not sure, to be honest. But I'm also not sure if it matters very much. Modern CPUs have hundreds of registers and employ register renaming and value forwarding in their pipeline. Those kinds of things make optimizations like this less worthwhile than they would be otherwise. – Cornstalks Jul 19 '15 at 02:00
  • For x86, because of the limited number of registers, probably no compiler will use more registers. For x86_64, with a lot more registers, things are different – phuclv Jul 19 '15 at 02:25
  • 1
    The 64-bit mode has the same registers, but 64 bits wide, plus about double the number of specialty registers (like those used by SSE and MMX) and more FPU registers as well. 64-bit also accesses more memory due to wider bus addresses, though in most cases it's not that special. If you like, the tech specs for more CPUs are publicly available; it's a good read if you really want to understand how they work. They include the full instruction set, pipeline explanations, sample code, everything you need to know to make an OS :) – ydobonebi Jul 19 '15 at 03:40
  • @QuinnRoundy it's not the same. The number of general purpose registers has been doubled in x86_64 (actually more than doubled, because previously we only had 6 or 7 [minus esp-ebp], now we have 14 or 15 [minus rsp-rbp]) – phuclv Jul 25 '15 at 09:49
  • I suppose I under-generalized. I have a complete list of the registers if you would like me to post it. I wasn't trying to be exact. (R/E)A(L/H,X,etc),B,C,D,SI/SIL,BP/BPL, SP/SPL, R8-R15(L/W/D) for the general purpose registers. Then CS/DS/SS/ES/FS/GS for segment registers. FLAGS/EFLAGS/RFLAGS for the flags. Then we have ST0-ST7, which are FPU registers. MMX/XMM adds MM0-7 and XMM0-7. Some registers used by the system are CR0-CR4, GDTR, LDTR, IDTR. And of course let's not forget the debug registers DR0-DR7. There are even more. Source: Intel 64 and IA32 Architectures Software Developer's Manual. – ydobonebi Jul 25 '15 at 16:16
  • There are a few I know exist that I couldn't see listed in the manual. Plus I thought SSE added registers of its own, but I couldn't find anything on it in the manual. The ones added by 64-bit are the ones prefixed by R, i.e. RAX is the 64-bit A register, EAX is the 32-bit A register, AX is the 16-bit A register, and AH/AL are the high and low bytes of AX. So I guess it's less than double if you consider that all general purpose registers were simply doubled but the ST, MMX and XMM registers were not (I believe those are 128-bit), and the CR and DTR registers stayed the same – ydobonebi Jul 25 '15 at 16:17
4

Returning on the stack isn't necessarily slower, because once the values are in L1 cache (which is usually the case for the stack), accessing them is very fast.

However, most computer architectures have at least 2 registers available for returning values that are twice (or more) as wide as the word size (edx:eax in x86, rdx:rax in x86_64, $v0 and $v1 in MIPS (see "Why MIPS assembler has more that one register for return value?"), R0:R3 in ARM [1], X0:X7 in ARM64...). The ones that don't are mostly microcontrollers with only one accumulator or a very limited number of registers.

[1] "If the type of value returned is too large to fit in r0 to r3, or whose size cannot be determined statically at compile time, then the caller must allocate space for that value at run time, and pass a pointer to that space in r0."

These registers can also be used to directly return small structs that fit in 2 registers or fewer (or more, depending on the architecture and ABI).

For example, with the following code:

struct Point
{
    int x, y;
};

struct shortPoint
{
    short x, y;
};

struct Point3D
{
    int x, y, z;
};

Point P1()
{
    Point p;
    p.x = 1;
    p.y = 2;
    return p;
}

Point P2()
{
    Point p;
    p.x = 1;
    p.y = 0;
    return p;
}

shortPoint P3()
{
    shortPoint p;
    p.x = 1;
    p.y = 0;
    return p;
}

Point3D P4()
{
    Point3D p;
    p.x = 1;
    p.y = 2;
    p.z = 3;
    return p;
}

Clang emits the following instructions for x86_64 as you can see here

P1():                                 # @P1()
    movabs  rax, 8589934593
    ret

P2():                                 # @P2()
    mov eax, 1
    ret

P3():                                 # @P3()
    mov eax, 1
    ret

P4():                                 # @P4()
    movabs  rax, 8589934593
    mov edx, 3
    ret

For ARM64:

P1():
    mov x0, 1
    orr x0, x0, 8589934592
    ret
P2():
    mov x0, 1
    ret
P3():
    mov w0, 1
    ret
P4():
    mov x1, 1
    mov x0, 0
    sub sp, sp, #16
    bfi x0, x1, 0, 32
    mov x1, 2
    bfi x0, x1, 32, 32
    add sp, sp, 16
    mov x1, 3
    ret

As you can see, no values are stored to or loaded from the stack (the ARM64 P4 adjusts sp but never accesses that memory). You can switch to other compilers to see that the values are mainly returned in registers.
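Whether a struct qualifies for register return depends on the ABI, but being small and trivially copyable are the usual preconditions. As a hedged sketch (`TwoInts`, `makeTwoInts`, and `sumOfReturned` are illustrative names, not part of the example above), those properties can be checked at compile time for a type shaped like `Point`:

```cpp
#include <type_traits>

// Illustrative type with the same layout as Point in the example above.
struct TwoInts {
    int a;
    int b;
};

// Preconditions that ABIs commonly require before returning a struct
// in registers; the actual rule is defined by each platform's ABI.
static_assert(std::is_trivially_copyable<TwoInts>::value,
              "TwoInts must be trivially copyable");
static_assert(sizeof(TwoInts) == 2 * sizeof(int),
              "TwoInts must contain no padding");

// Returned by value; no out-parameters or heap allocation needed.
TwoInts makeTwoInts() {
    return {1, 2};
}

// The caller reads both members directly from the returned object.
int sumOfReturned() {
    TwoInts r = makeTwoInts();
    return r.a + r.b;
}
```

The checks do not prove that a given compiler uses registers; inspecting the generated assembly, as done above, is the only way to confirm that.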

phuclv
1

Return data is put on the stack. Returning a struct by copy is literally the same thing as returning multiple values, in that all its data members are put on the stack. If you want multiple return values, that is the simplest way. I know that in Lua that's exactly how it is handled: it just wraps them in a struct. Why was it never implemented? Probably because you could just do it with a struct, so why implement a different method? As for C++, it actually does support multiple return values, but in the form of a special class, which is really the same way Java handles multiple return values (tuples) as well. So in the end it's all the same: either you copy the raw data (a struct/object, not a pointer or reference to it) or you copy a pointer to a collection that stores multiple values.
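The "special class" route in C++ can be sketched with `std::tuple` and `std::tie` from the standard library. The function names `getTwoValues` and `useBoth` are only illustrative, and this shows the language-level mechanism, not how any particular compiler passes the values.

```cpp
#include <tuple>

// Returns two values bundled in a standard tuple.
std::tuple<int, int> getTwoValues() {
    return std::make_tuple(1, 2);
}

// Unpacks the tuple into two ordinary variables with std::tie.
int useBoth() {
    int a = 0;
    int b = 0;
    std::tie(a, b) = getTwoValues();
    return a * 10 + b;  // combine both so the caller can check each one
}
```

Since C++17, structured bindings allow the shorter form `auto [a, b] = getTwoValues();`, which is close to the syntax proposed in the question.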

ydobonebi
  • But returning values on the stack is actually slower than returning, let's say, two int values in EAX and ECX. – Nighteen Jul 19 '15 at 01:44
  • Lua doesn't wrap multiple return values in a "struct"; it has a stack and pushes all the return values onto that. Are you confusing `return a, b, c` with `return {a, b, c}` perhaps? – Kerrek SB Jul 19 '15 at 01:45
  • Fair point, but we're arguing semantics. It would be the same whether it were a struct or not. All the values are pushed onto the stack the same way. As for returning via registers, you're splitting hairs over a trivial amount of optimization. Yes, registers are faster, but if you actually check the number of clock cycles and compare, then convert to real time, we're talking about an immeasurable time saving. EDIT: I just looked it up; the clock cycles for the push operation versus moving to a register are pretty much the same, so there really isn't any difference. – ydobonebi Jul 19 '15 at 03:31
  • I checked 4 different processors btw; the difference is around 1-2 clock cycles, which averages (based on the same clock speeds) about 0.0000000001 second of time savings (I hope I put enough 0s in there). – ydobonebi Jul 19 '15 at 03:35
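To put the cycle counts from the last comment into time units, a rough conversion helps. This is a sketch under an assumed 3 GHz clock, a figure that is not given anywhere in the thread; `cyclesToSeconds` is an illustrative helper.

```cpp
// Convert a cycle count to seconds for a given clock frequency in Hz.
// The frequency is an input because the thread does not state one.
double cyclesToSeconds(double cycles, double clockHz) {
    return cycles / clockHz;
}

// Assumed clock for illustration only: 3 GHz.
const double kAssumedClockHz = 3.0e9;

// Time cost of the 1 and 2 cycle differences mentioned in the comment.
double oneCycleSeconds() { return cyclesToSeconds(1.0, kAssumedClockHz); }
double twoCycleSeconds() { return cyclesToSeconds(2.0, kAssumedClockHz); }
```

With those assumptions, 1 to 2 cycles comes to roughly 0.3 to 0.7 nanoseconds per call.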