This is Undefined Behaviour. It might happen to work, or it might not, depending on what the compiler happened to choose when compiling for some specific target. It's literally undefined, not "guaranteed to break"; that's the entire point. Compilers can just completely ignore the possibility of UB when generating code, not using extra instructions to make sure UB breaks. (If you want that, compile with -fsanitize=undefined.)
Understanding exactly what happened requires looking at the asm, not just trying to run it.
warning: address of stack memory associated with local variable 'buf' returned [-Wreturn-stack-address]
return buf;
^~~
Clang prints this warning even without -Wall enabled, precisely because it's not legal C, regardless of what asm calling convention you're targeting.
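For contrast, here is a minimal sketch of a well-defined variant, in which the caller owns the storage. The name retx_into, its size parameter, and the string contents "buf" are assumptions for illustration; the original retx() is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical well-defined variant of retx(): the caller provides the
   storage, so the bytes are still alive when they are later handed to puts(). */
char *retx_into(char *dst, size_t dst_size)
{
    static const char text[] = "buf";   /* assumed contents: 4 bytes including the terminator */
    assert(dst != NULL && dst_size >= sizeof text);
    memcpy(dst, text, sizeof text);     /* copy into the caller's object */
    return dst;                         /* points into the caller's storage, not this frame */
}
```

A caller would declare char buf[4]; in its own frame and call puts(retx_into(buf, sizeof buf)); the array then outlives the call that filled it, whatever the calling convention.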
Clang uses the C calling convention of the target it's compiling for (see footnote 1). Different OSes on the same ISA can have different conventions, although outside of x86 most ISAs only have one major calling convention. x86 has been around so long that the original calling conventions (stack args with no register args) became inefficient, so various 32-bit conventions evolved. And Microsoft chose a different 64-bit convention from everyone else. So there's x86-64 System V, Windows x64, i386 System V for 32-bit x86, AArch64's standard convention, PowerPC's standard convention, etc.
I have tested with clang several times and every time I displayed the string
The "decision" / "luck" of whether it "works" or not is made at compile time, not runtime. Compiling / running the same source multiple times with the same compiler tells you nothing.
Look at the generated asm to find out where char buf[4] ends up.
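Reading the asm is the reliable way to do that. As a rough runtime hint only, one can record the address of the callee's local as an integer while it is still alive and compare it with a local in the caller. This is a minimal sketch: the names callee_local_addr and frame_distance are hypothetical, and the GNU-style noinline attribute is assumed so that the callee gets its own frame.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE   /* assumption: other compilers may inline, which changes the layout */
#endif

/* Returns the address of a local as an integer while the local is alive.
   The integer is only compared later, never converted back and dereferenced. */
NOINLINE uintptr_t callee_local_addr(void)
{
    volatile char buf[4] = "buf";
    return (uintptr_t)(volatile void *)buf;
}

/* Signed distance in bytes from a local in this frame to the callee's buf. */
intptr_t frame_distance(void)
{
    volatile char marker = 0;
    uintptr_t inner = callee_local_addr();
    uintptr_t outer = (uintptr_t)(volatile void *)&marker;
    assert(inner != outer);   /* both objects were alive at the same time, so they cannot share an address */
    return (intptr_t)(inner - outer);
}
```

On a typical downward-growing stack the distance comes out negative, meaning buf sat below the caller's frame, where a later call is free to overwrite it. The number describes one particular build, not a guarantee.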
My guess: maybe you're on Windows x64. Happening to work is more plausible there than most calling conventions, where you'd expect buf[4] to end up below the stack pointer in main, so the call to puts, and puts itself, would be very likely to overwrite it.
If you're on Windows x64 compiling with optimization disabled, retx()'s local char buf[4] might be placed in the shadow space it owns. The caller then calls puts() with the same stack alignment, so retx's shadow space becomes puts's shadow space.
And if puts happens not to write its shadow space, then the data in memory that retx stored is still there. e.g. maybe puts is a wrapper function that in turn calls another function, without initializing a bunch of locals for itself first. But not a tailcall, so it allocates new shadow space.
(But that's not what clang 8.0 does in practice with optimization disabled. Using __attribute__((ms_abi)) to get Windows x64 code-gen from Linux clang, it looks like buf[4] is placed below RSP and gets stepped on there: https://godbolt.org/z/2VszYg)
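To repeat that kind of experiment, a function can be annotated with the attribute and compiled to asm. The sketch below is a hypothetical stand-in, not the code behind the Godbolt link: the name retx_win64 and the WIN64_ABI macro are made up, and it writes into caller-provided storage so that it stays well-defined. The attribute only exists for x86-64 targets in GCC and clang, so the macro falls back to the native convention elsewhere.

```c
#include <assert.h>
#include <string.h>

/* ms_abi is only available for x86-64 targets in GCC/clang. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define WIN64_ABI __attribute__((ms_abi))
#else
#define WIN64_ABI   /* assumption: use the native convention on other targets */
#endif

/* Hypothetical stand-in for retx(): fills a local first, then copies it out.
   In an unoptimized build, the asm shows where the local tmp[4] is placed. */
WIN64_ABI char *retx_win64(char *dst)
{
    char tmp[4] = "buf";
    assert(dst != NULL);
    memcpy(dst, tmp, sizeof tmp);   /* copy out before tmp's lifetime ends */
    return dst;
}
```

Where tmp lands relative to RSP and the shadow space depends on the compiler version and options, which is why the asm has to be checked for each build.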
But it's also possible in stack-args conventions where padding is left to align the stack pointer by 16 before a call. (e.g. modern i386 System V on Linux for 32-bit x86). puts() has an arg but retx() doesn't, so maybe buf[4] ended up in memory that the caller "allocates" as padding before pushing a pointer arg for puts.
Of course that would be unsafe because the data would be temporarily below the stack pointer, in a calling convention with no red-zone. (Only a few ABIs / calling conventions have red zones: memory below the stack pointer that's guaranteed not to be clobbered asynchronously by signal handlers, exception handlers, or debuggers calling functions in the target process.)
I wondered if enabling optimization would make it inline and happen to work. But no, I tested that for Windows x64: https://godbolt.org/z/k3xGe4. clang and MSVC both optimize away any stores of "buf\0" into memory. Instead they just pass puts a pointer to some uninitialized stack memory.
Code that breaks with optimization enabled is almost always UB.
Footnote 1: Except for x86-64 System V, where clang uses an extra un-documented "feature" of the calling convention: narrow integer types as function args in registers are assumed to be sign- or zero-extended to 32 bits, according to their signedness. gcc and clang both do this when calling, but ICC does not, so calling clang functions from ICC-compiled code can cause breakage. See "Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI?"
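A minimal sketch of the kind of function where this matters; the name lookup and its parameters are hypothetical. The C itself is fully defined. The footnote concerns only how the callee widens idx: a callee that trusts the caller may start from the 32-bit register value instead of re-extending the low 16 bits itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: a narrow signed index used to address memory.
   The index must be widened to pointer width before the load. */
int lookup(const int *table, short idx)
{
    assert(table != NULL);
    return table[idx];
}
```

If the caller was built by a compiler that leaves garbage in the upper bits of the register holding idx, a callee that relies on the extension could compute the wrong address, even though both sides follow the documented ABI.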