PLT stubs intentionally use longer immediates and jump displacements than necessary so they're a constant size even when you have enough PLT entries that the `jmp ?_001` in the fall-through path needs a rel32 to reach from later PLT entries.
They're automatically generated by the linker when linking code that used `call printf wrt ..plt`, or when linking a non-PIE that just used `call printf`.
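As a minimal sketch of the kind of source that makes the linker emit a PLT stub, here is a small NASM program that calls `printf` through the PLT. The file name, the `fmt` label, and the build commands are illustrative assumptions, not taken from the question.

```nasm
; hello_plt.asm (hypothetical file name)
; Assumed build:  nasm -felf64 hello_plt.asm && gcc hello_plt.o -o hello_plt
default rel                     ; make [fmt] RIP-relative, as a PIE requires

extern printf
global main

section .rodata
fmt:    db "answer: %d", 10, 0  ; format string, newline, terminating 0

section .text
main:
    sub     rsp, 8              ; realign the stack to 16 bytes before the call
    lea     rdi, [fmt]          ; 1st arg: pointer to the format string
    mov     esi, 42             ; 2nd arg: value for %d
    xor     eax, eax            ; AL = 0: no vector registers used for varargs
    call    printf wrt ..plt    ; linker creates a PLT stub and points this call at it
    add     rsp, 8
    xor     eax, eax            ; return 0 from main
    ret
```

Disassembling the linked executable (for example with `objdump -d`) should show the call going to a stub in the PLT section rather than directly to libc.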
You can avoid the PLT entirely by writing `call [rel printf wrt ..got]`, like GCC does when you compile with `-fno-plt`. This does early binding (instead of lazy): all the GOT entries are resolved at startup, before your `_start` runs.
See "Can't call C standard library function on 64-bit Linux from assembly (yasm) code". Using `default rel` lets you leave out the explicit `rel` part of the addressing mode. The equivalent AT&T syntax is `call *printf@GOTPCREL(%rip)`.
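For comparison, a minimal sketch of the no-PLT style described above: the call loads the function pointer from the GOT, so the linker has no reason to create a stub for it. As before, the file name, the `fmt` label, and the build commands are assumptions for illustration.

```nasm
; hello_got.asm (hypothetical file name)
; Assumed build:  nasm -felf64 hello_got.asm && gcc hello_got.o -o hello_got
default rel                         ; RIP-relative addressing by default

extern printf
global main

section .rodata
fmt:    db "answer: %d", 10, 0      ; format string, newline, terminating 0

section .text
main:
    sub     rsp, 8                  ; realign the stack to 16 bytes before the call
    lea     rdi, [fmt]              ; 1st arg: pointer to the format string
    mov     esi, 42                 ; 2nd arg: value for %d
    xor     eax, eax                ; AL = 0: no vector registers used for varargs
    call    [rel printf wrt ..got]  ; indirect call through the GOT entry, no PLT stub
                                    ; (with default rel in effect, "rel" is redundant here)
    add     rsp, 8
    xor     eax, eax                ; return 0 from main
    ret
```

Both versions should print `answer: 42`; the difference is only in how the call reaches `printf`.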
I don't know if this fixed-width array of PLT stubs is strictly necessary for anything at run time. For example, lazy dynamic linking only modifies the GOT, not the PLT itself, because modern PLTs use an indirect jump. The `push 0` pushes the index of the PLT entry, but I don't think anything uses it to find the address of that stub's machine code; it's only used to index a GOT entry.
At this point it might just be a missed optimization in the linker. NASM isn't generating the stubs, so you can't really do anything about it.
I seem to recall historically seeing a `jmp rel32` as the first instruction of PLT stubs in 32-bit code, not a `jmp [mem]`, but maybe that was just a guess at how PLT stubs worked before I really knew much. If they ever worked that way, lazy dynamic linking would modify the actual PLT itself to fix up the relative jump target, so indexing the machine code of the PLT entry would be important. (And thus having every entry be fixed width would be important.)
But even 32-bit code doesn't use `jmp rel32` these days, so the PLT stubs are read-only. And in 64-bit code, `jmp rel32` can only reach ±2 GiB, so it wouldn't be usable to reach libraries mapped at a random address.
Note that those longer-than-needed instructions only ever run once for each PLT stub. After the first call, the indirect `jmp` target will be the function in the library. (On the first call, the `jmp` target will be the next instruction after the `jmp`.)
The padding might even be a good thing: too many `jmp` instructions in a single 16-byte block of code is bad for branch predictors on some CPUs. But I think the limit is something like 3 or 4 jumps per 16-byte block of machine code on some AMD CPUs or Core 2, so it wouldn't be hit anyway with a 6-byte `jmp [RIP+rel32]` + 2-byte `push imm8` + 2-byte `jmp rel8`.