The way PLT usage is specified in the System V ABI (and implemented in practice) is, schematically, something like this:
```
# A call from somewhere in code goes into a PLT slot
# (addresses are schematic: on x86-64 the call is encoded PC-relative,
#  and the PLT's jmp reads the GOT slot through a RIP-relative operand)
0x500:
    call 0x1000
...
0x1000:
.PLT1: jmp [0x2000]      # the slot for f in the binary's GOT
       pushq $index_f
       jmp .PLT0
...
0x2000:
# initially points back into .PLT1, at the pushq just past the 6-byte jmp,
# so the first call reaches the lazy-binding routine:
.GOT1: 0x1006
# after the lazy-binding routine has run, it holds:
       0x3000            # the address of the real implementation of f
...
0x3000:
f: ....
```
My question is: isn't the first jmp in the PLT slot redundant? Couldn't this work with an indirect call into the GOT instead? For example:
```
0x500:
    call [0x2000]
...
0x1000:
.PLT1: pushq $index_f
       jmp .PLT0
...
0x2000:
# initially points at the start of .PLT1 (now the pushq),
# so the first call reaches the lazy-binding routine:
.GOT1: 0x1000
# after the lazy-binding routine has run, it holds:
       0x3000            # the address of the real implementation of f
...
0x3000:
f: ....
```
This might have marginal performance benefits, but the reason I'm asking is a recent scramble in the linkers/ELF community to find extra bytes in the 16-byte PLT slot to accommodate Intel IBT (the search failed and resulted in an extra .plt.sec indirection; see 1, 2).