Why is the `jmp` at the start of the PLT stub needed?

Question

The way PLT usage is specified in the System V ABI (and implemented in practice) is schematically something like this:

# A call from somewhere in code is into a PLT slot
# (In reality not a direct call, in x64 typically an rip-relative one)
0x500:   
          call 0x1000   
...

0x1000:
   .PLT1: jmp [0x2000]  # the slot for f in the binary's GOT
          pushq $index_f
          jmp .PLT0
...
0x2000: 
# initially jumps back to .PLT to call the lazy-binding routine:
   .GOT1: 0x1005
# but after that is called:
          0x3000   # the address of the real implementation of f
...
0x3000:
     f:  ....

My question is:

Isn't the first `jmp` in the PLT slot redundant? Couldn't this work with an indirect call through the GOT instead? For example:

0x500:   
          call [0x2000]
...

0x1000:
   .PLT1: pushq $index_f
          jmp .PLT0
...
0x2000: 
# initially points to the start of .PLT1, which invokes the lazy-binding routine:
   .GOT1: 0x1000
# but after that is called:
          0x3000   # the address of the real implementation of f
...
0x3000:
     f:  ....
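To make the comparison concrete, here is a toy Python model of the two schemes. It is only a sketch: the addresses come from the schematics above, while the function names (`resolve`, `call_via_plt`, `call_via_got`, `run`) and the step log are hypothetical and do not model any real dynamic loader. It assumes that in the proposed scheme the GOT slot starts out holding 0x1000, the first instruction of the shortened PLT slot.

```python
# Toy model of lazy binding. Addresses are taken from the schematics;
# everything else (names, the step log) is a hypothetical illustration.
PLT1 = 0x1000       # start of f's PLT slot
PLT1_PUSH = 0x1005  # the push/jmp pair that follows `jmp [0x2000]` today
GOT1 = 0x2000       # f's GOT slot
F = 0x3000          # real implementation of f


def resolve(got, log):
    """Stand-in for .PLT0 plus the dynamic linker: patch the GOT slot, return f."""
    log.append("resolver")
    got[GOT1] = F
    return F


def call_via_plt(got, log):
    """Current scheme: call 0x1000, then jmp [0x2000]."""
    log.append("call PLT1")
    log.append("jmp [GOT1]")
    target = got[GOT1]
    if target == PLT1_PUSH:   # slot unresolved: fall into push / jmp .PLT0
        target = resolve(got, log)
    log.append("f")
    return target


def call_via_got(got, log):
    """Proposed scheme: call [0x2000] directly from the call site."""
    log.append("call [GOT1]")
    target = got[GOT1]
    if target == PLT1:        # slot unresolved: PLT slot is only push / jmp .PLT0
        target = resolve(got, log)
    log.append("f")
    return target


def run(call, initial_slot):
    """Make two calls and return the step logs of the first and the second."""
    got, log = {GOT1: initial_slot}, []
    call(got, log)            # first call triggers lazy binding
    first = len(log)
    call(got, log)            # second call uses the patched slot
    return log[:first], log[first:]


current = run(call_via_plt, PLT1_PUSH)
proposed = run(call_via_got, PLT1)
print(current)
print(proposed)
```

In this model the only difference between the two logs is the extra `jmp [GOT1]` step that the current scheme takes on every call, including calls made after the symbol has been resolved.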

This might have marginal performance benefits, but the reason I'm asking is a recent scramble in the linker/ELF community to find extra bytes in a 16-byte PLT slot to accommodate Intel IBT (the search failed and resulted in an extra `.plt.sec` indirection; see 1, 2).

You must jump to the real function not call it. You could however replace the resolving `push`+`jmp` with a `call` if the resolver looked at the return address to figure out which function it is. — Jester, Aug 27 '23 at 13:40
@Jester (1) Isn't `call+jmp` equivalent to `call`ing the jmp destination? (2) You can't replace `push+jmp` with `call`, because after resolution the resolver calls `f` and you want its `ret` to return to the original call site. – Ofek Shilon, Aug 27 '23 at 13:46
1) The `call` is in the original caller, the PLT should just `jmp` 2) you can if the resolver pops off the return address and uses that to determine which function it is. Also the resolver will not call `f` either, it will jump to it (or if it does, then it does a `ret` afterwards). — Jester, Aug 27 '23 at 14:06
@Jester Note that in my hypothetical scheme the call is an indirect call into the address in the *GOT*, not to the *PLT*. I still can't see why a jmp is necessary. — Ofek Shilon, Aug 27 '23 at 14:14
Ahha, you mean the original call should be indirect via GOT, okay. — Jester, Aug 27 '23 at 14:22
You still need code somewhere that does `jmp [got]` in case anybody needs a function pointer. — Jester, Aug 27 '23 at 14:27
@Jester I thought so too, but learnt here on SO (https://stackoverflow.com/questions/76243294/c-c-taking-the-address-of-a-function-imported-from-a-shared-library) that this isn't needed today. When a func address is taken the function is bound early, and the code takes the address from the (resolved) got slot. (I think the SystemV ABI spec is out of date there) — Ofek Shilon, Aug 27 '23 at 14:32
Yeah, `gcc -fno-plt` will put `call [rip + foo@GOTPCREL]` into caller so no separate `jmp` is needed. But if you *do* have a PLT, it needs to `jmp` to the target function for calls after the initial one. (After lazy resolving. Or for early binding but still using the PLT, the GOT entry will be correct even before the first call so only the `jmp [mem]` part ever executes, not the push/jmp.) — Peter Cordes, Aug 27 '23 at 18:46
@PeterCordes `-fno-plt` disables lazy binding entirely - that is not my intention. Seems to me lazy-binding could work with the hypothetical scheme above: (1) the call in code is `call [rip+foo@GOTPCREL]`, (2) the GOT entry `rip+foo@GOTPCREL` initially contains the address of `foo@PLT`, (3) `foo@PLT` sets arguments and calls the resolver which overwrites the GOT entry with the address of real `foo`, (4) on future indirect calls through the GOT `call [rip+foo@GOTPCREL]` would call `foo`'s implementation. Why is the `jmp` needed? — Ofek Shilon, Aug 27 '23 at 20:23
Hmm, perhaps that could work. It would need run-time init of each GOT entry with the right absolute address of the PLT stub, each one being different, although perhaps just adding the same constant to what's already in each of them could work, if you init them with a relative address. Also, non-lazy binding performs better in many cases, for programs that aren't too short-lived (e.g. not stuff like `clang --version`), so if you're going to change the traditional mechanism, `-fno-plt` style is a good choice. — Peter Cordes, Aug 27 '23 at 21:05
Also keep in mind that the traditional mechanism dates back to i386 (or to Unix on other platforms?). i386 didn't have x86-64's RIP-relative data addressing, *only* relative direct `jmp`/`call`, so every call site would need to use something extra like `call [ebx+puts@GOT]` or whatever the right `@thing` is, after setting up EBX as a GOT pointer in that function. (Which it already needs for accessing global variables.) Also, the PLT itself needs a position-independent way to access the GOT. (Traditionally, lazy dynamic linking rewrote a direct `jmp rel32` in the PLT, not GOT data.) – Peter Cordes, Aug 27 '23 at 21:08
You'd want a mechanism for handling `auto fptr = &puts;` function pointers. Perhaps just do early binding for those, like now when compiling a PIE, so later calls don't go through the PLT, and code that wants the function pointer just loads directly from the GOT entry. — Peter Cordes, Aug 27 '23 at 21:09
Thinking about this some more, the current PLT design already requires those `.got.plt` absolute addresses to be initialized to point into the middle of each GOT entry. So that's not something that would get worse with your modification. I think PLT entries are usually a fixed size, but I forget if it's normally a power of 2 so they're always aligned. Still, saving space might get them down to 8 bytes. And if only used on the first call, they can be packed without caring about alignment. — Peter Cordes, Aug 28 '23 at 06:29
@PeterCordes perhaps you meant "the middle of each PLT entry"? If so, yes: today GOT slots are initialized to the address of the 2nd instruction in the matching PLT entry; in my modification they'd be initialized to the 1st. Traditional PLT slots are 16 bytes and I wasn't interested in cutting this down, just in using them better to accommodate Intel IBT (see the links at the end of the question). – Ofek Shilon, Aug 28 '23 at 07:03
@PeterCordes your other comment is also exactly how it is handled today: taking a function address forces it to be early bound. See my comment to Jester above. — Ofek Shilon, Aug 28 '23 at 07:07
Yes, typo, I meant middle of each PLT entry. Oh, right, indirect jumps that don't target an `endbr64`. That would be a showstopper for your proposal, since the first call would be an indirect jump/call to the PLT which doesn't start with `endbr64`. Although I guess you'd have room for an `endbr64` since yours wouldn't start with `jmp [rip+rel32]` as the first instruction. (Thanks for including those ABI discussion links.) I guess in the current design, early binding for functions whose address is taken makes sure you don't have an indirect call to a PLT entry (without endbr). – Peter Cordes, Aug 28 '23 at 16:30

Chris Dodd · Answer 1 · 2023-08-28T05:08:00.193


The basic issue is that the original call (at 0x500) is being generated by the compiler, and at that point, the compiler does not know whether this symbol will eventually be in this dynamic object or not. So it generates a simple call (direct, PC relative) as that is the most efficient for the common case of a local call within a dynamic object.

It is not until the linker runs that we know whether this is a symbol in another dynamic object, a globally visible one in this object (that might be overridden), or a local function call. In the last case the linker just makes it a direct call, but in the first two cases it creates a PLT entry for the symbol and makes the call go to the PLT entry.

Your suggestion would save a jump, but would require knowing at compile time for every call whether it needs a PLT entry or not, or would require switching between a direct and indirect call at link time based on whether the PLT was needed or not. On x86, direct and indirect calls are different sizes, so being able to change would be pretty tricky.
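The size difference can be checked by assembling the two call forms by hand. The sketch below is only an illustration, not part of any toolchain: the helper names and the displacement `0x100` are made up, and the opcodes are the standard x86-64 encodings (`E8 rel32` for a direct near call, `FF 15 disp32` for `call [rip+disp32]`). The third helper shows a direct call padded with a `0x67` address-size prefix, which makes it the same length as the indirect form.

```python
import struct


def call_rel32(disp):
    """Direct near call: opcode E8 followed by a 32-bit displacement."""
    return b"\xe8" + struct.pack("<i", disp)


def call_rip_indirect(disp):
    """Indirect call through memory, call [rip+disp32]: FF /2 with ModRM 0x15."""
    return b"\xff\x15" + struct.pack("<i", disp)


def padded_call_rel32(disp):
    """Direct call with a 0x67 address-size prefix, padding it to 6 bytes."""
    return b"\x67" + call_rel32(disp)


# 0x100 is an arbitrary, hypothetical displacement.
direct = call_rel32(0x100)
indirect = call_rip_indirect(0x100)
padded = padded_call_rel32(0x100)
print(len(direct), len(indirect), len(padded))
```

Because the direct form is one byte shorter, a linker cannot turn a direct call into an indirect one in place. Going the other way is possible only by padding the direct call out to six bytes.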

edited Aug 28 '23 at 05:08

answered Aug 28 '23 at 04:17


Calls from a shared library are generated through the PLT even for functions in the same library, by default. Symbols with “default” visibility are interposable, and interposition can happen only at runtime. – Ofek Shilon Aug 28 '23 at 05:43

`gcc -fno-plt` would have the same problem (of unnecessary indirection for symbols that are found in the linker inputs). It's solved by "relaxing" `call [rip+rel32]` to `a32 call rel32` direct calls with a dummy address-size prefix that has no effect on how it executes. (But is needed for the instruction to take the same space in the machine code without inserting a `nop`.) There's a special relocation type for "relaxable" calls. (example in [Can't call C standard library function on 64-bit Linux from assembly (yasm) code](https://stackoverflow.com/q/52126328) - NASM uses non-relaxable :/) – Peter Cordes Aug 28 '23 at 06:41

@OfekShilon: But you don't want that most of the time, and even when compiling with `-fPIE` or with `-fno-pie -no-pie` (where even by default GCC will make direct calls to other functions), GCC doesn't know whether an undefined symbol will be found in another `.o` or only in a `.so` shared object. GCC handles this by either letting the linker rewrite calls to go through the PLT if needed (traditional `-fno-pie`), or by having the linker relax `call foo@plt` to `call foo` (`-fPIE` without `-fno-plt`, or visibility=hidden). Or see my previous comment re: relaxing `call [rip+rel32]`. – Peter Cordes Aug 28 '23 at 06:47
