
I am trying to call a function - that should have an absolute address when compiled and linked - from machine code. I am creating a function pointer to the desired function and trying to pass that to the call instruction, but I noticed that the call instruction takes at most a 16 or 32-bit address. Is there a way to call an absolute 64-bit address?

I am deploying for the x86-64 architecture and using NASM to generate the machine code.

I could work with a 32-bit address if I could be guaranteed that the executable would be for sure mapped to the bottom 4GB of memory, but I am not sure where I could find that information.

Edit: I cannot use the callf instruction, as that requires me to disable 64-bit mode.

Second Edit: I also do not want to store the address in a register and call the register, as this is performance critical, and I cannot have the overhead and performance hit of an indirect function call.

Final Edit: I was able to use the rel32 call instruction by ensuring that my machine code was mapped into the first 2GB of memory. I achieved this through mmap with the MAP_32BIT flag (I'm using Linux):

MAP_32BIT (since Linux 2.4.20, 2.6) Put the mapping into the first 2 Gigabytes of the process address space. This flag is supported only on x86-64, for 64-bit programs. It was added to allow thread stacks to be allocated somewhere in the first 2GB of memory, so as to improve context-switch performance on some early 64-bit processors. Modern x86-64 processors no longer have this performance problem, so use of this flag is not required on those systems. The MAP_32BIT flag is ignored when MAP_FIXED is set.
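As a rough illustration of the approach described in the final edit, here is a minimal C sketch of requesting a buffer in the low 2GB with MAP_32BIT. The helper name `alloc_low_code_buffer` is hypothetical, and the choice to map the buffer read/write first (and switch it to executable with mprotect after emitting code) is an assumption, not something stated in the question.

```c
#define _GNU_SOURCE            /* exposes MAP_32BIT and MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve a read/write buffer for generated code in the
 * low 2 GiB of the address space. Returns NULL if MAP_32BIT is unavailable
 * on this platform or if the mapping fails. */
static void *alloc_low_code_buffer(size_t len)
{
#ifdef MAP_32BIT
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    (void)len;
    return NULL;               /* not Linux x86-64: no MAP_32BIT */
#endif
}
```

After writing machine code into the buffer, a caller would presumably use `mprotect(buf, len, PROT_READ | PROT_EXEC)` before jumping to it.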

Peter Cordes

2 Answers


related: Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code has more about JITing, especially allocating your JIT buffer near the code it wants to call, so you can use efficient call rel32. Or what to do if not.

Also Call an absolute pointer in x86 machine code is a good canonical Q&A about call or jmp to an absolute address.


TL;DR: To call a function by name, just use call func like a normal person and let the assembler + linker take care of it. Since you say you're using NASM, I guess you're actually generating the machine code with an assembler. It sounded like a more complicated question, but I think you were just trying to ask if the normal way was safe.


Indirect call r/m64 (FF /2) takes a 64-bit register or memory operand in 64-bit mode.

So you can do

func equ  0x123456789ab
; or if func is a regular label

mov   rax, func          ; mov r64, imm64,  or mov r32, imm32 if it fits
call  rax

Normally you'd put a label address into a register with lea rax, [rel func], but that needs the label to be within rel32 range, and if it is, you'd just use call rel32 directly.
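For code that emits the bytes itself rather than going through NASM, the mov / call rax pair above could be written out roughly as follows. This is a sketch only: `emit_call_abs64` is a made-up name, and it always uses the 10-byte mov r64, imm64 form, even when a shorter mov r32, imm32 would fit.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes `mov rax, imm64` followed by `call rax` into out, which must have
 * room for at least 12 bytes. Returns the number of bytes written. */
static size_t emit_call_abs64(uint8_t *out, uint64_t target)
{
    size_t n = 0;
    out[n++] = 0x48;                  /* REX.W prefix */
    out[n++] = 0xB8;                  /* mov rax, imm64 (B8+rd with rd = 0) */
    for (int i = 0; i < 8; i++)       /* 64-bit immediate, little-endian */
        out[n++] = (uint8_t)(target >> (8 * i));
    out[n++] = 0xFF;                  /* call r/m64 (FF /2) */
    out[n++] = 0xD0;                  /* ModRM: mod=11, reg=2, rm=rax */
    return n;
}
```

Note that the sequence clobbers rax, so the generated code must not be holding a live value there.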


Or, if you know what address your machine code will be stored in, you can use the normal direct call rel32 encoding, after you calculate the difference in address from the target to the end of the call instruction.

If you don't want to use an indirect call, then the rel32 encoding is your only option. Make sure your machine code goes into the low 2GiB so it can reach any address in the low 4GiB.
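The displacement calculation for a direct call can be sketched in C like this. The function name `emit_call_rel32` and its address parameters are illustrative assumptions: `call_addr` is the address where the first byte of the call will live at run time, not necessarily where it is being assembled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes `call rel32` (opcode E8 followed by a 32-bit displacement) for a
 * call instruction located at call_addr. Returns false, writing nothing,
 * when the target is outside the +-2 GiB range. */
static bool emit_call_rel32(uint8_t out[5], uint64_t call_addr, uint64_t target)
{
    /* The displacement is relative to the end of the 5-byte instruction. */
    int64_t disp = (int64_t)(target - (call_addr + 5));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return false;

    uint32_t rel = (uint32_t)(int32_t)disp;
    out[0] = 0xE8;
    for (int i = 0; i < 4; i++)       /* rel32, little-endian */
        out[1 + i] = (uint8_t)(rel >> (8 * i));
    return true;
}
```

When this returns false, the emitter has to fall back to the indirect mov / call sequence or place the buffer closer to the target.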


if I could be guaranteed that the executable would be for sure mapped to the bottom 4GB of memory

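Yes for a traditional non-PIE executable on Linux, which is linked into the low 2GiB of virtual address space. More generally, AMD64 call / jump instructions and RIP-relative addressing only use rel32 encodings, so Linux, Windows, and OS X all default to a "small" code model where the code and static data of one executable or library are within 2GiB of each other. That guarantees the linker can just fill in a rel32 to reach up to 2G forward or 2G backward. A PIE executable or shared library keeps that property but can be loaded at a high address, and OS X maps executables above the low 4GiB, so don't count on a low absolute address there.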

The x86-64 System V ABI does discuss Large / Huge code models, but IDK if anyone ever uses that, because of the inefficiency of addressing data and making calls.


re: efficiency: yes, mov / call rax is less efficient. I think it's significantly slower if branch prediction misses and can't provide a target prediction from the BTB. However, even call rel32 and jmp rel32 still need the BTB for full performance. See Slow jmp-instruction for experimental results from relative jmp next_insn slowing down when there are too many in a giant loop.

With hot branch predictors, the indirect version is only extra code size and an extra uop (the mov). It might consume more prediction resources, but maybe not even that.

See also What branch misprediction does the Branch Target Buffer detect?

Peter Cordes
  • Thanks for the helpful info on the default code model. Do you happen to have a source I could look at? – Alexander Bolinsky Aug 15 '16 at 19:17
  • @AlexanderBolinsky: yes, that URL in the last paragraph of my answer is a link to the official ABI standard. See also the [x86 tag wiki](http://stackoverflow.com/tags/x86/info) for links to other ABIs. – Peter Cordes Aug 15 '16 at 19:22
  • @PeterCordes: What is the difference in performance between `call rel32` and `call r64`? – Rudy Velthuis Aug 15 '16 at 19:45
  • I get what they do. What is the difference in **performance** between those two? – Rudy Velthuis Aug 15 '16 at 20:06
  • @RudyVelthuis: Added some stuff about performance of direct jumps, which do still need prediction resources. Oops, I see I missed a word when reading your previous comment. Too many mental context switches between this and another answer thread. – Peter Cordes Aug 15 '16 at 20:06
  • Actually, two words. – Rudy Velthuis Aug 15 '16 at 20:09

In the new APX extension, Intel added a JMPABS instruction that takes a 64-bit immediate as an absolute jump target.

Unfortunately there's no CALLABS, so you'll need to work around it like this:

nearby_trampoline:
    jmpabs target64
...
call nearby_trampoline

I don't know whether it's faster than the traditional mov reg, target64; call reg sequence. However, APX also added 16 more registers and 3-operand integer instructions (i.e. non-destructive destinations), so register pressure is much less likely to be a problem, and you can just spare one register for the absolute address and use call reg directly.

phuclv
  • I'd expect `mov reg,imm64` / `call reg` is still faster than jmp/call. (I added an APX footnote to my answer on [Call an absolute pointer in x86 machine code](https://stackoverflow.com/a/36511513) the other day). Good point that existing code being adapted to APX can use one of the new registers for the jump target even if all 16 old regs are occupied, at the cost of three extra bytes of code-size (REX2 instead of REX for `mov r16, imm64`, and a REX2 on the `call r16`, vs. `mov rax, imm16` / `call rax`) – Peter Cordes Aug 01 '23 at 04:42
  • (Also, I'd recommend against using the word "far" in your example label names. It's a long distance, but it's not a `far` jump to a new `cs`. Perhaps `nearby_abs_jmp`, or `nearby_trampoline`, since the point is that the `call` target is within rel32 range, probably very close to the call site). Hrm, if we're putting other stuff near the call site, maybe just put the absolute address as data and `call qword [RIP+rel32]`, like how `gcc -fno-plt` does code-gen. But then you're dependent on d-cache hits to not stall. – Peter Cordes Aug 01 '23 at 04:42
  • @PeterCordes I don't think your mention of `mov rax, imm16` is correct – ecm Aug 01 '23 at 05:26
  • @ecm: Oops, I meant `rax, imm64` / `call rax`. of course. I typoed 16 since I guess I was still thinking about the register number I picked. – Peter Cordes Aug 01 '23 at 05:29