I'm writing a JIT compiler in C for x86_64 linux.

Currently the idea is to generate some bytecode in a buffer of executable memory (e.g. obtained with an mmap call) and jump to it with a function pointer.

I'd like to be able to link multiple blocks of executable memory together such that they can jump between each other using only native instructions.

Ideally, the C-level pointer to an executable block can be written into another block as an absolute jump address something like this:

unsigned char code_1[] = { 0xAB, 0xCD, ... };
void *exec_block_1 = mmap(NULL, ... );
write_bytecode(code_1, exec_block_1);
...
unsigned char code_2[] = { 0xAB, 0xCD, ... , exec_block_1, ... };
void *exec_block_2 = mmap(NULL, ... );
write_bytecode(code_2, exec_block_2); // bytecode contains exec_block_1 as a jump
                                      // address so that the code in the second block
                                      // can jump to the code in the first block

However I'm finding the limitations of x86_64 quite an obstacle here. There's no way to jump to an absolute 64-bit address in x86_64 as all available 64-bit jump operations are relative to the instruction pointer. This means that I can't use the C-pointer as a jump target for generated code.

Is there a solution to this problem that will allow me to link blocks together in the manner I've described? Perhaps an x86_64 instruction that I'm not aware of?

1201ProgramAlarm
AlexJ136
  • Hmm, maybe you are over-estimating the need to generate more than 2 gigabytes of code. The advantage of a jitter is that you can always tell that you need to fall back to an indirect jump, like `jmp rax`. – Hans Passant Apr 22 '15 at 13:27
  • @HansPassant That's a good point. At the moment my goal is just to implement the simplest thing that works, and worry about performance later. – AlexJ136 Apr 22 '15 at 13:36
  • Also related: [Handling calls to far away intrinsic functions in a JIT](//stackoverflow.com/q/54947302)/ re: allocating blocks near each other with `mmap` with a hint address, so you can use a direct `call` or `jmp rel32` encoding. – Peter Cordes Apr 12 '19 at 08:38

2 Answers

If you know the addresses of the blocks at the time when you are emitting the jump instructions, you can just check to see if the distance in bytes from the address of the jump instruction to the address of the target block fits within the 32-bit signed offset of the jXX family of instructions.
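As a rough illustration of that range check, here is a minimal sketch in C. The helper name emit_jmp_rel32 is hypothetical (it is not from the question), and it assumes the emitter already knows both the address where the jump will live and the address of the target block.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: writes `jmp rel32` (opcode E9) at `at`, targeting
 * `target`. Returns false and writes nothing if the displacement does not
 * fit in a signed 32-bit value. */
bool emit_jmp_rel32(uint8_t *at, const void *target)
{
    /* rel32 is measured from the end of the 5-byte jump instruction. */
    intptr_t disp = (intptr_t)target - ((intptr_t)at + 5);
    if (disp < INT32_MIN || disp > INT32_MAX)
        return false;

    int32_t rel = (int32_t)disp;
    at[0] = 0xE9;                      /* jmp rel32 */
    memcpy(at + 1, &rel, sizeof rel);  /* x86 stores the immediate little-endian */
    return true;
}
```

When this returns false, the emitter can fall back to an indirect jump through a register.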

Even if you mmap each block separately, chances are pretty good that you won't get two neighbouring (in the control-flow sense) blocks that are more than ±2GiB apart. That being said, there are several good reasons not to map each block separately like that. First of all, mmap's minimum unit of allocation is (almost by definition) a page, which is probably at least 4KiB. That means that the unused space after the code for each block is wasted. Secondly, packing the basic blocks more tightly increases the utilization of the instruction cache and the chances of a shorter jump encoding being valid.
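One way to pack blocks tightly is to map a single large region once and hand out pieces of it. The sketch below is only an illustration under assumptions: the code_arena type and the arena_* functions are hypothetical names, the 16-byte alignment is an arbitrary choice, and the region is mapped writable first and switched to executable afterwards with mprotect.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical arena: one mapping shared by all generated blocks. */
typedef struct {
    uint8_t *base;  /* start of the mapping */
    size_t   size;  /* total bytes mapped */
    size_t   used;  /* bytes handed out so far */
} code_arena;

/* Maps `size` bytes of read/write memory. Returns 0 on success, -1 on failure. */
int arena_init(code_arena *a, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    a->base = p;
    a->size = size;
    a->used = 0;
    return 0;
}

/* Bump-allocates `len` bytes, 16-byte aligned. Returns NULL when full. */
uint8_t *arena_alloc(code_arena *a, size_t len)
{
    size_t start = (a->used + 15) & ~(size_t)15;
    if (start > a->size || len > a->size - start)
        return NULL;
    a->used = start + len;
    return a->base + start;
}

/* Switches the whole region to read/execute once the code has been written. */
int arena_make_executable(code_arena *a)
{
    return mprotect(a->base, a->size, PROT_READ | PROT_EXEC);
}
```

Because every block comes from the same mapping, any two blocks are at most the arena size apart, so a rel32 jump between them is always encodable as long as the arena is smaller than 2GiB.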

> Perhaps an x86_64 instruction that I'm not aware of?

Incidentally, there is an instruction for loading a 64-bit immediate into a general-purpose register such as rax. The GNU toolchain refers to it as movabs:

0000000000000000 <.text>:
   0:   48 b8 ff ff ff ff ff    movabs rax,0x7fffffffffffffff
   7:   ff ff 7f

So if you really want to, you can simply load the pointer into rax and use a register-indirect jump (jmp rax).
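For illustration, a JIT could emit that pair of instructions with something like the following. This is a sketch, not a definitive implementation: emit_jmp_abs64 is a hypothetical name, and it assumes rax is free to be clobbered at the jump site.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: writes `movabs rax, target` followed by `jmp rax`
 * at `at`. Returns the number of bytes written (12). The generated code
 * overwrites rax when it runs. */
size_t emit_jmp_abs64(uint8_t *at, const void *target)
{
    uint64_t imm = (uint64_t)(uintptr_t)target;

    at[0] = 0x48;                      /* REX.W prefix */
    at[1] = 0xB8;                      /* mov rax, imm64 */
    memcpy(at + 2, &imm, sizeof imm);  /* 8-byte little-endian address */
    at[10] = 0xFF;                     /* jmp r/m64 ... */
    at[11] = 0xE0;                     /* ... with ModRM selecting rax */
    return 12;
}
```

The C pointer returned for the first block can be passed directly as target, which is what the question asks for, at the cost of 12 bytes and one scratch register per jump.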

Martin Törnwall
  • iTLB effects would be a more relevant concern, unless you're really only talking about tiny basic blocks. i-cache locality is in 64-byte chunks (or 128-byte considering L2 adjacent-line prefetch), and short-jmp is `rel8` -128..+127 byte displacements. But yes, movabs / `jmp *rax` is the obvious choice. See also [Handling calls to far away intrinsic functions in a JIT](//stackoverflow.com/q/54947302) – Peter Cordes Apr 12 '19 at 07:14

Hmm, I'm not sure I understood your question clearly, or whether this is a proper answer. It's quite a convoluted way to achieve this:

    ;instr              ; opcodes [op size] (comment)
    call next           ; e8 00 00 00 00 [5] (call to get current location)
next:
    pop rax             ; 58 [1]  (next label address in rax)
    add rax, 12h        ; 48 83 c0 12 [4] (adjust rax to fall on landing label)
    push rax            ; 50 [1]  (push adjusted value)
    mov rax, code_block ; 48 b8 XX XX XX XX XX XX XX XX [10] (load target address)
    push rax            ; 50 [1] (push to ret to code_block)
    ret                 ; c3 [1] (go to code_block)
landing:    
    nop
    nop

e8 00 00 00 00 is just there to get the current instruction pointer onto the top of the stack. The code then adjusts rax so that it points at the landing label and pushes it as a return address. You'll need to replace the XX bytes (in mov rax, code_block) with the virtual address of the code block. The ret instruction is used as a call: it pops the pushed target address and jumps to it. When the called block returns, execution should resume at landing.

Is this the kind of thing you're trying to achieve?

Neitsa
  • Thanks. Doing `mov rax, code_block`, `push rax`, `ret` has the effect I'm looking for. Ideally I'd like something that doesn't need to touch the stack, but this will suffice for now. – AlexJ136 Apr 24 '15 at 16:50
  • Be warned, though: Modern Intel processors keep track of return addresses for branch target prediction. Manually pushing another address and then using `ret` to jump to it interferes with the predictor and will likely result in lower performance. See the optimization manual for more information. – Martin Törnwall Apr 26 '15 at 06:41
  • @AlexJ136: This is almost maximally inefficient compared to `call rax` or `jmp rax`. x86-64 has register-indirect call and jmp, use them! push/ret is much worse because it breaks the Return-Address predictor Stack by mismatching call and ret. (The `call next` is a special case and doesn't go on the RAS in most CPUs http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/#call0, so this leaves it unbalanced for future returns *as well as* forcing this `ret` to mispredict.) Using `call`/`pop` to find out your own address is also insane in x86-64; that's why we have RIP-relative LEA. – Peter Cordes Apr 12 '19 at 07:29