
I am currently trying to JIT via Python, and I found PeachPy via another SO question. For the most part this is easy, but I am failing to call external C functions. I want to call putchar, a function with a single argument. Since I am on Windows with x86-64, I expect to put the single argument into rcx and then execute call with the function's address. For this I wrote this code:

from peachpy import *
from peachpy.x86_64 import *
import ctypes


putchar_address = ctypes.addressof(ctypes.cdll.msvcrt.putchar)
c = Argument(uint64_t)

with Function("p", (c,), int64_t) as asm_function:
    LOAD.ARGUMENT(rcx, c)
    MOV(r8, putchar_address)
    CALL(r8)
    RETURN(rax)


raw = asm_function.finalize(abi.detect()).encode()
python_function = raw.load()

print(python_function(48))

The call on the last line crashes with OSError: exception: access violation writing 0x0000029E58C1A978.

I looked through lots of other SO answers, but none really help to solve this problem, and the code is actually the result of these. The most useful was this one: Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code

Edit: A few more things I tried.

PeachPy deliberately does not expose rsp in its public namespace, claiming that it already deals with the stack correctly. But I can still import the register and modify it directly, leading to this code:

from peachpy.x86_64.registers import rsp
#...
    LOAD.ARGUMENT(rcx, c)
    SUB(rsp, 40)
    MOV(r8, putchar_address)
    CALL(r8)
    ADD(rsp, 40)
    RETURN(rax)

This changes the error to a crash with exit code 0xC0000409 (STATUS_STACK_BUFFER_OVERRUN), which points to a corrupted stack.

Here is the disassembly of what PeachPy generates:

Without rsp

0:  49 b8 a8 a8 1a 84 1f    movabs r8,0x21f841aa8a8
7:  02 00 00
a:  41 ff d0                call   r8
d:  c3                      ret 

With rsp

0:  48 83 ec 28             sub    rsp,0x28
4:  49 b8 a8 98 ad 9e ac    movabs r8,0x1ac9ead98a8
b:  01 00 00
e:  41 ff d0                call   r8
11: 48 83 c4 28             add    rsp,0x28
15: c3                      ret 

(From https://defuse.ca/online-x86-assembler.htm)

Based on the output of the C compiler (here: https://godbolt.org/z/BKgk7Y), I created the following code:

    MOV([rsp + 16], rdx)
    MOV([rsp + 8], rcx)
    SUB(rsp, 40)
    MOV(rcx, [rsp + 56])
    CALL([rsp + 48])
    ADD(rsp, 40)
    RETURN(rax)

which produces the same machine code as the C compiler:

0:  48 89 54 24 10          mov    QWORD PTR [rsp+0x10],rdx
5:  48 89 4c 24 08          mov    QWORD PTR [rsp+0x8],rcx
a:  48 83 ec 28             sub    rsp,0x28
e:  48 8b 4c 24 38          mov    rcx,QWORD PTR [rsp+0x38]
13: ff 54 24 30             call   QWORD PTR [rsp+0x30]
17: 48 83 c4 28             add    rsp,0x28
1b: c3                      ret 

This fails as well, so the problem is not in the generated instructions themselves. (I did not even use putchar here, and I still get the same exit code 0xC0000409.)

MegaIng
  • Windows x64 takes the first arg in RCX, not RDX. Look at C compiler output for an example. Also it requires the stack aligned by 16 *before* a call (which pushes a return address), and 32 bytes of shadow space above RSP. So you probably want `sub rsp, 40` before a call, or just tailcall with `jmp`. – Peter Cordes Jun 17 '20 at 20:16
  • @PeterCordes I will add a bit of information. I forgot that I already tried that. – MegaIng Jun 17 '20 at 20:23
  • I'd suggest making your [mcve] be one that could plausibly work, avoiding that obvious bug, then! Windows x64 definitely takes the first arg in RCX, and also requires shadow space. https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019. – Peter Cordes Jun 17 '20 at 20:25
  • @PeterCordes It doesn't change anything in the result. Peachpy even optimizes that away, since the value is already in `rcx`. – MegaIng Jun 17 '20 at 20:37
  • (And both 0x40 and 40 lead to the same result) – MegaIng Jun 17 '20 at 20:38
  • 0x40 doesn't align the stack. The correct value is 40 = 0x28 because RSP is 8 bytes away from a 16-byte boundary on function entry, like I explained in my first comment. If you're not crashing on a misaligned access somewhere in `putchar` then that's not a problem, but it would be better to show a test with an actual correct value instead of obviously wrong trial and error. Anyway, run it under a debugger and see what instruction (in which code) is causing the exception. e.g. if it's somewhere after `putchar` returned, then it probably corrupted the stack. – Peter Cordes Jun 17 '20 at 21:00
  • Also, you might have an easier time if you call a known simple function that doesn't do any I/O, like `isdigit(int c)`, to see if you can single-step into that without crashing. (Pick something obscure so you can set a breakpoint on it and step back out to get a look at the JITed code with a debugger.) – Peter Cordes Jun 17 '20 at 21:03
  • @PeterCordes Do you have another idea based on the last edit? – MegaIng Jun 17 '20 at 22:19
  • You forgot to enable optimization on Godbolt, so you got pointless anti-optimized code that stores/reloads its register args. It's still equivalent to your "With rsp" version, except it's taking the function pointer as a function arg, instead of having the JIT embed it. You could do the same more easily with `jmp rdx` (as the whole function: tailcall with args in ECX, using the function pointer in RDX) – Peter Cordes Jun 17 '20 at 22:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/216171/discussion-between-megaing-and-peter-cordes). – MegaIng Jun 17 '20 at 22:23
  • Your asm looks correct for the Windows x64 ABI, so presumably your exit code "0xC0000409" is coming from some other problem. I haven't used peachpy; maybe you need to do something special to return back to Python? I'd look for an example if peachpy has one. – Peter Cordes Jun 17 '20 at 22:24
  • Looking at the github page, I don't find anything, and just removing the call instructions removes all problems, but that doesn't really help. – MegaIng Jun 17 '20 at 22:33
  • Wait really? It works if you tailcall with a `jmp` instead of `call`/`ret`? – Peter Cordes Jun 17 '20 at 22:34
  • No. At this point Peachpy breaks down for some reason. It doesn't accept the code if it contains jmp instead of or in addition to call/ret. I might be able to work around this with a little bit of hacking, but tail call is not an option for my final product anyway. – MegaIng Jun 17 '20 at 22:36
  • Ok, by "removes all problems" did you mean that if you just `xor eax,eax` / `RETURN(rax)` you can call the function successfully via Python? So the whole PeachPy thing is presumably working overall, but breaking when you call `putchar` from the C library. Maybe you can put an `int3` in there as a software breakpoint. I wonder if there's some problem in calling stdio functions, like whether that could conflict with how Python itself has initialized stdio? – Peter Cordes Jun 17 '20 at 23:48
  • @PeterCordes I actually figured the first half out. The address I get from ctypes is not the pointer to the function, but a pointer to a pointer. So I can now call a function, but if I try to call two functions it breaks. I will post an answer later. – MegaIng Jun 18 '20 at 07:39
  • Oh, so you can leave out the `ctypes.addressof` and get a plain function pointer, hopefully? Instead of the place in memory where dynamic linking stored the function address. – Peter Cordes Jun 18 '20 at 13:11
  • @PeterCordes Actually, the opposite. I have to first dereference the pointer once, and then use the resulting pointer as the address. I will write the answer soon. – MegaIng Jun 18 '20 at 15:02

1 Answer


With the help of @PeterCordes I figured out the important problems.

  • I misunderstood the Windows x64 calling convention. You need to reserve shadow space and align the stack, so 'sub rsp, 40' is required before the call.
  • ctypes.addressof(ctypes.cdll.msvcrt.putchar) does not give the start of the code, but the address of a pointer to the start of the code.
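The value 40 in the first point is just arithmetic: 32 bytes of shadow space, plus 8 bytes so that rsp is 16-byte aligned again at the call (on entry it is 8 bytes off, because the caller's call pushed a return address). A rough sketch of that calculation follows; the helper names stack_adjustment and keeps_alignment are made up for illustration and are not part of PeachPy.

```python
SHADOW_SPACE = 32    # bytes the caller must reserve for the callee (Windows x64)
RETURN_ADDRESS = 8   # pushed by the `call` that entered our function
ALIGNMENT = 16       # required alignment of rsp at the point of a call


def keeps_alignment(n: int) -> bool:
    """True if `sub rsp, n` at function entry leaves rsp 16-byte aligned."""
    # On entry rsp % 16 == 8, so n must make up the missing 8 bytes.
    return (n + RETURN_ADDRESS) % ALIGNMENT == 0


def stack_adjustment(local_bytes: int = 0) -> int:
    """Smallest n that covers shadow space plus locals and keeps alignment."""
    needed = SHADOW_SPACE + local_bytes
    remainder = (needed + RETURN_ADDRESS) % ALIGNMENT
    if remainder:
        needed += ALIGNMENT - remainder
    return needed


print(stack_adjustment())     # 40
print(keeps_alignment(40))    # True
print(keeps_alignment(0x40))  # False: 64 bytes leaves rsp misaligned
```

This also shows why 0x40 from my earlier attempts was not a valid substitute for 40.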

Problem 1 is easy to solve, and Problem 2 needed a bit of tinkering. In the end, this code works:

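import ctypes

from peachpy import *
from peachpy.x86_64 import *
from peachpy.x86_64.registers import rsp  # rsp has to be imported explicitly

c_void_p_p = ctypes.POINTER(ctypes.c_void_p)

# Address of putchar's first instruction, not of the ctypes object wrapping it.
putchar_address = ctypes.addressof(ctypes.cast(ctypes.cdll.msvcrt.putchar, c_void_p_p).contents)
c = Argument(uint64_t)

with Function("p", (c,), int64_t) as asm_function:
    # c is already in rcx on entry, so it is passed straight through to putchar.
    MOV(r12, putchar_address)
    SUB(rsp, 40)  # 32 bytes of shadow space + 8 bytes to realign the stack
    CALL(r12)
    ADD(rsp, 40)
    RETURN()  # putchar's result is still in rax

raw = asm_function.finalize(abi.detect()).encode()
print(raw.code_section.content.hex())
python_function = raw.load()
print(python_function(54))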

This generates this assembly:

0:  41 54                   push   r12
2:  49 bc 90 77 75 4d fa    movabs r12,0x7ffa4d757790
9:  7f 00 00
c:  48 83 ec 28             sub    rsp,0x28
10: 41 ff d4                call   r12
13: 48 83 c4 28             add    rsp,0x28
17: 41 5c                   pop    r12
19: c3                      ret 

And works exactly as expected.
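The pointer-to-pointer issue can be reproduced without any JIT. The sketch below uses ctypes.pythonapi.Py_GetVersion as a stand-in for msvcrt.putchar so that it runs on any platform; that substitution is my own assumption for illustration, and the variable names are arbitrary.

```python
import ctypes

# Any foreign function object works here; Py_GetVersion is a stand-in for putchar.
func = ctypes.pythonapi.Py_GetVersion
func.restype = ctypes.c_char_p

# Address of the ctypes object's internal buffer, which merely *holds* the pointer.
slot_address = ctypes.addressof(func)

# Address of the function's code: what a JITed `call` actually needs.
code_address = ctypes.cast(func, ctypes.c_void_p).value

# Dereferencing the slot once yields the code address.
stored = ctypes.c_void_p.from_address(slot_address).value

# The cast/contents/addressof route arrives at the same value.
c_void_p_p = ctypes.POINTER(ctypes.c_void_p)
via_contents = ctypes.addressof(ctypes.cast(func, c_void_p_p).contents)

# A callable rebuilt from the raw code address behaves like the original.
rebuilt = ctypes.PYFUNCTYPE(ctypes.c_char_p)(code_address)

print(hex(slot_address), hex(code_address))
print(stored == code_address, via_contents == code_address)
print(rebuilt() == func())
```

If this holds on your setup, ctypes.cast(func, ctypes.c_void_p).value is probably the shorter way to get the address to embed in the generated code.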

(Just remember which registers are callee-saved and therefore need to be preserved: r12 is one of them, which is why PeachPy adds the push/pop pair.)
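As a quick reference, here is the integer-register split of the Windows x64 calling convention written as a small lookup. The helper name needs_save is hypothetical, and XMM registers are left out of this sketch.

```python
# Volatile (caller-saved): may be clobbered by any function you call.
VOLATILE = frozenset({"rax", "rcx", "rdx", "r8", "r9", "r10", "r11"})

# Nonvolatile (callee-saved): your function must restore them before returning.
NONVOLATILE = frozenset({"rbx", "rbp", "rdi", "rsi", "rsp",
                         "r12", "r13", "r14", "r15"})


def needs_save(register: str) -> bool:
    """True if a function that modifies `register` must save and restore it."""
    name = register.lower()
    if name in NONVOLATILE:
        return True
    if name in VOLATILE:
        return False
    raise ValueError(f"unknown 64-bit integer register: {register!r}")


print(needs_save("r12"))  # True
print(needs_save("rax"))  # False
```

A volatile register such as rax is enough to hold the target address for a single call, but its value does not survive that call.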

MegaIng
  • You can use RAX instead of R12, avoiding the need for push/pop instructions. (And saving a REX prefix on the `call rax`). But yeah, if you want to make repeated calls to the same function, you can keep its address in a register. – Peter Cordes Jun 18 '20 at 15:29