I'm writing my own JIT-interpreter. How do I execute generated instructions?

Question

I intend to write my own JIT-interpreter as part of a course on VMs. I have a lot of knowledge about high-level languages, compilers and interpreters, but little or no knowledge about x86 assembly (or C for that matter).

Actually, I don't know how a JIT works, but here is my take on it: read in the program in some intermediate language; compile that to x86 instructions; ensure that the last instruction returns to somewhere sane back in the VM code; store the instructions somewhere in memory; do an unconditional jump to the first instruction. Voila!

So, with that in mind, I have the following small C program:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int *m = malloc(sizeof(int));
    *m = 0x90; // NOP instruction code

    asm("jmp *%0"
               : /* outputs:  */ /* none */
               : /* inputs:   */ "d" (m)
               : /* clobbers: */ "eax");

    return 42;

}

Okay, so my intention is for this program to store the NOP instruction somewhere in memory, jump to that location and then probably crash (because I haven't set up any way for the program to return to main).

Question: Am I on the right path?

Question: Could you show me a modified program that manages to find its way back to somewhere inside main?

Question: Other issues I should beware of?

PS: My goal is to gain understanding, not necessarily do everything the right way.

Thanks for all the feedback. The following code seems to be the place to start and works on my Linux box:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

unsigned char *m;

int main() {
        unsigned int pagesize = getpagesize();
        printf("pagesize: %u\n", pagesize);

        m = malloc(1023+pagesize+1);
        if(m==NULL) return(1);

        printf("%p\n", m);
        m = (unsigned char *)(((long)m + pagesize-1) & ~(pagesize-1));
        printf("%p\n", m);

        if(mprotect(m, 1024, PROT_READ|PROT_EXEC|PROT_WRITE)) {
                printf("mprotect fail...\n");
                return 0;
        }

        m[0] = 0xc9; //leave
        m[1] = 0xc3; //ret
        m[2] = 0x90; //nop

        printf("%p\n", m);


        asm("jmp *%0"
                   : /* outputs:  */ /* none */
                   : /* inputs:   */ "d" (m)
                   : /* clobbers: */ "ebx");

        return 21;
}

Another option is to just interpret the instructions or intermediate code w/o executing anything directly. — Alexey Frunze, Jan 26 '12 at 09:37
@Alex: that's another option to implement a language, but by definition it's not a JIT. — Steve Jessop, Jan 26 '12 at 10:06

score 8 · Accepted Answer · answered Jan 26 '12 at 08:43

Question: Am I on the right path?

I would say yes.

Question: Could you show me a modified program that manages to find its way back to somewhere inside main?

I haven't got any code for you, but a better way to get to the generated code and back is to use a pair of call/ret instructions, as they will manage the return address automatically.

Question: Other issues I should beware of?

Yes - as a security measure, many operating systems would prevent you from executing code on the heap without making special arrangements. Those special arrangements typically amount to you having to mark the relevant memory page(s) as executable.

On Linux this is done using mprotect() with PROT_EXEC.
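As a rough illustration of the two points above (return via call/ret, and marking the memory executable), here is a minimal sketch. It assumes Linux on x86 or x86-64 and a GCC/Clang-style compiler. The names run_generated and jit_fn are made up for this example, and it uses mmap() to get a page-aligned buffer rather than malloc(). The generated code is just "mov eax, 42; ret", so calling it through a function pointer returns 42 to the caller.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical name: a function taking no arguments and returning int. */
typedef int (*jit_fn)(void);

/* Hypothetical helper: emits a tiny function, calls it, and returns its
 * result, or -1 if the memory could not be set up.
 * Assumption: the CPU is x86 or x86-64. */
int run_generated(void) {
    /* Machine code for:  mov eax, 42 ; ret  */
    static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };
    size_t len = 4096;  /* one page is plenty for six bytes */

    /* Page-aligned, writable memory straight from the kernel. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memcpy(buf, code, sizeof code);

    /* Switch the page from writable to executable before running it. */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return -1;
    }

    /* The compiler emits a CALL here; the RET in the buffer brings us back. */
    jit_fn fn = (jit_fn)buf;
    int result = fn();

    munmap(buf, len);
    return result;
}
```

Calling run_generated() from main and printing the result should show 42 on such a system. Going through a function pointer lets the compiler handle the return address and the calling convention, so no inline asm is needed.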

answered Jan 26 '12 at 08:43

NPE


In addition, the instruction cache generally does not monitor the underlying memory, so an explicit cache flush may be required before executing the jump. – Simon Richter Jan 26 '12 at 08:57
@Simon: agreed, and in general that's "after you've finished writing the instructions to memory, but before executing them". So in my experience you write the code to do it just after you've finished writing, rather than just before you execute. In this example code those are the same place, but in practice you might execute more times than you write. And it's important, as Simon remarks, to flush the instruction cache rather than the data cache. – Steve Jessop Jan 26 '12 at 10:01
You don't need to flush the I cache on x86, SMC only doesn't work if it's extremely close to eip. Moderately close SMC imposes huge penalties. – harold Jan 26 '12 at 13:08
There is an SO question somewhere about self modifying code and I remember figuring out you have to mmap on aligned boundaries in order to access that memory. Anyone know where that ticket is, that should show you how to generate instructions in memory and then execute them. – old_timer Jan 26 '12 at 15:07
here it is http://stackoverflow.com/questions/4812869/how-to-write-self-modifying-code-in-x86-assembly – old_timer Jan 26 '12 at 15:08
that is from application space, if you are in a kernel driver you likely have better access and simply write the instructions to ram and branch. – old_timer Jan 26 '12 at 15:10
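To make the cache-flush comments above concrete, here is a small sketch that assumes GCC or Clang, which provide the __builtin___clear_cache builtin. The helper name install_code is hypothetical. It flushes right after the code bytes are written, as Steve Jessop suggests. On x86 the builtin typically compiles to nothing, which matches harold's remark, but on architectures such as ARM it matters.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy freshly generated machine code into dst and
 * make sure the instruction cache will see it.
 * Assumption: a GCC- or Clang-compatible compiler. */
void install_code(void *dst, const unsigned char *src, size_t n) {
    memcpy(dst, src, n);

    /* Flush once, right after writing, not before every execution. */
    __builtin___clear_cache((char *)dst, (char *)dst + n);
}
```

The destination buffer still has to be executable before anything jumps to it; this helper only handles the copy and the flush.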

score 3 · Answer 2 · answered Jan 26 '12 at 09:11

If your generated code follows the proper calling convention, then you can declare a pointer-to-function type and invoke the function this way:

typedef void (*generated_function)(void);

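void *func = malloc(1024);   // must be made executable before the call; plain malloc memory usually is not
unsigned char *o = (unsigned char *)func;
generated_function func_exec = (generated_function)func;   // the typedef is already a pointer type

*o++ = 0x90;     // NOP
*o++ = 0xc3;     // RET (near return; 0xcb would be a far return)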

func_exec();
