
I've implemented a little bytecode interpreter using computed goto (see here if not familiar).

It seems it might be possible to do simple JITting by copying the memory between labels, thus optimizing out the dispatch jumps. For example, say I've got the following in my interpreter:

op_inc: val++; DISPATCH();

I would change this to:

op_inc: val++;
op_inc_end:

When JITting, I would append memory between the labels to my output:

memcpy(jit_code+offset, &&op_inc, &&op_inc_end - &&op_inc);

(jit_code is marked executable using mmap)

Finally, I would use computed goto to jump to the beginning of the copied machine code:

goto *(void *)jit_code;

Will this work? Is there something missing in my mental model of machine code that would thwart this idea?

Let's assume that code and data share the same address space. Let's also assume PIC.

Update

Looking at the example in the linked article, after removing DISPATCH, we have:

        do_inc:
            val++;
        do_dec:
            val--;
        do_mul2:
            val *= 2;
        do_div2:
            val /= 2;
        do_add7:
            val += 7;
        do_neg:
            val = -val;
        do_halt:
            return val;

The generated code for do_inc (no optimization) is simply:

Ltmp0:                                  ## Block address taken
## %bb.1:
    movl    -20(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -20(%rbp)

(followed directly by do_dec). It looks like this little snippet could be clipped out.

Taylor
  • @RossRidge why not? Also, I don't need to jump back. – Taylor Jul 03 '19 at 17:33
  • @RossRidge My `op_return` would copy the return instruction, but I suppose there's a problem with the full function epilogue? – Taylor Jul 03 '19 at 17:37
  • You need [position independent code](https://en.wikipedia.org/wiki/Position-independent_code) that is complete and self-contained. – user3386109 Jul 03 '19 at 17:41
  • @user3386109 added that to the question. – Taylor Jul 03 '19 at 17:43
  • @RossRidge all this happens in one stack frame. So returning from the function is exactly what I want to happen at the end of `jit_code`. – Taylor Jul 03 '19 at 17:58
  • Imagine you have three operations. OpX expects input in register B and puts output in register A, which we'll denote `OpX(B->A)`. The other operations are `OpY(B->A)` and `OpZ(B,C->A)`. So if you want the outputs of OpX and OpY to be the input to OpZ, you can't simply concatenate the three code blocks. You need glue logic to move the variables around. – user3386109 Jul 03 '19 at 18:03
  • @user3386109 My interpreter is already jumping around all over the function using computed goto, so there's no precondition for arriving at a label. It's effectively concatenating the code blocks for each instruction. – Taylor Jul 03 '19 at 18:08
  • And that's where this discussion ends, because those of us reading your question can't see what you've already implemented. And with a problem like this, it's all about the details. – user3386109 Jul 03 '19 at 18:14
  • @user3386109 Um, it's the standard computed goto interpreter like the various ones online. I guess I can add more if you really need it. – Taylor Jul 03 '19 at 18:16
  • That comment actually does clear things up a little. One of the issues you'll face is PC relative addressing which is used, for example, to access string literals and `const` data. – user3386109 Jul 03 '19 at 18:22
  • @RossRidge: The `&&` feature they are using is a GCC feature, and GCC supports subtracting pointers to void. – Eric Postpischil Jul 03 '19 at 18:35
  • Yes it could be clipped out because you compiled with no optimization. But only if jumped to with the same stack layout as this function, which precludes calling it as a function. Besides, that's hardly useful. Anything you could do by copying around blocks of un-optimized code could be done more efficiently by writing an interpreter in portable C. – Peter Cordes Jul 03 '19 at 22:45
  • @PeterCordes sometimes interesting insights come out of doing things that aren't useful. – Taylor Jul 03 '19 at 22:50
  • @Taylor: that's a good point; upvoted your question now that you added more detail. I'm not saying (here or in my answer) that it's a bad *question*, just that it's not something you'd probably actually want to use directly, in this form. And I don't see a reliable/safe way to extract useful snippets from functions in optimized compiler output. It seems to me that this is probably not a useful avenue to investigate further, other than to learn more about how machine-code works. – Peter Cordes Jul 04 '19 at 00:25

5 Answers

5

Here's another reason this won't work, at least on one architecture:

ARM Thumb code keeps large immediate values out of line in literal pools, loaded with PC-relative addressing. An operation like

a += 12345;

could be compiled as:

ldr  r3, [pc, #<offset to constant>]
adds r0, r4, r3

… other unrelated code …

bx   lr      ; end of the function
.word 12345  ; oh, and here's the constant

Copying a fragment of this function would leave the reference to the constant unresolved; it'd end up using an unexpected value from elsewhere in memory.

  • Or across most ISAs, if the block you copy has any `call` instructions to other functions, you have the same problem. Most ISAs encode function-call instructions (x86 `call`, ARM `bl`, etc.) with a PC-relative displacement. Besides literal pools, any code that does PC-relative access to static constants/data will break if you move it relative to the rest of the program. e.g. x86-64 RIP-relative addressing modes. – Peter Cordes Jul 03 '19 at 19:33
3

No, this won't work in general. Here's just one of many reasons why:

Consider the AVR architecture. It's a Harvard architecture, so code and data don't live in the same address space. On it, your `memcpy` would copy whatever data happens to live at the same numeric address in data memory as the code you meant to copy, and the jump would then execute whatever code sits at the destination address in code memory, ignoring your copy entirely.

  • Suppose I know it's not a Harvard architecture. What else? – Taylor Jul 03 '19 at 17:38
  • Even if the target is not a Harvard architecture and the code is position-independent, it might rely on its relation to other code, such as the distance to `add_stuff`. It might rely on registers the compiler sets up prior to the `op_add` label. The code between `op_add` and `op_add_end` might include additional instructions not actually needed for the `add_stuff` call, but that the compiler decided were useful for neighboring code and which will misbehave elsewhere. – Eric Postpischil Jul 03 '19 at 18:06
  • @EricPostpischil that seems like a good answer. You could post it, maybe with some more specifics. – Taylor Jul 03 '19 at 18:18
3

Not only do the underlying instruction set architectures not work that way (as other answers have noted); C doesn't work that way either. Nothing constrains the compiler to place source-level code paths contiguously, in the same order as the labels between them. It's free to make all sorts of transformations, including common-subexpression elimination across different paths, outlining, etc. It can also use the assumption that the code executes inside the function it's written in, rather than in some other context, to derive invariants that feed further transformations.

TL;DR: C is not "high-level assembly" and therefore you can't use it where an assembler is needed.

R.. GitHub STOP HELPING ICE
  • I could imagine it working sometimes in debug mode (`gcc -O0`), where every C statement *is* compiled to a separate block of asm. (Especially if only called from inside the same function that the machine code was copied from, so the stack frame layout matches). But yeah this is pretty much a non-starter; nobody wants debug-mode machine code. Whatever you were hoping to gain by JITing, you could get better results with plain C and letting the optimizer do its thing. – Peter Cordes Jul 03 '19 at 19:29
  • Can you give a precise example of such a transformation and how it would mess things up? Assume we aren't dealing with PC-relative addressing. – Taylor Jul 03 '19 at 22:00
  • @Taylor: For example, the code might contain a branch that the compiler deems unlikely or "slow no matter what", and decide to move that out to the end of the function (or even to a separate section) and jump to it, then jump back at the end, so as to improve cache coherency for the "hot paths". – R.. GitHub STOP HELPING ICE Jul 03 '19 at 22:07
3

Update ... It looks like this little snippet could be clipped out.

Yes it could be clipped out because you compiled with no optimization, and the code is only accessing a local on the stack.

But only if jumped to with the same stack layout as this function, which precludes calling it as a function. But yes with a computed goto I guess that works.

If you were accessing anything in static storage (constants or global/static variables), x86-64 compilers would use RIP-relative addressing modes like `var(%rip)`, which would break if you changed the position of the code relative to the data.

Calls to other functions will also break, because they'll compile to `call rel32`, where the target address is encoded relative to the call site.


Overall this is hardly useful. Anything you could do by copying around blocks of un-optimized code could probably be done more efficiently by writing an interpreter in portable C. Your idea will very easily break with optimization enabled.

For example, if you implemented "increment n times" by copying this block n times, the store/reload of `val` through the stack adds about 5 cycles of store-forwarding latency to the critical path of each increment.

Also, you'd need `__builtin___clear_cache` on the range you copied if you were to enable optimization. And yes, this applies even on x86, where it doesn't actually clear any cache, but does still stop dead-store elimination from removing the `memcpy`.


If you want to spend more time up front to make non-terrible machine code, use a JIT engine like LLVM.

Your idea is viable as a toy / experiment in some limited cases, but I would highly recommend against using it for any real use, especially if performance is your goal.

Peter Cordes
  • This is not a friendly answer. Notice the words of judgment: "hardly useful", "non-terrible". And the unnecessary bold-facing. – Taylor Jul 03 '19 at 22:50
  • @Taylor: I'm not saying you're a bad person for having this idea. I'm trying to give an objective assessment of the viability of this *idea*. i.e. I'm trying to help you reject this idea as maybe possible in theory in super-controlled circumstances, but not useful for performance. Similar to what Brendan's answer says: this can only work without optimization. IDK what you're hoping to accomplish with this. As just a "playing with machine code" / silly computer tricks learning exercise, sure. – Peter Cordes Jul 04 '19 at 00:13
  • I've already accomplished a lot here! I've learned about PC-relative addressing. I didn't know what a Harvard architecture was (I don't do embedded). I've also taken a closer look at assembly, which I haven't really done since grad school. Pity I encountered the usual SO attitude, because all the subtle ways people tell you that you're stupid on this site can really add up. – Taylor Jul 04 '19 at 01:20
1

Will this work?

Yes; this can work.

However, it can only work for very limited cases where:

  • there are no arguments to your functions. For example, if you're JITting a language like brainfuck (see https://en.wikipedia.org/wiki/Brainfuck ), then you might have functions like `void increment(void);` and `void putchar(void);` which modify global state (e.g. `struct machine_state { void * ptr; };`).

  • There is either no position independence (the address of the global state is a fixed/constant address), or you can tell the compiler to reserve a register to use as "pointer to global state" (which is something that GCC supports).

  • The target machine supports some way to treat data as code. Note: This is relatively irrelevant (if a machine can't do this somehow, then you wouldn't be able to execute files either).

  • You use barriers to prevent the compiler from moving code across your labels. Note: This is also relatively irrelevant (e.g. you can use inline assembly to define the labels, mark it volatile, and put everything in the clobber list).

  • You're not using "pure portable C". Note: You're already using compiler specific extensions (computed goto) so I assume this is also relatively irrelevant.

Of all of these limitations, the only one that's likely to matter in practice is the first one - anything worth JITting will be too complicated (e.g. you'll want functions like `void add(int register_number_1, int register_number_2);`), and as soon as you try to pass arguments to your functions you'll end up depending on target-specific calling conventions.

Brendan
  • AVR does have special instructions to load/store to program memory; the point is that `memcpy` won't compile to use those instructions. But yes that's a pretty niche objection; most true Harvard CPUs these days are lightweight microcontrollers that are normally hooked up to Flash, not DRAM, for code. You would need GCC's `__builtin___clear_cache` on the range you copied. – Peter Cordes Jul 03 '19 at 19:43
  • Worth mentioning that "normal" code for x86-64 will violate your requirements: access to static data *does* use PC-relative addressing (aka RIP-relative). So do calls to existing functions, using `call rel32`. – Peter Cordes Jul 03 '19 at 19:45
  • `asm volatile` with clobbers on all registers isn't 100% guaranteed to stop optimizations that break this. e.g. GCC might CSE and hoist something expensive, and then (because of the clobbers) spill it to the stack ahead of the first asm statement, instead of just moving the computation to after. Plus, that means you have to know the register names for the target architecture. – Peter Cordes Jul 03 '19 at 19:47
  • @PeterCordes: You're over-thinking it. When all the code is trivial (no control flow, no loops, nothing that can be optimised - e.g. like `void increment(void) { ptr++; }`) these problems don't exist; and the restrictions (e.g. no args) mean that all code must be trivial (and therefore these problems can't exist). – Brendan Jul 03 '19 at 19:56
  • The OP was talking about picking out code from the *middle* of a function, between two C labels. It wasn't clear if you were still talking about that with calls *to* such a function, but now I guess you're just talking about putting labels at the start/end of that trivial `increment` function. And putting the `memcpy` somewhere else in the function? Or maybe you mean using the function name to get the start address, not depending on local labels. Anyway, of course the RIP-relative addressing problem will still exist unless you work around it like you suggest. – Peter Cordes Jul 03 '19 at 20:04
  • @Brendan what about the case of `val++` as above? (`val` being a local variable in the function) Why would that not work? – Taylor Jul 03 '19 at 21:46
  • @Taylor: That won't work because it depends on either the compiler generated stack layout and function's epilogue (which isn't between the labels) or the register allocation strategy the compiler felt like using. – Brendan Jul 04 '19 at 05:08