
I'm trying to implement an algorithm in assembly (MASM64, Windows, x64) using jump tables. Basic idea is: there are 3 different types of operations I need to do with data. The operations depend on some variables, but I found it tedious to implement a lot of switching and many long implementations.

    PUBLIC superFunc@@40 ;__vectorcall decoration
    .DATA
    ALIGN 16
    jumpTable1 qword func_11, func_12, func_13, func_14
    jumpTable2 qword func_21, func_22, func_23, func_24
    jumpTable3 qword func_31, func_32, func_33, func_34

    .CODE
    superFunc@@40 PROC
            ;no stack actions, as we should do our stuff as a leaf function
            ;assume the first parameter (rcx) is our jumpTable index, and it's
            ;the same index for all functions
            mov     rax,    qword ptr [rcx*8 + offset jumpTable1]
            mov     r10,    qword ptr [rcx*8 + offset jumpTable2]
            mov     r11,    qword ptr [rcx*8 + offset jumpTable3]
            jmp     qword ptr [rax]
    superFunc@@40 ENDP
    func_11:
            [...] do something with data
            jmp     qword ptr [r10]
    func_12: ; shorted, simply does something else to the data and jumps thru r10
    [...]
    func_21:
            [...] do something with data
            jmp     qword ptr [r11]
    func_22: ; shorted, simply does something else to the data and jumps thru r11
    [...]
    func_31:
            [...] do something with data
            ret
    func_32: ; shorted, simply does something else to the data and returns
    END

Now this assembles fine, but it doesn't link with my main C++ plugin (a DLL), giving me the following linker errors:

    LINK : warning LNK4075: ignoring '/LARGEADDRESSAWARE:NO' due to '/DLL' specification
    error LNK2017: 'ADDR32' relocation to 'jumpTable1' invalid without /LARGEADDRESSAWARE:NO

How can I implement something like this correctly? Maybe better phrased: How do I implement jump tables and jumping/calling to addresses from those tables correctly in MASM64?

P.S.: I could set up a function table in C++ and tell the superFunc about it via a parameter. That would be what I will do if I don't find a better solution.

asked by St0fF; edited by Ross Ridge
  • Basically the same problem as [Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array](//stackoverflow.com/q/47300844) - `[table + reg*scale]` can't use RIP-relative addressing. – Peter Cordes Aug 20 '19 at 09:58
  • I don't see why you want a chain of indirect jumps. In this example, all the jumps except the first `jmp [jumpTable1 + rcx*8]` have only one possible target so should be `jmp rel32` direct jumps (because reaching this block was only possible with one known value of `rcx`), or much better just fall-through. You might want some macro stuff to keep track of what should jump where. Or to put blocks of code into macros so you can inline them. – Peter Cordes Aug 20 '19 at 10:03
  • @PeterCordes Thank you for your comments. I don't see the same problem as with the thread you linked. There are completely different things happening, besides (N)asm vs. (M)asm. The example I just wrote from scatch, the real problem is by far more complicated, but this example should be sufficient (if filled with data action) to generate the linker errors when including the code in a C++ dll project. – St0fF Aug 20 '19 at 10:07
  • It's exactly the same problem: `[table + rcx*8]` can only be encoded in x86 machine code as `[disp32 + rcx*8]` , and thus only works with non-large addresses that fit in a 32-bit signed absolute address. Both MacOS (always) and Windows (with LARGEADDRESSAWARE:YES) disallow that, so it's the same problem and the solutions are the same. It's a machine-code encoding limitation, nothing to do with which assembler you're using. It's also the same problem as [32-bit absolute addresses no longer allowed in x86-64 Linux?](//stackoverflow.com/q/43367427) where `-no-pie` = no large address – Peter Cordes Aug 20 '19 at 10:10
  • Ok, so in that case I just cannot do it this way, obviously. I could though set up the function tables in C++ and give table pointers to my assembly code as function parameters? I mean, at the end I'm just trying to set up something like a static vTable ... – St0fF Aug 20 '19 at 10:13
  • 1
    Yes you can, it's still possible to index static arrays! The first Q&A I linked shows you how; that's why I linked it. – Peter Cordes Aug 20 '19 at 10:54

1 Answer


RIP-relative addressing only works when there are no other registers in the addressing mode.

`[table + rcx*8]` can only be encoded in x86-64 machine code as `[disp32 + rcx*8]`, and thus only works with non-large addresses that fit in a 32-bit signed absolute address. Windows can apparently support this with `/LARGEADDRESSAWARE:NO`, like compiling with `-no-pie` on Linux to solve the same problem.

MacOS has no workaround: you can't use 64-bit absolute addressing at all there. [Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array](//stackoverflow.com/q/47300844) shows how to index a static array using a RIP-relative `lea` to get the table address into a register while avoiding 32-bit absolute addresses.

Your jump tables themselves are fine: they use 64-bit absolute addresses which can be relocated anywhere in virtual address space. (Using load-time fixups after ASLR.)


I think you have one too many levels of indirection: since you already load a function pointer into a register, you should be using `jmp r10`, not `jmp [r10]`. Doing all the loads into registers up front gets them into the pipeline sooner, before any possible branch mispredict, so it's maybe a good idea if you have lots of registers to spare.

Much better would be inlining some of the later blocks, if they're small, because the blocks reachable by any given RCX value aren't reachable any other way. So it would be much better to inline all of `func_21` and `func_31` into `func_11`, and so on for `func_12`. You might use assembler macros to make this easier.

Actually what matters is just that the jump at the end of `func_11` always goes to `func_21`. It's fine if there are other ways to reach that block, e.g. from other indirect branches that skip table 1. That's no reason for `func_11` not to fall into it; it only limits what optimizations you can make between those 2 blocks, since `func_21` still has to be a valid entry point for execution paths that didn't fall through from `func_11`.


But anyway, you can implement your code like this. If you do optimize it, you can remove the later dispatching steps and the corresponding loads.

I think this is valid MASM syntax. If not, it should still be clear what the desired machine-code is.

    lea    rax,  [jumpTable1]          ; RIP-relative by default in MASM, like GAS [RIP + jumpTable1] or NASM [rel jumpTable1]

    ; The other tables are at assemble-time-constant small offsets from RAX
    mov    r10,  [rax + rcx*8 + jumpTable2 - jumpTable1]
    mov    r11,  [rax + rcx*8 + jumpTable3 - jumpTable1]
    jmp    [rax + rcx*8]


func_11:
    ...
    jmp  r10         ; TODO: inline func_21  or at least use  jmp func_21
                     ;  you can use macros to help with either of those

Or if you only want to tie up a single register for one table, maybe use:

    lea    r10,  [jumpTable1]    ; RIP-relative LEA
    lea    r10,  [r10 + rcx*8]   ; address of the function pointer we want
    jmp    [r10]

align 8
func_11:
    ...
    jmp   [r10 + jumpTable2 - jumpTable1]    ; same index in another table


align 8
func_21:
    ...
    jmp   [r10 + jumpTable3 - jumpTable1]    ; same index in *another* table

This takes full advantage of the known static offsets between tables.


Cache locality for the jump targets

In your matrix of jump targets, any single traversal strides down a "column" to follow some chain of jumps. It would obviously be better to transpose the layout so that one chain of jumps goes along a "row", so all the targets come from the same cache line.

i.e. arrange your table so `func_11` and `func_21` can end with `jmp [r10+8]` and then `jmp [r10+16]`, instead of adding some offset between tables, for improved spatial locality. L1d load latency is only a few cycles, so there's not much extra delay for the CPU in checking the correctness of branch prediction, vs. if you'd loaded into registers ahead of the first indirect branch. (I'm considering the case where the first branch mispredicts, so OoO exec can't "see" the memory-indirect `jmp` until after the correct path for that starts to issue.)


Avoiding 64-bit absolute addresses:

You can also store 32-bit (or 16 or 8-bit) offsets relative to some reference address that's near the jump targets, or relative to the table itself.

For example, look at what GCC does when compiling switch jump tables in position-independent code, even for targets that do allow runtime fixups of absolute addresses.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 includes a testcase; see it on Godbolt with GCC's MASM-style `.intel_syntax`. It uses a `movsxd` load from the table, then `add rax, rdx` / `jmp rax`. The table entries are things like `dd L27 - L4` and `dd L25 - L4` (where those are label names, giving the distance from a jump target to the "anchor" `L4`).

(Also related for that case https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585).

Peter Cordes
  • Now I'm beginning to understand. Thank you very much. P.S.: as I said, the tables are not as evenly filled as in the example, so although only one index tells what addresses should be taken out of the tables, it's not like there's unreachable code. – St0fF Aug 20 '19 at 11:00
  • @St0fF: huh? I didn't say you had any unreachable code. But if the *only* way to reach `func_21` is from the end of `func_11`, you should just fall through into it instead of reaching it with an indirect branch. Even if that's not true and there are other ways to reach `func_21` (e.g. if something else skips table 1) , you can still put it there so `func_11` can fall through to it. If it's like a vtable, then this is like inlining `foo::f1()` into `foo::f2()` because (with static vtables) the "type" is statically known so we know there's no override. – Peter Cordes Aug 20 '19 at 11:04
  • Sorry, I didn't make it clear enough (problems of non-native speakers). Fact is: the real tables look a lot different, thus func_21 could be followed by multiple other functions, that's why I wanted to do it like that. It's much easier to build a few tables, than writing a big load of code simply to cover all possible combinations. P.S.: it links, so now I can start debugging and finding more mistakes ;) – St0fF Aug 20 '19 at 11:27
  • @St0fF: you don't have to actually *write* a big load of code; you can define each block as a *macro* and glue them together by writing `func_1:` `BLOCK11` ; `BLOCK21` ; `BLOCK31` or something. You can use your BLOCK21 macro anywhere else it's needed, too. If the blocks are small, this inlining is probably worth it. Or if there's redundancy between blocks, just write inline functions in C++ and let a compiler inline + optimize between functions! If your blocks are big enough, branch overhead might not be hurting much (especially if the branches predict well in real use). – Peter Cordes Aug 20 '19 at 11:35
  • @St0fF: In your real use-case I guess you maybe have some duplicate entries in table1 or something, and each "column" of your jump matrix is some unique chain of blocks? That still doesn't rule out inlining with macros. (And BTW, it would obviously be better to go along "rows" so all the targets come from the same cache line. So you just do `jmp [r10+8]` instead of + some offset between tables). – Peter Cordes Aug 20 '19 at 11:40
  • But anyway, unless `func_31` can be the same block as `func_12` or something, I think each block still has a unique target so should at worst be a `jmp rel32` if not a fall-through. You should be able to use macros to put the block-ordering logic in one place in the source, and do something like `jmp BLOCK12_TARGET` where that name is defined with `BLOCK12_TARGET equ func_22` or whatever. – Peter Cordes Aug 20 '19 at 11:42
  • To say the truth: the background is a complex mesh data exporter with data reduction abilities. I had in mind to write specially tailored functions in C++ but I lacked the vigor to do all the writing. Thus I wanted to learn more x64-assembly and gave it a try (coming from 6510 assembly over 20 years ago). I had a lot of fun and learned that restriction and received a much deeper understanding of those CISC addressing modes. So thank you very much for your great input. But please don't bother with the problem at hand. There are millions of ways to implement it and I rather like my solution – St0fF Aug 20 '19 at 22:59
  • P.S.: the first table is a data loader. The 2nd does reduction and the 3rd does storing. I have 8 load functions, 8 store functions and 5 reduction functions. But indeed, I will reorder (and interleave with stride) the tables, so all function pointers come from the same cache line. I'll also combine with your other hint to use the same reg for the jumps with an offset. Thanks a lot for those hints :) – St0fF Aug 20 '19 at 23:13
  • Just one last question: on the C64 call/ret (i.e. jsr/rts) took 6 cycles each, whereas an indirect jump only took 4 - does Intel have that same restriction? Should I rather use "only jumps" - it's not in the example, but the inner loop "call"s the "loader", and the small functions do "jmp", the storer ends with "ret", then it returns back to the inner loop control. Would only jumping improve performance? – St0fF Aug 20 '19 at 23:19
  • @St0fF: ok, if this is partly just a learning exercise, then it's not crazy to actually implement the jumps the way you were picturing even though it's somewhat less efficient. Re: performance: making it like a `switch` with `jmp` (like you show in the question) instead of a table of "function pointers" with `call` for the first step might help slightly, with all "storer" functions ending with a `jmp mainloop_after_store` or something. – Peter Cordes Aug 21 '19 at 00:17
  • But modern x86 is superscalar / out-of-order execution with branch prediction and speculative execution to take control dependencies (branching) off the critical path. Performance of an instruction has 3 dimensions: latency, front-end uop cost, and back-end execution port cost. (Branch prediction hides latency of branching). There's nothing like a 1-dimensional "cycle cost" you can simply add up to get a total. The extra cost of a `call` vs. a `jmp` is an extra uop to push the return address, and only slows you down if there's a front-end or back-end throughput bottleneck. – Peter Cordes Aug 21 '19 at 00:18
  • And CPUs have a return-address predictor stack to predict `ret` correctly for matched call/ret. (Because otherwise it's an indirect branch to an address popped from the stack.) A chain of `jmp` between the call and ret doesn't hurt; the next ret still goes to the instruction after the call. So tailcall optimization (call/ret => jmp) is still good. Anyway, see also https://agner.org/optimize/ and [How many CPU cycles are needed for each assembly instruction?](//stackoverflow.com/a/44980899) and [Modern x86 cost model](//stackoverflow.com/q/9957004) – Peter Cordes Aug 21 '19 at 00:25
  • @St0fF: also related: [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](//stackoverflow.com/q/51607391). Also interesting: [Slow jmp-instruction](//stackoverflow.com/q/38811901) shows what can happen if you have far too many branches and defeat branch prediction. (The cost of a branch is *not* fixed, it depends on how predictable it is. The cost of a correctly-predicted branch is pretty small.) Anyway, in most cases the extra work of push/pop return address is hidden by doing it in parallel with other work – Peter Cordes Aug 21 '19 at 00:28
  • Thank you again, sounds like great input I will study over the next days. – St0fF Aug 21 '19 at 11:28