For efficiency, I want to use a table of addresses, indexed by a register, to jump to locations within an assembler routine.

An example might make this clearer...

.CODE   
...
AppendByte  PROC
    XOR     RAX, RAX
    MOV     AL,  CL     ; Get this 0-7 index from somewhere else
    JMP     QWORD PTR[RAX + OFFSET APPENDBYTETABLE]
AppendByte  ENDP

AppendByte_7:
    ; Do stuff...
    RET

AppendByte_6:
    ; Do stuff...
    RET

...
AppendByte_0:
    RET

.DATA
    APPENDBYTETABLE QWORD   AppendByte_0, AppendByte_1, AppendByte_2,
                            AppendByte_3, AppendByte_4, AppendByte_5,
                            AppendByte_6, AppendByte_7

END 

This compiles in VS2017 but I then get a linker error. I think this relates to using FAR addresses. How do I generate NEAR offsets and perform a SHORT jmp to offsets stored in a table in the DATA segment?

Note that if I put the AppendByte_x labels inside my proc then the compiler croaks.

RESOLVED! EDIT after advice from fuz...

XOR         RAX, RAX
MOV         AL, REG_PREFIXCODEBITS  
LEA         RCX, APPENDBYTETABLE
JMP         QWORD PTR [RCX + RAX * 8]
nearproc
  • What linker error do you get exactly? “I get an error” is not a useful error description. – fuz Oct 10 '18 at 09:41
  • Apologies, and thanks for the quick response! The error is `1>first.obj : error LNK2017: 'ADDR32' relocation to 'APPENDBYTETABLE' invalid without /LARGEADDRESSAWARE:NO`, followed by `1>LINK : fatal error LNK1165: link failed because of fixup errors`. – nearproc Oct 10 '18 at 09:51
  • Ah, I see what the problem is; let me write an answer. – fuz Oct 10 '18 at 09:53
  • Try replacing `jmp qword ptr [rax + offset APPENDBYTETABLE]` with `lea rcx, APPENDBYTETABLE` and then `jmp qword ptr [rcx + rax]`. Does that work? – fuz Oct 10 '18 at 09:55
  • Thanks, I'll give that a go at lunch time :) – nearproc Oct 10 '18 at 09:57
  • Also note that you might need `rax*8` instead of just `rax` if the index is really in the range 0–7. – fuz Oct 10 '18 at 09:58
  • Is this the way YOU would do this? I.e. Is there a way to perform a NEAR jump and not use QWORDs in the table? I'm fairly new to ML64. – nearproc Oct 10 '18 at 09:58
  • @nearproc This is not about near and far jumps; far jumps are related to segmentation, which isn't really a thing in amd64 assembly. While I'm not too familiar with Microsoft's toolchain, I believe the problem is that memory operands only take 32-bit displacements, but the linker wants to allow your program to be loaded into memory above address 2³². This can be avoided by using `lea` to grab the address of `APPENDBYTETABLE` with a `rip`-relative addressing mode. If this is the solution to the problem, I can write a detailed answer for you, but I want to be sure that it works first. – fuz Oct 10 '18 at 10:01
  • Great, thanks for the explanation. I'll get back to you. – nearproc Oct 10 '18 at 10:02
  • BRILLIANT! Compiles and links successfully now. You certainly know your stuff! – nearproc Oct 10 '18 at 10:48
  • Could I also use OFFSET instead of LEA? – nearproc Oct 10 '18 at 10:53
  • I'm not sure, but anyway `lea` has the shorter encoding so it should be preferred. Let me write up an answer for you. – fuz Oct 10 '18 at 10:59
  • By that comment I mean: `MOV AL, REG_PREFIXCODEBITS`, then `MOV RCX, OFFSET APPENDBYTETABLE`, then `JMP QWORD PTR [RCX + RAX * 8]`. – nearproc Oct 10 '18 at 10:59
  • Ok thanks I'll stick with LEA. – nearproc Oct 10 '18 at 10:59
  • Curious as to where you picked up your expansive assembler skills, know any good resources? – nearproc Oct 10 '18 at 11:00
  • I picked up most of that by observing what kind of assembly the compiler generates and from countless resources on the internet. I haven't really followed a tutorial on that subject. – fuz Oct 10 '18 at 11:02
  • Well, either way I'm impressed. Thanks again :) – nearproc Oct 10 '18 at 11:08
  • unrelated: you want `movzx eax, cl` for efficiency, not xor-zero + a byte merge. It's smaller and faster (zero latency on IvyBridge and later). [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342) You probably also want `jmp [table + rax*8]`, because the entries are 8 bytes each. Or if a disp32 won't link, then you should use a RIP-relative LEA like suggested. But anyway, you definitely want to take advantage of `movzx`. – Peter Cordes Oct 10 '18 at 19:09
  • Good to know. Thanks – nearproc Oct 12 '18 at 17:26

1 Answer

While I am not very familiar with Microsoft's toolchain, I believe the main problem is that in register-plus-displacement addressing modes (including the SIB, or scale/index/base, forms) such as `[RAX + OFFSET APPENDBYTETABLE]`, the displacement is limited to either one or four bytes. The Microsoft linker wants to make your program loadable at any address, including addresses above the first 4 GB of the address space, which requires the full 8 bytes to represent an address. An 8-byte address does not fit in a 4-byte displacement, so the linker rightfully complains.

To fix this, you first have to load a register with the address of `APPENDBYTETABLE` and then index into the table. The general way to do this is to use a `lea` (load effective address) instruction. `lea rax, foo` is like `mov rax, foo`, but instead of loading the memory at `foo`, it loads the address of `foo` into the register. This can be used in conjunction with a `rip` (instruction pointer) relative addressing mode to fetch the address of `APPENDBYTETABLE`, even though the displacement is again limited to 4 bytes. This works because the linker assumes that each program or DLL is individually smaller than 2 GB, so a signed 32-bit offset is always enough to find the address of a variable or function relative to the location of the current instruction. The assembler implicitly chooses a `rip`-relative addressing mode when you access a variable directly without using an index register or SIB addressing mode:

lea rax, APPENDBYTETABLE   ; load address of APPENDBYTETABLE rip-relative

You can of course also use `mov reg, offset foo` to load the address of `foo`. This uses a form of `mov` with an 8-byte immediate. However, this instruction has a longer encoding than `lea reg, foo`, is likely slower, and potentially requires the loader to patch in the correct address when the program is loaded, slowing down the start of your program. Stick with `lea` unless there is a good reason to do otherwise.
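Putting the suggestions from this thread together, the whole routine might look roughly like the sketch below. It is untested and makes some assumptions: it is assembled with ML64, the index arrives in `CL`, and the table is cut down to two hypothetical entries (so the index must be 0 or 1 here) to keep the example short. It uses `movzx` as suggested in the comments, a `rip`-relative `lea`, and a scale factor of 8 because each table entry is a `QWORD`.

```asm
.CODE
AppendByte  PROC
    movzx   eax, cl                     ; zero-extend the index; writing EAX clears the upper half of RAX
    lea     rdx, APPENDBYTETABLE        ; rip-relative, so a 4-byte displacement is enough
    jmp     QWORD PTR [rdx + rax * 8]   ; entries are 8 bytes each, so scale the index by 8
AppendByte  ENDP

AppendByte_1:                           ; hypothetical handler for index 1
    ; Do stuff...
    ret

AppendByte_0:                           ; hypothetical handler for index 0
    ret

.DATA
    ; Shortened to two entries for illustration only.
    APPENDBYTETABLE QWORD   AppendByte_0, AppendByte_1

END
```

The table itself still holds full 8-byte addresses; only the instruction that locates the table needs to avoid a 32-bit absolute address. For comparison, the `rip`-relative `lea` encodes in 7 bytes, while `mov reg, imm64` takes 10.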

fuz