
I got the following assembly listing from the JIT compilation of my Java program.

mov    0x14(%rsp),%r10d
inc    %r10d              

mov    0x1c(%rsp),%r8d
inc    %r8d               

test   %eax,(%r11)         ; <--- this instruction

mov    (%rsp),%r9
mov    0x40(%rsp),%r14d
mov    0x18(%rsp),%r11d
mov    %ebp,%r13d
mov    0x8(%rsp),%rbx
mov    0x20(%rsp),%rbp
mov    0x10(%rsp),%ecx
mov    0x28(%rsp),%rax    

movzbl 0x18(%r9),%edi     
movslq %r8d,%rsi          

cmp    0x30(%rsp),%rsi
jge    0x00007fd3d27c4f17 

My understanding is that the test instruction is useless here, because the whole point of test is:

The flags SF, ZF, PF are modified while the result of the AND is discarded.

and here we don't use the resulting flags.

Is this a bug in the JIT, or am I missing something? If it is a bug, where is the best place to report it? Thanks!

QIvan
  • This instruction does indeed seem useless. – fuz Jan 05 '19 at 18:23
  • FWIW, it implicitly checks that r11 contains a valid pointer, and raises an exception if not. Is that intentional? I don't know, out of context. –  Jan 05 '19 at 18:45
  • Now that we know the answer, if the JVM had more time to analyze the surrounding code it could have used `mov (%r11), %r9d` because `r9` is about to be written by another instruction. MOV is the same number of code bytes, but it's a pure load without an ALU uop. This is a minor optimization because ALU port pressure is almost certainly not a problem here, and modern x86 CPUs keep the load micro-fused into a single uop with the ALU instruction through most of the pipeline so it doesn't hurt front-end throughput. – Peter Cordes Jan 06 '19 at 01:36
  • But it does take an extra scheduler entry until the load is ready so the ALU uop can execute, and 2 ROB entries on Sandybridge and earlier Intel. IvyBridge & later have fused-domain ROB, but SnB has an unfused-domain ReOrder Buffer. Source: Mentioned in a row in table 3 in this paper: http://publications.vpw.me/publications/2015_uop_flow_simulation.pdf. See [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/posts/comments/93636916) – Peter Cordes Jan 06 '19 at 01:38
  • @PeterCordes That's pretty counterintuitive and strange. I always thought micro-fused uops stay fused until they are dispatched to an execution port. I double-checked Agner Fog's manual, and it also says the uop stays fused in the RS. It even says on page 92 that saving a ROB entry has been an advantage of micro-fusion since PM, which is quite reasonable. Are you sure the ROB was unfused-domain until IvyBridge? – llllllllll Jan 06 '19 at 04:46
  • @liliscent: I shouldn't have said SnB *and earlier*. I think the paper only claimed SnB specifically vs. IvB, and didn't mention pre-SnB. (And didn't provide any info on exactly how they reached that conclusion, so I don't 100% trust it, but it looks like a good paper otherwise.) Yeah, I think P6-family has a fused-domain ROB. SnB simplified the internal uop format (hence un-lamination of indexed addr modes), so experimenting with an unfused ROB is plausible, then going back to fused-domain ROB for IvB after figuring out how, or just finding it wasn't worth the tradeoff. – Peter Cordes Jan 06 '19 at 05:00
  • @liliscent: As my answer on [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/posts/comments/93636916) (Footnote 2) showed with experimental results from SKL and Conroe/Merom, P6-family has a fused-domain RS, but Skylake has an unfused-domain RS where each uop to be dispatched takes a separate entry. Well spotted that Agner Fog's uarch guide is wrong about the RS being fused-domain when he describes it in the HSW/BDW section. I think I missed that and had always assumed the RS was unfused-domain before testing. – Peter Cordes Jan 06 '19 at 05:03
  • @PeterCordes I think you're right, especially after reading your linked answer. There must have been some change in SnB. Agner Fog's description of micro-fusion is too vague for the later microarchitectures; it seems to simply claim that SnB through SKL are exactly the same as PM, without much detail. – llllllllll Jan 06 '19 at 07:09
  • @liliscent: Agner hasn't had as much time to dedicate to testing stuff for HSW/SKL as he used to have, I think. He did a lot of work on figuring out Sandybridge's uop-cache, but some HSW/SKL stuff is unfortunately just copy-pasted incorrectly. And yeah, he can miss changes nobody was looking for! e.g. He counted uops by testing how they pack into the uop-cache, rather than with perf counters for fused/unfused-domain, which is why he missed un-lamination on SnB-family entirely, and couldn't reproduce my results on [Micro fusion and addressing modes](https://stackoverflow.com/q/26046634) – Peter Cordes Jan 06 '19 at 07:20

1 Answer


That must be the thread-local handshake poll. Look at where %r11 is loaded from. If it is read from some offset off %r15 (thread-local storage), that's the one. See the example here:

  0.31%  ↗  ...70: movzbl 0x94(%r9),%r10d    
  0.19%  │  ...78: mov    0x108(%r15),%r11  ; read the thread-local page addr
 25.62%  │  ...7f: add    $0x1,%rbp          
 35.10%  │  ...83: test   %eax,(%r11)       ; thread-local handshake poll
 34.91%  │  ...86: test   %r10d,%r10d
         ╰  ...89: je     ...70

It is not useless: it causes a SEGV once the guard page is marked non-readable, and that transfers control to the JVM's SEGV handler. This is part of the JVM's mechanism for bringing Java threads to a safepoint, e.g. for GC.
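To make the trick concrete, here is a minimal sketch in C for Linux of the same idea: a load whose result is thrown away, from a page that can be made unreadable on demand. This is not HotSpot's code. Every name in it (`poll_init`, `poll_arm`, `safepoint_poll`, `run_loop`, `safepoint_hits`) is made up for illustration, and a real JVM would park the thread in the handler until the safepoint operation finishes, whereas this sketch only counts the hit and re-enables the page.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical names throughout; an illustration, not HotSpot's implementation. */
static void *poll_page;                       /* page that the hot loop reads from */
static size_t poll_page_size;
static volatile sig_atomic_t safepoint_hits;  /* how many polls trapped */

/* SIGSEGV handler: a fault inside the poll page means "safepoint requested". */
static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    char *base = (char *)poll_page;
    if (addr >= base && addr < base + poll_page_size) {
        safepoint_hits++;
        /* Disarm the page so the faulting load succeeds when it is retried.
           mprotect is not formally async-signal-safe; this relies on it
           being a plain system call on Linux. */
        mprotect(poll_page, poll_page_size, PROT_READ);
    } else {
        /* A genuine crash: restore the default action and let it re-fault. */
        signal(SIGSEGV, SIG_DFL);
    }
}

/* Map a readable page and install the handler. Returns 0 on success. */
static int poll_init(void) {
    poll_page_size = (size_t)sysconf(_SC_PAGESIZE);
    poll_page = mmap(NULL, poll_page_size, PROT_READ,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (poll_page == MAP_FAILED) return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* "Request a safepoint": make the page unreadable so the next poll traps. */
static int poll_arm(void) {
    return mprotect(poll_page, poll_page_size, PROT_NONE);
}

/* The poll itself: one load whose value is discarded, like test %eax,(%r11). */
static inline void safepoint_poll(void) {
    unsigned char v = *(volatile unsigned char *)poll_page;
    (void)v;
}

/* A hot loop with a poll on every iteration; the page is armed at arm_at. */
static long run_loop(long iterations, long arm_at) {
    long sum = 0;
    for (long i = 0; i < iterations; i++) {
        if (i == arm_at) poll_arm();
        sum += i;
        safepoint_poll();
    }
    return sum;
}
```

Calling `poll_init()` and then `run_loop(1000, 500)` should take the fault exactly once, at the iteration where the page is armed, and then finish the loop normally. The point of the design is that the fast path costs a single load with no compare or branch, and all the expense is pushed onto the rare path where a safepoint is actually requested.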

UPD: Hopefully, more details here.

Aleksey Shipilev