
I am looking at the below virtual method call in x86-64:

mov     rcx, qword ptr [x]   
mov     rax, qword ptr [rcx]
call    qword ptr [rax+8]

and also Agner Fog's latency tables:

http://www.agner.org/optimize/instruction_tables.pdf

As I am using an Ivy Bridge CPU I am looking at page 175.

  1. Am I right in that the first two MOV instructions both only take 2 (they are both move memory to register) CPU cycles? I thought a call to a virtual method was slower than this?

  2. In the instruction latency table page 178 it says the latency of this call is 2 CPU cycles (I think?). What does CALL 'near' mean, as opposed to CALL 'r' (register) and CALL 'm' (memory)?

  3. So the above ASM does take 6 CPU cycles according to the Fog booklet, I haven't misinterpreted anything?

EDIT: I changed the virtual function call to be the second in the vtable.

  • Don't forget that any of these memory accesses can cache miss. And the call may also invoke a branch target misprediction. – Mysticial Mar 08 '14 at 04:18
  • @Mysticial completely understood. Was just trying to look at the guaranteed minimum cost. – user997112 Mar 08 '14 at 04:30
  • Since the only dependency on the moves is confirming the call target prediction, for a correct prediction the latency of the operations would be hidden by out-of-order execution (there would be fetch, decode, and execution overhead). However, the latencies of the moves would increase the misprediction penalty since the true value would be available later than if the call address had been in a register already. –  Mar 08 '14 at 07:51
  • @PaulA.Clayton the above instructions all depend on each other though- so they would have to be executed in that order? 3rd depends on 2nd and 2nd depends on 1st? – user997112 Mar 08 '14 at 20:23
  • Yes, they are dependent (which increases the penalty on a target misprediction). However, these dependencies do not chain to anything else (assuming correct target prediction), so the OoO mechanism can keep executing instructions. They will take up ROB space (and issue queue slots for the latter two) and delay later instructions *committing*, but barring a misprediction (which can be common for indirect calls) this would be similar to the cost of an equal number of dependent loads and a (dependent) add whose result is never used. –  Mar 08 '14 at 21:40
  • 1
    @user997112: **near** and **far** calls differ by whether the target function is in the same memory *segment* (horrible stuff, stick to x86-64 and you won't meet this horror), while **register (r)** or **memory (m)** calls differ by a level of indirection. There are also relative calls, and those are probably the most common. – EOF Mar 08 '14 at 23:19
  • "first two MOV instructions both only take 2 ... CPU cycles?" - *NO*, because they are loads from memory. If the load targets are cached, each will take [3-4 cycles to load data even from L1](http://stackoverflow.com/a/2354955/196561), a dozen cycles from L2, and hundreds in the case of a memory access. If you have a lot of different indirect calls, there is a higher probability of misses. And OoO will not finish ("retire") any instruction from a predicted target (even a correctly predicted one) before the real target check (it will wait for all accesses to caches and/or memory). So, I estimate cost as several times – osgx Mar 15 '14 at 06:31

1 Answer


Am I right in that the first two MOV instructions both only take 2 (they are both move memory to register) CPU cycles? I thought a call to a virtual method was slower than this? In the instruction latency table page 178 it says the latency of this call is 2 CPU cycles (I think?).

No, 2 CPU cycles is only the minimal latency.

Let's check Agner's tables: http://www.agner.org/optimize/instruction_tables.pdf

Integer instructions (relevant columns only; the full table also lists uops in the unfused domain per port p015/p0/p1/p5):

Instruction   Operands          uops fused domain   p23   p4   Latency   Reciprocal throughput
MOV           r32/64, m32/64    1                   1     -    2         0.5

To find the time at which an instruction will produce its result, use the "Latency" column. The latency is 2 cycles for each mov, and only the minimal value is listed (see the text in "Explanation of column headings": "Latency - This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, ... may increase the clock counts considerably.").

If you have a lot of different polymorphic calls, the memory they touch may not be cached. We know cache and memory latencies from various reviews, and all of them were measured via a long chain of dependent MOVs like mov eax, [eax]; mov eax, [eax]; mov eax, [eax]; .... Values for Ivy Bridge are: hit in L1 = 4 cycles, hit in L2 = 11 cycles, hit in L3 = 30-40 cycles, miss in all caches with memory access = 32 cycles + 60 ns (at 3 GHz with 3 cycles per ns, that is > 200 cycles). There is not even an easy case that gives a 2-cycle latency (what is closer to the ALU than L1? only the 72-entry load buffer for reordered loads?), and there is no chance of a 2-cycle latency on the second mov: its operand is the result of the first mov, so nothing on this chain can execute out of order before the first mov's result is available.

In the tables at http://instlatx64.atw.hu/ (linked from Agner's Links page) there is an InstLatX64 report for Ivy Bridge (Intel Core i7-3770K, 3700 MHz), made with aida_bench64.dll:

27 AMD64 :MOV r64, [m64] L: 1.14ns= 4.0c T: 0.14ns= 0.50c

And this table shows the real latency (L) for a hit in the L1 cache: 4 cycles.

The same data (4c for L1, ~12c for L2, 26-31c for L3) appears in 64-ia-32-architectures-optimization-manual.pdf, page 46, section "2.2.5.1 Load and Store Operation Overview", Table "2-10 Lookup Order and Load Latency".

So the above ASM does take 6 CPU cycles according to the Fog booklet, I haven't misinterpreted anything?

In the best case: the first load was executed early thanks to out-of-order execution = 2 cycles on the critical path; the second load hits in L1 = 4 cycles on the critical path; 2 cycles for the call execution; and the BTB (branch target buffer, i.e. indirect branch target prediction) succeeds, which is more probable when a given call site always jumps to the same target (or to a small number of targets with a periodic pattern). That gives 8 cycles to confirm that the branch was predicted correctly, which may be partially hidden by OoO execution of the target function.

If either load misses in L1/L2, add the corresponding cache latency; if it misses in L3 too, add ~200 cycles.

If the BTB misses, you will have at least a 15-cycle penalty (check Agner's microarchitecture.pdf, page 27, "3.7 Branch prediction in Intel Sandy Bridge and Ivy Bridge; Misprediction penalty") - that is for cached uops; it is more when the target must be fetched from L1i. You can read about the older BTB in the same microarchitecture.pdf, page 25, "3.5 Branch prediction in PM and Core2; Pattern recognition for indirect jumps and calls" and "BTB organization .. for indirect jumps and indirect calls".

A very useful document from Intel is the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" (64-ia-32-architectures-optimization-manual.pdf). It has both tuning suggestions and information about performance counters, which will help you get real latencies and miss rates for your code (check section B.6.3.2, "Virtual Tables and Indirect Calls").

  • A block of 3 or 4 µops should be effectively fully hidden (rather than "partially hidden by OoO execution of target function") on Ivy Bridge with >50 issue queue entries and >100 ROB entries — under the correct target prediction — since there are no *data* dependencies on the call. By the way, the OP commented "Was just trying to look at the guaranteed minimum cost." (perhaps that should be edited into the question), so the extra information is nice and useful but not *strictly* necessary to answer the question. –  Mar 15 '14 at 15:13