
After testing about 10 billion times, m64 seemed to be about 0.1 nanoseconds faster than imm64 on AMD64, but I don't really understand why. Isn't the address of val_ptr in the following code an immediate value itself?

# Text section
.section __TEXT,__text,regular,pure_instructions
# 64-bit code
.code64
# Intel syntax
.intel_syntax noprefix
# Target macOS High Sierra
.macosx_version_min 10,13,0

# Make those two test functions global for the C measurer
.globl _test1
.globl _test2

# Test 1, imm64
_test1:
  # Move the immediate value 0xDEADBEEFFEEDFACE to RAX (return value)
  movabs rax, 0xDEADBEEFFEEDFACE
  ret
# Test 2, m64
_test2:
  # Move from the RAM (val_ptr) to RAX (return value)
  mov rax, qword ptr [rip + val_ptr]
  ret
# Data section
.section __DATA,__data
val_ptr:
  .quad 0xDEADBEEFFEEDFACE

The measurement code is:

#include <stdio.h>            // For printf
#include <stdlib.h>           // For EXIT_SUCCESS
#include <math.h>             // For fabs
#include <stdint.h>           // For uint64_t
#include <stddef.h>           // For size_t
#include <string.h>           // For memset
#include <mach/mach_time.h>   // For time stuff

#define FUNCTION_COUNT  2     // Number of functions to test
#define TEST_COUNT      0x10000000  // Number of times to test each function

// Type aliases
typedef uint64_t rettype_t;
typedef rettype_t (*function_t)(void);

// External test functions (defined in Assembly)
rettype_t test1(void);
rettype_t test2(void);

// Program entry point
int main() {

  // Time measurement stuff
  mach_timebase_info_data_t info;
  mach_timebase_info(&info);

  // Sums to divide by the test count to get average
  double sums[FUNCTION_COUNT];

  // Initialize sums to 0
  memset(sums, 0, sizeof sums);

  // Functions to test
  function_t functions[FUNCTION_COUNT] = {test1, test2};

  // Useless results (should be 0xDEADBEEFFEEDFACE), but good to have
  rettype_t results[FUNCTION_COUNT];

  // Function loop, may get unrolled based on optimization level
  for (size_t test_fn = 0; test_fn < FUNCTION_COUNT; test_fn++) {
    // Test this MANY times
    for (size_t test_num = 0; test_num < TEST_COUNT; test_num++) {
      // Get the nanoseconds before the action
      double nanoseconds = mach_absolute_time();
      // Do the action
      results[test_fn] = functions[test_fn]();
      // Measure the time it took
      nanoseconds = mach_absolute_time() - nanoseconds;

      // Convert it to nanoseconds
      nanoseconds *= info.numer;
      nanoseconds /= info.denom;

      // Add the nanosecond count to the sum
      sums[test_fn] += nanoseconds;
    }
  }
  // Compute the average
  for (size_t i = 0; i < FUNCTION_COUNT; i++) {
    sums[i] /= TEST_COUNT;
  }

  if (FUNCTION_COUNT == 2) {
    // Print some fancy information
    printf("Test 1 took %f nanoseconds average.\n", sums[0]);
    printf("Test 2 took %f nanoseconds average.\n", sums[1]);
    printf("Test %d was faster, with %f nanoseconds difference\n", sums[0] < sums[1] ? 1 : 2, fabs(sums[0] - sums[1]));
  } else {
    // Else, just print something
    for (size_t fn_i = 0; fn_i < FUNCTION_COUNT; fn_i++) {
      printf("Test %zu took %f clock ticks average.\n", fn_i + 1, sums[fn_i]);
    }
  }

  // Everything went fine!
  return EXIT_SUCCESS;
}

So, which really is faster, m64 or imm64?

By the way, I'm using an Ivy Bridge Intel Core i7 with DDR3 RAM, running macOS High Sierra.

EDIT: I inserted the ret instructions, and now imm64 turned out to be faster.

SpilledMango
  • How exactly did you test? Your `_test1` doesn't return: it falls through into `_test2` (which also falls through). Did you unroll at all? – Peter Cordes Sep 26 '17 at 18:53
  • You call `mach_absolute_time()` twice per call to the test function!?!?!???? It takes orders of magnitude longer, since it has to run at least a `RDTSC` instruction (21 uops, and throughput of one per 27c on IvB). Well you're probably not going to be bottlenecked on the front-end. – Peter Cordes Sep 27 '17 at 13:08
  • But how can I solve that? – SpilledMango Sep 27 '17 at 14:40
  • You really can't. Read Agner Fog's asm optimization guides to realize that measuring a single instruction by itself is nonsensical on modern x86. Or at least that the correct measurement has 3 dimensions: front-end cost in uops, latency (which doesn't really apply for setting up constants), and which execution port resources it consumes. It's only sensible to time an entire loop which contains the instruction, and realize that the result depends on the details of the loop and doesn't tell you when any given instruction in it will be better or worse in a different loop. – Peter Cordes Sep 27 '17 at 15:38
  • Basically, your mental model of CPU performance is incorrect. You can't just measure a single instruction. You're having the same problem as https://stackoverflow.com/questions/46442178/how-is-it-possible-that-bitwise-and-operation-to-take-more-cpu-clocks-than-arith, except you're at least using asm instead of trying to time a `+` operator in C. But it's the same fundamental mistake: context matters, and **the fastest way to do something in one context might not be the best with different surrounding code.** See BeeonRope's comments on my answer. – Peter Cordes Sep 27 '17 at 15:42

1 Answer


Your measurement loop wraps each individual call in mach_absolute_time(), so you're measuring wall-clock time, not core clock cycles (with performance counters). Your sources of measurement noise therefore include the overhead of the timing calls themselves, turbo / power-saving, and sharing a physical core with another logical thread (on an i7).
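
One way to reduce that noise (a rough sketch, not a complete benchmark; the helper name avg_ns is made up, and it reuses the question's timebase setup) is to timestamp once around a whole batch of calls and divide by the count, instead of paying the mach_absolute_time() overhead on every single call:

#include <stddef.h>           // For size_t
#include <stdint.h>           // For uint64_t
#include <mach/mach_time.h>   // For mach_absolute_time

// Average nanoseconds per call of fn, timing the whole loop once
static double avg_ns(uint64_t (*fn)(void), size_t reps,
                     mach_timebase_info_data_t info) {
  uint64_t sink = 0;
  uint64_t start = mach_absolute_time();
  for (size_t i = 0; i < reps; i++)
    sink ^= fn();  // XOR into a sink to keep the result live
  uint64_t elapsed = mach_absolute_time() - start;
  (void)sink;
  return (double)elapsed * info.numer / info.denom / (double)reps;
}

This still measures wall-clock time, so the other noise sources remain, but at least the timing overhead is amortized over all the iterations.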


On Intel IvyBridge:

movabs rax, 0xDEADBEEFFEEDFACE is an ALU instruction

  • Takes 10 bytes of code-size (which might or might not matter depending on surrounding code).
  • Decodes to 1 uop for any ALU port (p0, p1, or p5). (max throughput = 3 per clock)
  • Takes 2 entries in the uop cache (because of the 64-bit immediate), and takes 2 cycles to read from the uop cache. (So running from the loop buffer is a significant advantage for front-end throughput, if that's the bottleneck in code containing this).

mov rax, [RIP + val_ptr] is a load

  • takes 7 bytes (REX + opcode + modrm + rel32)
  • decodes to 1 uop for either load port (p2 or p3). (max throughput = 2 per clock)
  • fits in 1 entry in the uop cache (no immediate, and a 32-bit or smaller address offset).
  • runs a lot slower if the load is split across a page boundary, even on Skylake.
  • can miss in cache the first time.

Source: Agner Fog's microarch pdf and instruction tables. See Table 9.1 for uop-cache stuff. See also other performance links in the tag wiki.
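
If you want to check the encodings and sizes yourself, here is a minimal C sketch using GNU inline asm (AT&T syntax, unlike the Intel-syntax code above; via_imm64, via_m64, and kVal are made-up names). Disassembling the object file shows the 10-byte and 7-byte forms:

#include <stdint.h>

static const uint64_t kVal = 0xDEADBEEFFEEDFACEull;

// 10-byte mov r64, imm64 (movabs)
uint64_t via_imm64(void) {
  uint64_t v;
  __asm__("movabsq $0xDEADBEEFFEEDFACE, %0" : "=r"(v));
  return v;
}

// 7-byte load; the "m" constraint keeps kVal in memory, which typically
// becomes a RIP-relative reference in position-independent code
uint64_t via_m64(void) {
  uint64_t v;
  __asm__("movq %1, %0" : "=r"(v) : "m"(kVal));
  return v;
}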


Compilers usually choose to generate 64-bit constants with a mov r64, imm64. (Related: What are the best instruction sequences to generate vector constants on the fly?, but in practice those never come up for scalar integer because there's no short single-instruction way to get a 64-bit -1.)

That's generally the right choice, although in a long-running loop where you expect the constant to stay hot in cache it could be a win to load it from .rodata. Especially if that lets you do something like and rax, [constant] instead of movabs r8, imm64 / and rax, r8.
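
As a rough illustration of that last point (GNU inline asm again; mask_and and kMask are made-up names), the memory-source form folds the constant into the and itself, saving both the scratch register and the separate movabs:

#include <stdint.h>

static const uint64_t kMask = 0x00FFFFFFFFFFFF00ull;  // arbitrary example mask

// One micro-fused load+ALU instruction (and rax, [rip + kMask])
// instead of movabs r8, imm64 / and rax, r8
uint64_t mask_and(uint64_t x) {
  __asm__("andq %1, %0" : "+r"(x) : "m"(kMask));
  return x;
}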

If your 64-bit constant is an address, use a RIP-relative lea instead, if possible. lea rax, [rel my_symbol] in NASM syntax, lea my_symbol(%rip), %rax in AT&T.
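
For example (a sketch; my_symbol and addr_of are made-up names), this is what compilers already emit when you take a global's address in position-independent code:

extern char my_symbol[];  // some global defined elsewhere

// In PIC/PIE code this typically compiles to a single
//   lea rax, [rip + my_symbol]
// instead of a 10-byte movabs of the absolute address
void *addr_of(void) {
  return my_symbol;
}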


The surrounding code matters a lot when considering tiny sequences of asm, especially when they compete for different throughput resources.

Peter Cordes
  • @SpilledMango: So you were testing a loop that called these functions?? Call/ret have more limited throughput than these instructions (1 per 2 clocks on IvB, according to Agner Fog). You can get some weird effects with call/ret of tiny functions. See this case for Skylake: https://stackoverflow.com/questions/45442458/loop-with-function-call-faster-than-an-empty-loop/45529487. Putting a `call` in a loop actually sped it up, by lowering average store-forwarding latency. So be careful you're measuring the right thing. – Peter Cordes Sep 26 '17 at 19:35
  • Also, read, then read again and finally read one last time what Peter wrote at the end: _The surrounding code matters a lot when considering tiny sequences of asm, especially when they compete for different throughput resources._ It's certainly "interesting" to ask which instruction or small asm idiom is _faster_, but understand that it often depends on the context. I think a lot of these questions come from the idea that if you want to do XYZ, you can determine the fastest way to do X, and Y and Z and string them together to get the fastest XYZ. This often simply isn't true. – BeeOnRope Sep 26 '17 at 21:53
  • It comes from a mental model where series of instructions take a fixed period of time, not overlapping with or depending on adjacent blocks, and not affecting other bits of code "at a distance". All of those assumptions are often false. It's still very important to know what is _usually_ fast, or to extend your mental model so you can dismiss slow approaches immediately, but the best way to a fast XYZ is to try various X, Y and Z, all together in the final XYZ, with realistic input data. – BeeOnRope Sep 26 '17 at 21:54