Deleting initialization leads to AVX2 FMA performance drop. Why?

Question

The code is on Compiler Explorer: https://godbolt.org/z/d6bx9vh1s. Feel free to browse, edit, and time it.

I wrote a piece of code to measure the peak AVX2 FMA throughput. However, I found that deleting the xor section (the `vxorps` instructions that zero ymm0 to ymm9 before the loop) causes a huge performance drop, from 100+ GFLOPS down to about 1 GFLOPS.

#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
  int t = 1 << 20;

  std::chrono::high_resolution_clock::time_point t1 =
      std::chrono::high_resolution_clock::now();
  asm volatile(R"(
vxorps %%ymm0, %%ymm0, %%ymm0
vxorps %%ymm1, %%ymm1, %%ymm1
vxorps %%ymm2, %%ymm2, %%ymm2
vxorps %%ymm3, %%ymm3, %%ymm3
vxorps %%ymm4, %%ymm4, %%ymm4
vxorps %%ymm5, %%ymm5, %%ymm5
vxorps %%ymm6, %%ymm6, %%ymm6
vxorps %%ymm7, %%ymm7, %%ymm7
vxorps %%ymm8, %%ymm8, %%ymm8
vxorps %%ymm9, %%ymm9, %%ymm9

loop:

vfmadd231ps %%ymm0, %%ymm0, %%ymm0
vfmadd231ps %%ymm1, %%ymm1, %%ymm1
vfmadd231ps %%ymm2, %%ymm2, %%ymm2
vfmadd231ps %%ymm3, %%ymm3, %%ymm3
vfmadd231ps %%ymm4, %%ymm4, %%ymm4
vfmadd231ps %%ymm5, %%ymm5, %%ymm5
vfmadd231ps %%ymm6, %%ymm6, %%ymm6
vfmadd231ps %%ymm7, %%ymm7, %%ymm7
vfmadd231ps %%ymm8, %%ymm8, %%ymm8
vfmadd231ps %%ymm9, %%ymm9, %%ymm9

addl $-1, %0
jne loop
  )" ::"r"(t));
  std::chrono::high_resolution_clock::time_point t2 =
      std::chrono::high_resolution_clock::now();

  int64_t flops_per_iter = 10 * 8 * 2;
  int64_t flops = flops_per_iter * t;
  double seconds =
      std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)
          .count();
  double flops_per_second = flops / seconds;
  printf("%.4f GFLOPS\n", flops_per_second / (1e9));

  return 0;
}

The result is 100+ GFLOPS. But if you delete the xor part:

#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
  int t = 1 << 20;

  std::chrono::high_resolution_clock::time_point t1 =
      std::chrono::high_resolution_clock::now();
  asm volatile(R"(
loop:

vfmadd231ps %%ymm0, %%ymm0, %%ymm0
vfmadd231ps %%ymm1, %%ymm1, %%ymm1
vfmadd231ps %%ymm2, %%ymm2, %%ymm2
vfmadd231ps %%ymm3, %%ymm3, %%ymm3
vfmadd231ps %%ymm4, %%ymm4, %%ymm4
vfmadd231ps %%ymm5, %%ymm5, %%ymm5
vfmadd231ps %%ymm6, %%ymm6, %%ymm6
vfmadd231ps %%ymm7, %%ymm7, %%ymm7
vfmadd231ps %%ymm8, %%ymm8, %%ymm8
vfmadd231ps %%ymm9, %%ymm9, %%ymm9

addl $-1, %0
jne loop
  )" ::"r"(t));
  std::chrono::high_resolution_clock::time_point t2 =
      std::chrono::high_resolution_clock::now();

  int64_t flops_per_iter = 10 * 8 * 2;
  int64_t flops = flops_per_iter * t;
  double seconds =
      std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)
          .count();
  double flops_per_second = flops / seconds;
  printf("%.4f GFLOPS\n", flops_per_second / (1e9));

  return 0;
}

The performance drops to about 1 GFLOPS.

This is so strange.

I get 103 GFLOPS with the second one and `-O3 -march=native -mavx512f -mprefer-vector-width=512 -fno-math-errno`. What if the uninitialized values are denormals? I heard that Intel is not good at computing on them. Also, isn't that too few operations to measure anything? Why don't you use a loop to warm up the CPU first, then repeat the measurement several times to reduce measurement error? Maybe it was just the uncertainty of the server load (it could be another user running AVX512 on the same core)? — huseyin tugrul buyukisik, May 30 '22 at 11:21
1. Notice that I wrote the loop inside the assembly code; 2. Nobody's using the server except me. — tigertang, May 30 '22 at 11:26
Okay. Fine. But the second piece of code consistently gives about 1 GFLOPS on my PC. — tigertang, May 30 '22 at 11:27
The same phenomenon happens on godbolt compiler explorer as well. Just try it. — tigertang, May 30 '22 at 11:30
BTW - even though this extended asm block *might* work, it's technically undefined behaviour. You don't specify the registers you modify, and you modify an input-only register. That said, floating-point (FMA) units are always going to handle 'zero' values much more efficiently. I would expect using `vxorps` to zero a register to be handled as a special (fast) case. In fact, if Cordes hasn't linked [this answer](https://stackoverflow.com/a/33668295/906839) yet, it means he's probably sleeping:) — Brett Hale, May 30 '22 at 12:17
@BrettHale Of course it works. It precisely reflects the maximum AVX2 performance of your CPU. Also, after I added a clobber list and initialization to this code, I found that whatever values ymm0 to ymm9 are initialized with, the result is fast. But with no initialization it is always slow. — tigertang, May 30 '22 at 13:57
@BrettHale: Other than subnormals (which take microcode assists), FMA performance is not data-dependent. Given the factor of 100, it probably is subnormals. Zero isn't special vs. other finite values or Inf or NaN, but it does avoid subnormals. xor-zeroing isn't inside the loop, so its performance doesn't matter. Init with other values like `-1.0` would be fine, or even something that grows to +inf as you iterate `a = a*a + a`, like `+1.0`. `-ffast-math` would treat subnormals as zero if they ever showed up, restoring performance even without init. — Peter Cordes, May 30 '22 at 15:30
And BTW, I thought from the title the problem might be [Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says](https://stackoverflow.com/q/64116679) but this doesn't leave any unwritten regs after the first iteration. — Peter Cordes, May 30 '22 at 15:35
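
The subnormal hypothesis from the comments can be illustrated with a small portable sketch. It is not the asker's benchmark and does not measure speed; it only shows, under the assumption that leftover register contents can be small integer-like bit patterns, that such patterns are subnormal floats, and that the recurrence `a = a*a + a` computed by each `vfmadd231ps` lane never leaves the subnormal range once it starts there. The helper names `float_from_bits`, `is_subnormal`, and `iterate_fma` are hypothetical and introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret a raw 32-bit pattern as a float,
// the way one lane of an uninitialized ymm register would be read.
inline float float_from_bits(std::uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// True when the value is subnormal (exponent field 0, mantissa nonzero).
inline bool is_subnormal(float f) {
  return std::fpclassify(f) == FP_SUBNORMAL;
}

// Hypothetical helper: apply a = a*a + a with a single rounding per step,
// `iters` times. This mirrors one lane of
// `vfmadd231ps %%ymmN, %%ymmN, %%ymmN`.
inline float iterate_fma(float a, int iters) {
  assert(iters >= 0);
  for (int i = 0; i < iters; ++i) {
    a = std::fma(a, a, a);
  }
  return a;
}
```

With the default floating-point environment (no flush-to-zero), a call such as `iterate_fma(float_from_bits(1u), 1000)` stays subnormal on every step, because `a*a` is far too small to change `a`. Starting from `0.0f` or `-1.0f` the value is zero from the first step onward, and starting from `+1.0f` it grows to infinity within a few iterations, so none of those initializations keep feeding subnormal inputs to the FMA units.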

