
I've developed a small benchmark. To measure the latency of the code rather than its throughput, the result of the loop inside the benchmark should be turned into zero, and the next round of my calculation should depend on that zero "result" of the previous round. MOVing the result to another register and XORing that register with itself doesn't work, because today's CPUs recognize that an XOR of a register with itself doesn't depend on the preceding instructions. So I tried subtracting the register from itself, hoping that the CPU (Ryzen Threadripper 3990X) has no such shortcut for SUB as it has for XOR. I evaluated this with a separate program:

#include <iostream>
#include <chrono>
#include <cstddef>
#include <cstdint>

using namespace std;
using namespace chrono;

int main()
{
    auto start = high_resolution_clock::now();
    for( size_t i = 1'000'000'000; i--; )
        __asm
        {
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
        }
    // 1e9 iterations x 10 instructions = 1e10 instructions, so this is ns per instruction
    double ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / 10'000'000'000.0;
    cout << ns << endl;
}

"Unfortunately" the CPU takes a shortcut here as well: each instruction takes about 0.06 ns, i.e. the CPU executes about six sub eax, eax per clock cycle (4.3 GHz).

So is there an instruction that produces zero but still depends on the preceding instruction on a modern CPU?

Peter Cordes
Bonita Montero
    Those aren't "serializing". In x86 terminology, that means blocking exec of later instructions, and draining the store buffer. (i.e. serializing / flushing the whole pipeline). The term you're looking for is carrying a dependency, or not dep-breaking. I'd recommend looking at sequences https://uops.info/ uses to benchmark latency, e.g. https://uops.info/html-lat/ZEN+/DIV_R64-Measurements.html#lat2-%3E2 uses AND with a register they zeroed earlier. (For their throughput tests they unroll a lot more, but a small chain like this is similar to how they use nanobench for latency.) – Peter Cordes Mar 11 '22 at 13:53
    Related: [Make a register depend on another one without changing its value](https://stackoverflow.com/q/51646598). (And my answer happens to start with an answer to your question.) Possible duplicate of [What is the best way to set a register to zero in x86 assembly: xor, mov or and?](https://stackoverflow.com/q/33666617) which covers the fact that AND isn't dep-breaking. – Peter Cordes Mar 11 '22 at 13:58
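The variant mentioned in the first comment (AND with a register that was zeroed earlier) could look roughly like the sketch below. It assumes GCC or Clang extended inline asm on x86 rather than the MSVC `__asm` block from the question, and `and_with_zeroed_register` is a made-up name for illustration. In a real latency loop the zero register would be set up once, outside the loop.

```cpp
#include <cstdint>

// Hypothetical helper (name invented for this sketch): returns 0 by ANDing
// `v` with a register that holds zero, so the result is still computed
// from `v` instead of being produced by a zeroing idiom.
inline std::uint32_t and_with_zeroed_register(std::uint32_t v)
{
    std::uint32_t zero = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // The empty asm hides the constant from the optimizer, so `zero`
    // ends up in a register instead of being folded into an immediate.
    asm volatile("" : "+r"(zero));
    // AT&T syntax: andl src, dst  ->  v &= zero
    asm volatile("andl %1, %0" : "+r"(v) : "r"(zero) : "cc");
#else
    // Fallback for other targets: same value, but the compiler is free
    // to optimize the dependency away.
    v &= zero;
#endif
    return v;
}
```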

1 Answer


Use an and with an immediate of zero.

and eax, 0

The instructions xor eax, eax and sub eax, eax are both recognised as zeroing idioms and won't do the trick.
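As a rough sketch of how this could be wired into a latency chain, here is a version using GCC/Clang extended inline asm (an assumption; the question uses MSVC `__asm`). The names `dependent_zero` and `chained_work` are hypothetical, and the multiply is only a stand-in for whatever code is being measured.

```cpp
#include <cstdint>

// Hypothetical helper: returns 0, but via "and reg, 0", which per the
// answer above is not recognised as a zeroing idiom, so the result
// still depends on `v`.
inline std::uint32_t dependent_zero(std::uint32_t v)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // AT&T syntax: andl $0, reg  ->  v &= 0
    asm volatile("andl $0, %0" : "+r"(v) : : "cc");
#else
    // Fallback for other targets: correct value, but no guarantee that
    // the dependency on `v` survives optimization.
    v = 0;
#endif
    return v;
}

// Hypothetical latency chain: every round's input is built from the
// zeroed result of the previous round, so rounds cannot overlap.
inline std::uint32_t chained_work(std::uint32_t seed, std::uint32_t rounds)
{
    std::uint32_t x = seed;
    for (std::uint32_t i = 0; i < rounds; ++i) {
        std::uint32_t r = x * 2654435761u; // stand-in for the code under test
        x = seed + dependent_zero(r);      // adds 0, but has to wait for r
    }
    return x;
}
```

Since the added value is always zero, `chained_work(seed, n)` returns `seed` unchanged; only the timing is affected by the extra dependency.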

fuz