I've written a small benchmark. The result of the loop inside it should be turned into zero, and the next round of my calculation should depend on that zero "result" of the previous round, so that I measure the latency of the code rather than its throughput. MOVing the result to another register and XORing that register with itself doesn't work, because today's CPUs recognize that an XOR of a register with itself doesn't depend on the preceding instructions. So I tried subtracting the register from itself, hoping that the CPU (Ryzen Threadripper 3990X) has no such shortcut for SUB as it has for XOR. I tested this with a separate program:
#include <iostream>
#include <chrono>
#include <cstddef>
#include <cstdint>
using namespace std;
using namespace chrono;
int main()
{
    auto start = high_resolution_clock::now();
    // 1e9 iterations, each executing ten "sub eax, eax" back to back
    for( size_t i = 1'000'000'000; i--; )
        __asm
        {
            sub eax, eax
            sub eax, eax
            sub eax, eax
            sub eax, eax
            sub eax, eax
            sub eax, eax
            sub eax, eax
            sub eax, eax
            sub eax, eax
            sub eax, eax
        }
    // total time divided by 1e10 instructions = nanoseconds per instruction
    double ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / 10'000'000'000.0;
    cout << ns << endl;
}
"Unfortunately", the CPU takes a shortcut here as well: each instruction takes about 0.06 ns, i.e. at 4.3 GHz the CPU executes roughly four sub eax, eax per clock cycle.
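As a sanity check on these numbers, the conversion from time per instruction to instructions per clock cycle can be sketched like this. The function name instructions_per_cycle is made up for illustration, and the sketch assumes the core really runs at the stated clock for the whole measurement:

```cpp
#include <cassert>

// Hypothetical helper: converts a measured time per instruction (in ns)
// into instructions per clock cycle, assuming a constant core clock (in GHz).
double instructions_per_cycle( double ns_per_instruction, double clock_ghz )
{
    assert( ns_per_instruction > 0.0 && clock_ghz > 0.0 );
    // one cycle at f GHz lasts 1/f nanoseconds
    double ns_per_cycle = 1.0 / clock_ghz;
    // how many instructions fit into one cycle
    return ns_per_cycle / ns_per_instruction;
}
```

With my numbers, instructions_per_cycle( 0.06, 4.3 ) comes out at about 3.9, which is far more than one dependent sub per cycle could achieve.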
So is there an instruction that produces zero but still depends on the preceding instruction on a modern CPU?