This astonished me, because I always thought that loop should have some internal optimization.
Here are the experiments I did today. I was using Microsoft Visual Studio 2010. My operating system is 64-bit Windows 8. My questions are at the end.
First experiment:
Platform: Win32
Mode: Debug (to disable optimization)
begin = clock();
_asm
{
mov ecx, 07fffffffh
start:
loop start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;
Output: passed time: 3.583
(The number changes a little with each run, but it stays about the same size.)
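The begin/end pattern used in these experiments can be wrapped in a small helper so that every variant is timed the same way. This is only a sketch: time_seconds and countdown are names made up here, and countdown is a plain C++ stand-in for the loop body, since the _asm blocks compile only as 32-bit code in VC++.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: time one call of `body` with clock(), as the experiments do.
template <typename F>
double time_seconds(F body) {
    const std::clock_t begin = std::clock();
    body();
    const std::clock_t end = std::clock();
    return double(end - begin) / CLOCKS_PER_SEC;
}

// Portable stand-in for the asm loop: count down from n to zero.
// `volatile` keeps the compiler from deleting the otherwise empty loop.
unsigned countdown(unsigned n) {
    assert(n > 0);  // like the asm versions, the body always runs at least once
    volatile unsigned counter = n;
    do {
        counter = counter - 1;
    } while (counter != 0);
    return counter;
}
```

A call such as time_seconds([] { countdown(0x7fffffff); }) then returns the elapsed time in seconds; the absolute numbers will not match the asm versions, because the volatile counter lives in memory.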
Second experiment:
Platform: Win32
Mode: Debug
begin = clock();
_asm
{
mov ecx, 07fffffffh
start:
dec ecx
jnz start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;
Output: passed time: 0.903
Third and fourth experiments:
Just change the platform to x64. Since VC++ does not support 64-bit inline assembly, I had to put the loop in a separate *.asm file. The results were the same, though.
From this point I began to think it through: loop is about 4 times slower than dec ecx / jnz start, and the only difference between them, AFAIK, is that dec ecx changes flags while loop doesn't. To imitate this preservation of flags, I did the
Fifth experiment:
Platform: Win32 (from here on I assume that the platform has no effect on the result)
Mode: Debug
begin = clock();
_asm
{
mov ecx, 07fffffffh
pushf
start:
popf
; do the loop here
pushf
dec ecx
jnz start
popf
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;
Output: passed time: 22.134
This is understandable, because pushf and popf have to go through memory. But suppose, for example, that the register eax does not need to be preserved across the loop (which can be achieved by allocating the registers better), and that the OF flag is not needed in the loop (this simplifies things, since OF is not in the lower 8 bits of the flags register). Then we may use lahf and sahf to keep the flags, so I did the
Sixth experiment:
Platform: Win32
Mode: Debug
begin = clock();
_asm
{
mov ecx, 07fffffffh
lahf
start:
sahf
; do the loop here
lahf
dec ecx
jnz start
sahf
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;
Output: passed time: 1.933
This is still much better than using loop directly, right?
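Why lahf and sahf cannot carry OF can be seen from the bit positions in EFLAGS. The sketch below models the two instructions in plain C++; the function and constant names are invented for illustration, and only the bit positions come from the architecture.

```cpp
#include <cassert>
#include <cstdint>

// EFLAGS bit masks (architectural bit positions).
const std::uint32_t kCF = 1u << 0;
const std::uint32_t kPF = 1u << 2;
const std::uint32_t kAF = 1u << 4;
const std::uint32_t kZF = 1u << 6;
const std::uint32_t kSF = 1u << 7;
const std::uint32_t kOF = 1u << 11;
const std::uint32_t kLowFlags = kSF | kZF | kAF | kPF | kCF;

static_assert((kOF & 0xFFu) == 0, "OF lies outside the low byte of EFLAGS");

// Model of lahf: AH receives SF:ZF:0:AF:0:PF:1:CF (bit 1 always reads as 1).
std::uint8_t lahf_model(std::uint32_t eflags) {
    return static_cast<std::uint8_t>((eflags & kLowFlags) | 0x02u);
}

// Model of sahf: the five low flags are replaced from AH, everything else is kept.
std::uint32_t sahf_model(std::uint32_t eflags, std::uint8_t ah) {
    return (eflags & ~kLowFlags) | (ah & kLowFlags);
}

// Round trip through AH: the low flags survive, OF does not travel with them.
void check_round_trip() {
    const std::uint32_t flags = kOF | kZF | kCF;
    const std::uint8_t ah = lahf_model(flags);
    assert((ah & kZF) && (ah & kCF));          // low flags are captured
    assert(sahf_model(0, ah) == (kZF | kCF));  // OF is missing afterwards
}
```

So the sixth experiment preserves SF, ZF, AF, PF and CF across the dec, but OF is whatever dec ecx leaves behind.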
The last experiment I did also tries to keep the OF flag.
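One way to carry OF in a byte register is to store a value whose increment overflows exactly when OF should be set again: an 8-bit inc sets OF only when the value goes from 7Fh to 80h. The following is a rough C++ model of that idea, not the asm itself; IncResult, inc8 and encode_of are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Result of an 8-bit inc: the new value plus the OF it would produce.
struct IncResult {
    std::uint8_t value;
    bool overflow;
};

// Model of `inc al`: OF is set only when the signed value wraps from +127 to -128.
IncResult inc8(std::uint8_t al) {
    IncResult r;
    r.value = static_cast<std::uint8_t>(al + 1);
    r.overflow = (al == 0x7F);
    return r;
}

// Hypothetical encoding: the byte to store so that a later inc8 re-creates `of`.
std::uint8_t encode_of(bool of) {
    const std::uint8_t saved = of ? 0x7F : 0x00;
    assert(inc8(saved).overflow == of);  // the round trip must give OF back
    return saved;
}
```

Note that inc8(0xFF) wraps to 0 without setting overflow, so 0FFh is not a usable marker for "OF was set".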
Seventh experiment:
Platform: Win32
Mode: Debug
begin = clock();
_asm
{
mov ecx, 07fffffffh
start:
inc al
sahf
; do the loop here
lahf
mov al, 07Fh ; 7Fh + 1 overflows, so the next inc al sets OF again
jo dec_ecx
mov al, 0
dec_ecx:
dec ecx
jnz start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;
Output: passed time: 3.612
This result is for the worst case, i.e. OF is not set on any iteration. And it is almost the same as using loop directly ...
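To put the five Win32 timings on a common scale, they can be divided by the dec ecx / jnz start result. The numbers below are the single-run outputs quoted above; the constant names and the slowdown helper are made up for this sketch.

```cpp
#include <cassert>

// Timings in seconds, copied from the outputs reported above (one run each).
const double kLoop        = 3.583;   // experiment 1: loop
const double kDecJnz      = 0.903;   // experiment 2: dec ecx / jnz (baseline)
const double kPushfPopf   = 22.134;  // experiment 5: pushf / popf around dec
const double kLahfSahf    = 1.933;   // experiment 6: lahf / sahf around dec
const double kLahfSahfOF  = 3.612;   // experiment 7: lahf / sahf plus OF handling

// Hypothetical helper: how many times slower a variant is than the baseline.
double slowdown(double seconds) {
    assert(kDecJnz > 0.0);
    return seconds / kDecJnz;
}
```

With these inputs, slowdown(kLoop) comes out a little under 4, slowdown(kLahfSahf) a little over 2, and slowdown(kPushfPopf) between 24 and 25.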
So my questions are:
Am I right that the ONLY advantage of using loop is that it takes care of the flags (actually only the 5 of them that dec has an effect on)?
Is there a longer form of lahf and sahf which also moves OF, so that we may get rid of loop entirely?