
This astonished me, because I had always assumed that loop would have some internal optimization.

Here are the experiments I did today. I was using Microsoft Visual Studio 2010, and my operating system is 64-bit Windows 8. My questions are at the end.

First experiment:

Platform: Win32
Mode: Debug (to disable optimization)

begin = clock();
_asm
{
    mov ecx, 07fffffffh
start:
    loop start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;

Output: passed time: 3.583
(The number varies a little from run to run, but it stays in the same ballpark.)
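
(For completeness: the snippets omit the surrounding declarations; begin and end are just clock_t variables. A minimal self-contained program for the first experiment looks roughly like this, where the scaffolding around the _asm block is only the obvious reconstruction, not part of what was measured.)

#include <ctime>
#include <iostream>
using namespace std;

int main()
{
    clock_t begin = clock();
    _asm
    {
        mov ecx, 07fffffffh   ; 2^31 - 1 iterations
    start:
        loop start            ; decrements ecx and branches while ecx != 0, without touching the flags
    }
    clock_t end = clock();
    cout << "passed time: " << double(end - begin) / CLOCKS_PER_SEC << endl;
    return 0;
}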

Second experiment:

Platform: Win32
Mode: Debug

begin = clock();
_asm
{
    mov ecx, 07fffffffh
start:
    dec ecx
    jnz start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;

Output: passed time: 0.903

Third and fourth experiments:

These just change the platform to x64. Since VC++ does not support 64-bit inline assembly, I had to put the loop in a separate *.asm file. In the end, though, the results are essentially the same.
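
Concretely, the x64 setup for the loop version looks roughly like this (the file and symbol names here are only illustrative):

; loop64.asm, assembled with ml64
.code
loop_test PROC
    mov ecx, 07fffffffh     ; writing ecx zero-extends into rcx
again:
    loop again              ; in 64-bit mode, loop counts down rcx
    ret
loop_test ENDP
END

and on the C++ side:

extern "C" void loop_test();   // defined in loop64.asm

begin = clock();
loop_test();
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;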

At this point I started to think about it: loop is about 4 times slower than dec ecx / jnz start, and the only difference between the two, AFAIK, is that dec ecx modifies the flags while loop does not. To imitate this preservation of the flags, I did the

Fifth experiment:

Platform: Win32 (in the following I always suppose that the platform has no effect on the result)
Mode: Debug

begin = clock();
_asm
{
    mov ecx, 07fffffffh
    pushf               ; save the flags once before entering the loop
start:
    popf                ; restore the flags for the loop body
; do the loop here
    pushf               ; save them again before dec clobbers them
    dec ecx
    jnz start
    popf                ; balance the final pushf after the loop exits
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;

Output: passed time: 22.134

This is understandable, because pushf and popf have to go through memory. But suppose that eax does not need to be preserved across the loop body (which can be arranged by allocating the registers differently), and that OF is not needed inside the loop (this simplifies things, since OF is not in the lower 8 bits of EFLAGS, which is all that lahf/sahf cover). Then we can use lahf and sahf to preserve the flags, so I did the

Sixth experiment:

Platform: Win32
Mode: Debug

begin = clock();
_asm
{
    mov ecx, 07fffffffh
    lahf                ; save SF, ZF, AF, PF, CF into AH
start:
    sahf                ; restore them from AH for the loop body
; do the loop here
    lahf                ; save them again before dec clobbers them
    dec ecx
    jnz start
    sahf                ; restore once more after the loop exits
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;

Output: passed time: 1.933

This is still much better than using loop directly, right?

And the last experiment I did was to try to preserve the OF flag as well.

Seventh experiment:

Platform: Win32
Mode: Debug

begin = clock();
_asm
{
    mov ecx, 07fffffffh
start:
    inc al              ; al == 7Fh overflows to 80h and re-sets OF; otherwise OF stays clear
    sahf                ; restore SF, ZF, AF, PF, CF from AH
; do the loop here
    lahf                ; save SF, ZF, AF, PF, CF into AH
    mov al, 7Fh         ; assume OF was set by the body ...
    jo dec_ecx
    mov al, 0           ; ... it was not, so remember that instead
dec_ecx:
    dec ecx
    jnz start
}
end = clock();
cout<<"passed time: "<<double(end - begin)/CLOCKS_PER_SEC<<endl;

Output: passed time: 3.612

This result is the worst case, i.e. the one in which OF is never actually set during the loop. And it is almost as slow as using loop directly ...

So my questions are:

  1. Am I right that the ONLY advantage of the loop instruction is that it takes care of the flags (actually only the five of them that dec affects)?

  2. Is there a longer form of lahf and sahf that also transfers OF, so that we can get rid of loop entirely?

WhatsUp
  • *because pushf and popf have to play with the memory.* Yes, the loop-carried dependency includes store-forwarding from pushf to popf, then through EFLAGS to the next pushf. But the biggest cost here is `popf` itself, which is super slow because it writes all of EFLAGS, including IF and other machine-status flags, not just the condition codes. (But has to not fault if you aren't trying to change them from user-space). For example, `popf` on Sandybridge is 9 uops with one per 18 clocks throughput (http://agner.org/optimize/) – Peter Cordes May 19 '18 at 19:44

1 Answer


Historically, on the 8088 and 8086 processors, LOOP was an optimization because it only took one cycle longer than a conditional branch, whereas putting a DEC CX before a branch would cost three or four cycles (depending upon the state of the prefetch queue).

Today's processors, however, work very differently from the 8086. For several processor generations now, manufacturers have built machines that can correctly execute essentially every documented instruction the 8088/8086 and its descendants have ever possessed, but they have focused their effort on improving the performance of only the most useful instructions. For a variety of reasons, the amount of circuitry Intel or AMD would have to add to a modern CPU to make the LOOP instruction run as efficiently as DEC CX/JNZ would likely exceed the total amount of circuitry in the entire 8086, probably by a huge margin.

Rather than increase the complexity of their high-performance CPUs, the manufacturers include a much simpler, but slower, processing unit which can handle "obscure" instructions. While a high-performance CPU needs a lot of circuitry to let the execution of multiple instructions overlap, except where later instructions need results from earlier computations (and must wait until they're available), an "obscure instructions processing unit" can avoid the need for such circuitry by simply executing instructions one at a time.

supercat
  • Short version: "RISCy" instructions have gotten a lot more optimizations over the years than the "CISCy" instructions. Therefore, "obvious" oldschool optimizations can now hurt performance. :-) – Brian Knoblauch Feb 05 '14 at 14:09
  • @BrianKnoblauch: Some CISCy instructions have gotten optimized as well (e.g. REP MOVSD). I think the main issue is that if a CISCy instruction represented a common operation that could be optimized to perform *better* than a sequence of RISCy ones, such optimization was worthwhile (it's easier to make REP MOVSD run at two cycles/word than make a 4-5 instruction loop run at that speed) but if the CISC couldn't outperform the RISC, one could gain just as much performance by suggesting people avoid the CISC instructions as by adding hardware to make them as fast as the RISC. – supercat Feb 05 '14 at 16:32
  • @BrianKnoblauch: Incidentally, given the popularity of the JVM and .NET runtime, I've wondered whether a little hardware oriented toward object-oriented programming and garbage collection might be worthwhile. For example, since few programs need anywhere near four billion distinct objects, nor do they need individual objects over four gig, having 32-bit object references, and a 64-bit pointer type combining an object reference and offset, could offer better cache efficiency than using 64-bit object references. Further, if objects could be cheaply tagged any time a reference was copied... – supercat Feb 05 '14 at 16:44
  • ...a concurrent GC could avoid "stop the world" pauses. Such tagging would likely require a separate cache with somewhat "unusual" semantics (if two cores try to tag the same object at about the same time, neither thread would need to know about the other thread's having tagged the object "first"--unlike a `CompareExchange` scenario--but it would be necessary that the object be recognized as having been tagged). Not exactly a RISCy concept in the normal sense of the word, but nonetheless a useful sort of operation that other RISCy instructions can't do well. – supercat Feb 05 '14 at 16:48
  • Modern Intel already macro-fuses dec-and-branch into a single uop, so that's not the reason. Possibly it would take a lot of transistors to make a version of that which didn't update flags, though. Fun fact: AMD Bulldozer/Ryzen *do* have efficient `loop`, even though compilers never use it. See the linked duplicate. – Peter Cordes May 19 '18 at 19:50