Why does turbofan peel small loops?

Question

When compiling a loop, turbofan seems to peel the first loop iteration most of the time. For example a loop like:

function fill256(int32Array) {
  var i = 255;
  do {
    int32Array[i] = 0;
  } while(--i >= 0);
}

gets optimized to machine code like this:

# rdx is int32Array
0x13f38bcef7a    5a  488b4a2f       REX.W movq rcx,[rdx+0x2f]
0x13f38bcef7e    5e  488b7a3f       REX.W movq rdi,[rdx+0x3f]
0x13f38bcef82    62  4c8b4237       REX.W movq r8,[rdx+0x37]
# peeled iteration
0x13f38bcef86    66  4881f9ff000000 REX.W cmpq rcx,0xff
0x13f38bcef8d    6d  0f8614010000   jna 0x13f38bcf0a7  <+0x187>
0x13f38bcef93    73  4e8d0c07       REX.W leaq r9,[rdi+r8*1]
0x13f38bcef97    77  41c781fc03000000000000 movl [r9+0x3fc],0x0  # dword store
0x13f38bcefa2    82  41b9fe000000   movl r9,0xfe
0x13f38bcefa8    88  e906000000     jmp 0x13f38bcefb3  <+0x93>
0x13f38bcefad    8d  0f1f00         nop

# loop proper
0x13f38bcefb0    90  4d8bcb         REX.W movq r9,r11
 # first iteration entry point:
0x13f38bcefb3    93  493b65e0       REX.W cmpq rsp,[r13-0x20] (external value (StackGuard::address_of_jslimit()))
0x13f38bcefb7    97  0f868b000000   jna 0x13f38bcf048  <+0x128>
0x13f38bcefbd    9d  458d59ff       leal r11,[r9-0x1]
0x13f38bcefc1    a1  4d63e1         REX.W movsxlq r12,r9
0x13f38bcefc4    a4  4c3be1         REX.W cmpq r12,rcx
0x13f38bcefc7    a7  0f83e6000000   jnc 0x13f38bcf0b3  <+0x193>
0x13f38bcefcd    ad  4e8d0c07       REX.W leaq r9,[rdi+r8*1]
0x13f38bcefd1    b1  43c704a100000000 movl [r9+r12*4],0x0       # dword store
0x13f38bcefd9    b9  4183fb00       cmpl r11,0x0
0x13f38bcefdd    bd  7dd1           jge 0x13f38bcefb0  <+0x90>

This is not specific to the particular loop construct, but is seemingly done for all loops with small bodies. The V8 source code comments just say this is an optimization but what does it actually accomplish other than bloating the code size?

I know that peeling can be beneficial if it introduces new invariants.

Are you sure this is V8's fully optimized final version of the loop? It's horrible! 3 conditional branches per iteration instead of checking the array bounds once before the loop, and redoing so much work every iteration (sign-extension of the index, and `leaq r9,[rdi+r8*1]` calculation of the array base address). I was hoping the peeled dword store was going to set up for an even number of stores allowing unrolling or a qword store, but no, this is just trash. — Peter Cordes, May 12 '20 at 14:15
@PeterCordes I ran the the function 2000 times with alternating input arrays, so I doubt it would get optimized any further past that point. There are indeed some missed optimizations in the compiled loop, but addressing those would cause some more work for the compiler. I wonder about the loop peeling specifically because it would be *easier* to just not do it, so there has to be a good reason. — Stefan Friesel, May 12 '20 at 14:43
This *has* to be the worst-optimized `bzero()` ever compiled. — EOF, May 12 '20 at 14:49
I understood the question; I was hoping that there might be a better version of the asm later where peeling did prove useful. IDK if only 2000 calls is sufficient; I'd try millions, giving it at least a second, unless you're sure that won't matter. — Peter Cordes, May 12 '20 at 14:52
For loops with a trip-count that might be 0 or 1, peeling can certainly help, but here its a compile-time constant. Maybe it still makes sense for V8 to just always peel the first iteration without checking? As you say, tradeoff between time spent JITting vs. time spent running. Is it possible that V8 does poorly on `do{}while()` loops? Maybe it's used to just peeling the loop condition to check for 0 trips before entering the loop, but do{}while is [already in idomatic-for-asm form](https://stackoverflow.com/q/47783926), unlike a `for(){}` loop. — Peter Cordes, May 12 '20 at 14:53
@PeterCordes I tried with 100 million iterations (16 seconds) and the code did not get optimized further. — Stefan Friesel, May 12 '20 at 15:15
I think I'm getting close to an answer. I just found that v8 has an option `--no-turbo-loop-peeling`. Disabling loop peeling makes the generated code significantly worse, adding two additional conditional jumps in the loop body. Maybe the ability to hoist any invariant code in V8 depends on loop peeling. — Stefan Friesel, May 12 '20 at 16:06
Changing the first loop iteration to do nothing has the same effect as providing the flag. The compiler peels the noop iteration and then seems to be unable to hoist even the most trivial invariant operations from the remaining loop. — Stefan Friesel, May 12 '20 at 16:20
@PeterCordes JS arrays have no size. Maybe that's why there is a `cmp r12, rcx` check on the index and a reload of the base on every loop (in case the buffer backing the array gets reallocated). And maybe also why there a `address_of_jslimit` check on `rsp`. I don't know, TBH. But JS is a complex language and one should first rule out any possible language artifact (for example by using an `ArrayBuffer` or maybe even by using an array in such a way V8 can infer its size). — Margaret Bloom, May 13 '20 at 07:56
@MargaretBloom: That's a good point, but the upper and lower bounds of the range that gets accessed is known ahead of the loop, so could be checked once to ensure allocation. i.e. hoisting the checks should be easy, if the compiler bothered to try. But yeah, maybe it doesn't. — Peter Cordes, May 13 '20 at 12:29
@MargaretBloom The input is a [TypedArray](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray) (Int32Array) and those do have a [fixed size](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray/length#Description) after creation. This is why the code does not reload rcx from [rdx+0x2f] on each iteration. The check on address_of_jslimit in this case is just a vehicle of the runtime to be able to interrupt the loop at a defined point and is common to all loops. — Stefan Friesel, May 14 '20 at 06:06
@StefanFriesel Good point, I didn't pay attention to the name of the parameter :) — Margaret Bloom, May 14 '20 at 10:12

score 2 · Answer 1 · answered May 14 '20 at 09:50

It seems like peeling the first iteration is a requirement in turbofan to hoist loop statements.

This presentation by the compiler tech lead seems to indicate that peeling was a new way of more safely hoisting code. Slide 17:

Loop peeling: No more deoptimization loops because of aggressive hoisting

And indeed defeating the peeling step has a huge impact on performance. This:

// var buf = new Int32Array(10000);
for (var i = 0; i < 10; ++i) {
  if (i === 0) continue;
  for (var j = 0; j < 1000; ++j) {
    if (j === 0) continue;
    buf[i*j] = i ^ j;
  }
}

is three times slower than this:

for (var i = 0; i < 10; ++i)
  for (var j = 0; j < 1000; ++j)
    buf[i*j] = i ^ j;

because the type and range checks on buf remain in the inner loop.

Why does turbofan peel small loops?

1 Answers1