PHP interpreter micro-optimizations in code

Question

By stumbling on this so thread i decided to write similar test in PHP. My test code is this:

// Slow version
$t1 = microtime(true);
for ($n = 0, $i = 0; $i < 20000000; $i++) {
    $n += 2 * ($i * $i);
}
$t2 = microtime(true);
echo "n={$n}\n";

// Optimized version
$t3 = microtime(true);
for ($n = 0, $i = 0; $i < 20000000; $i++) {
    $n += $i * $i;
}
$n *= 2;
$t4 = microtime(true);
echo "n={$n}\n";

$speedup = round(100 * (($t2 - $t1) - ($t4 - $t3)) / ($t2 - $t1), 0);
echo "speedup: {$speedup}%\n";

Results

in PHP 2 * ($i * $i) version runs quite similar like 2 * $i * $i,
so PHP interpreter isn't optimizing bytecode as JVM in Java
Even when I optimized code manually - I've got ~ 8% speedup, when Java's version gets ~ 16% speedup. So PHP version gets about 1/2 speedup factor of that in Java's code.

Rationale for optimization

I will not go into many details, but ratio of multiplications in optimized and un-optimized code is ->

1 summation : 3/4
2 summations: 4/6
3 summations: 5/8
4 summations: 6/10
...

And in general:

where n is number of summations in a loop. To be formula useful to us - we need to calculate limit of it when N approaches infinity (to replicate situation that we do A LOT of summations in a loop). So :

So we get conclusion that in optimized code there must be 50% less multiplications.

Questions

Why PHP interpreter isn't applying code optimization ?
Why PHP speedup factor is just half of that in Java ?

Agnius Vasiliauskas · Accepted Answer · 2018-12-07T14:12:15.367

0

It's time to analyze PHP opcodes which are generated by PHP interpreter. For that you need to install VLD extension and use it from command line to generate opcodes of php script at hand.

Opcode analysis

Seems that $i++ is not the same as ++$i in terms of opcodes and memory usage. Statement $i++; generates opcodes:

 POST_INC ~4 !1
 FREE     ~4

Increases counter by 1 and saves previous value into memory slot #4. Then, because this value is never used - frees it from memory. Question - why do we need to store value if it is never used ?

Seems that indeed there is a loop penalty, so we can gain additional performance by performing loop unrolling.

Optimized test code

Changing POST_INC into ASSIGN_ADD (which don't saves additional info in memory) and performing loop unrolling, gives use such test code :

while (true) {

// Slow version
$t1 = microtime(true);
for ($n = 0, $i = 0; $i < 2000; $i+=10) {
    // loop unrolling
    $n += 2 * (($i+0) * ($i+0));
    $n += 2 * (($i+1) * ($i+1));
    $n += 2 * (($i+2) * ($i+2));
    $n += 2 * (($i+3) * ($i+3));
    $n += 2 * (($i+4) * ($i+4));
    $n += 2 * (($i+5) * ($i+5));
    $n += 2 * (($i+6) * ($i+6));
    $n += 2 * (($i+7) * ($i+7));
    $n += 2 * (($i+8) * ($i+8));
    $n += 2 * (($i+9) * ($i+9));
}
$t2 = microtime(true);
echo "{$n}\n";

// Optimized version
$t3 = microtime(true);
for ($n = 0, $i = 0; $i < 2000; $i+=10) {
    // loop unrolling
    $n += ($i+0) * ($i+0);
    $n += ($i+1) * ($i+1);
    $n += ($i+2) * ($i+2);
    $n += ($i+3) * ($i+3);
    $n += ($i+4) * ($i+4);
    $n += ($i+5) * ($i+5);
    $n += ($i+6) * ($i+6);
    $n += ($i+7) * ($i+7);
    $n += ($i+8) * ($i+8);
    $n += ($i+9) * ($i+9);
}
$n *= 2;
$t4 = microtime(true);
echo "{$n}\n";

$speedup = round(100 * (($t2 - $t1) - ($t4 - $t3)) / ($t2 - $t1), 0);
$table[$speedup]++;

echo "****************\n";
foreach ($table as $s => $c) {
  if ($s >= 0 && $s <= 20)
     echo "$s,$c\n";
}

}

Results

Script aggregates number of times CPU hit into one or other speedup value. When CPU hits vs Speedup is drawn as a graph, we get such picture:

So it is most likely that script will get 10% speedup. This means that our optimizations resulted in +2% speedup (compared to original scripts 8%).

Expectations

I'm pretty sure that all these things i've done - could be done automatically by a PHP JIT'er. I don't think that it's hard to change automatically a pair of POST_INC/FREE opcodes into one PRE_INC opcode when generating binary executable. Also it's not a miracle that PHP JIT'er could apply loop unrolling. And this is just a start of optimizations !

Hopefully there will be a JIT'er in PHP 8.0

edited Dec 07 '18 at 14:12

answered Dec 05 '18 at 18:09

Agnius Vasiliauskas

10,935
5
50
70

"outperformed" is a weird way to phrase it. You're only showing relative speedups, not absolute. Not slowing down from extra work in the loop is normally a *good* thing (e.g. because of a smart optimizer that sinks the multiply out of the loop for you, or because it creates efficient asm that does it for free as part of adding, e.g. using LEA instead of ADD when compiling to x86 asm). – Peter Cordes Dec 06 '18 at 03:50
Don't nag because of terminology. I just had in mind that 16 > 8, that's it. And, yes I agree that not slowing down from extra work in loop is a good thing. BUT this to happen we need a decent optimizer as you said, and _having JIT'er in PHP_ will give more choices for optimization. BTW, 0% speedup can also mean that **both loops runs slowly**, because bottleneck is somewhere else (not in extra operation) - type conversions and etc (you know that yourself, I believe) – Agnius Vasiliauskas Dec 06 '18 at 08:16
Yes, my main reason for complaining about *only* showing relative speedups is that 0% can mean both run slowly. It happens to work out in *this* one case that languages that compile to machine code get bigger speedups, but I don't think it's reasonable to draw that conclusion for the general case. There's no obvious causal link or mechanism here. Java gets a bigger speedup because its JIT compiler is mediocre and shoots itself in the foot (with too much unrolling and missing optimizations), not because it's *good*. C depends on the compiler (because there are multiple good ones). – Peter Cordes Dec 06 '18 at 08:24
I'm guessing that there are cases we could construct where a good optimizing compiler can optimize away a lot of stuff (so a more complicated way of writing a loop body doesn't cause a slowdown), but an interpreter doesn't get rid of it and is slower. I just think the main point of your conclusion isn't justified by anything beyond happening to fit the data in this one test. – Peter Cordes Dec 06 '18 at 08:27
I'm not trying to draw any kind of reliable conclusions from this test. Personally for me this test gave me more questions than answers. I understand that there are a lot different factors - interpreter quality, compiler quality and finally jit'er quality. The only reasonable conclusion is that if in this particular case _compiling wins_ - there must be other situations where compiling wins too. Aren't you trying to say that this test is so unique that this outcome happens only in one of billions other use-cases ? – Agnius Vasiliauskas Dec 06 '18 at 09:04
I think a good answer to this question would look at PHP interpreter internals and figure out why the speedup is so small: either it can multiply by a constant 2 quite efficiently, or just loop + other overhead is always really high and dwarfs the cost of the addition + multiplication in the loop body. IMO the "speedup" is unexpectedly low in PHP, so that's the mystery. One more operation to evaluate should be making a bigger difference, unless it gets partially optimized away at some point, unless the cost of looping is high. – Peter Cordes Dec 06 '18 at 09:30
And BTW, you're comparing different things for different languages. The Java 16% number is between `2*(i*i)` and `2*i*i`, isn't it? vs. the PHP is for sinking the `2*` in the source. And the C number is now for `signed int` with `3*i*i` instead of `2*i*i`, so less work can be folded into one LEA (shift-and-add instruction). Plus the C has signed-overflow undefined behaviour, so it depends on the compiler whether it notices and still emits the asm you expect or not. – Peter Cordes Dec 06 '18 at 09:36
Anyway sure, there will be some cases where hosting or sinking an operation out of a loop in the source is useful, because we have concrete proof that compilers and interpreters in various languages can miss such optimizations. But I don't think this answer is useful in the general case for guessing what the speedup might be in other cases. And speculating that it will become (relatively) *more* important with JITed PHP8.0 is just pure guesswork. Maybe, maybe not. What you call a "win" is a measure of how much you can gain from manual source optimization. – Peter Cordes Dec 06 '18 at 09:40
I'm not preventing others to make a deeper research into PHP internals. Others are welcome to do that (including YOU :-) ) – Agnius Vasiliauskas Dec 06 '18 at 09:42
I'm not saying that "it will become more important with JITed PHP" - you mis-understood me. I'm just saying that it's important to have a JIT'er for PHP, because in _some_ cases it _can_ produce more efficient code. That's it – Agnius Vasiliauskas Dec 06 '18 at 09:48
Sure, a JIT-compiler for PHP will improve the *absolute* performance. But this question, and your answer, isn't talking about absolute performance, or PHP vs. C. It's talking about how much there is to gain from sinking a `2*` out of a loop within the same language / implementation, relative performance. This is part of why I was complaining about terminology and "outperformed" earlier. But if that's what you mean in the 2nd half of the answer, then it's totally unrelated to the first half of the answer. – Peter Cordes Dec 06 '18 at 09:51
Even in theory we _can't_ compare code of different languages at all, because they have a _different syntax_. So comparison is just some sort of game. PHP test code is a bit different than Java's version, because PHP is not doing extra multiplication optimization, so i needed to **force** optimization for PHP to be sure that i'm comparing same semantics for Java and PHP (with optimization turned ON both in Java and PHP). Because my goal is to compare _extra operation cost_ optimizations. Otherwise we will get ~ 0% speedup in PHP, because it skips code optimization in this case. – Agnius Vasiliauskas Dec 06 '18 at 10:04
You could do the same source optimization in Java. The HotSpot JIT compiler is *not* sinking it out of the loop for you, it's just doing it less inefficiently because it doesn't need `2*i` and `i` in different registers at the same time when it's unrolling. – Peter Cordes Dec 06 '18 at 10:11
1

I have looked into PHP internals. Seems that your second scenario was right - loop overhead and other problems with memory management, bury cost of extra operation – Agnius Vasiliauskas Dec 07 '18 at 14:30

PHP interpreter micro-optimizations in code

Results

Rationale for optimization

Questions

1 Answers1

Opcode analysis

Optimized test code

Results

Expectations

Linked