2

I have the following script, and I know about the principle "Branch prediction" but it seems that's not the case here.

Why is it faster to process a sorted array than an unsorted array?

It seems to work the other way around.

When I run the following script without the sort($data) the script takes 193.23883700371 seconds to complete.

When I enable the sort($data) line the scripts takes 300.26129794121 seconds to complete.

Why is it so much slower in PHP? I used PHP 5.5 and 5.6. In PHP 7 the script is faster when the sort() is not commented out.

<?php

$size = 32768;
$data = array_fill(0, $size, null);

for ($i = 0; $i < $size; $i++) {
    $data[$i] = rand(0, 255);
}

// Improved performance when disabled
//sort($data);

$total = 0;

$start = microtime(true);

for ($i = 0; $i < 100000; $i++) {
    for ($x = 0; $x < $size; $x++) {
        if ($data[$x] >= 127) {
            $total += $data[$x];
        }
    }
}

$end = microtime(true);
echo($end - $start);
Community
  • 1
  • 1
Sander Visser
  • 4,144
  • 1
  • 31
  • 42
  • Yes just run the script with sort($data) commented VS uncommented. The sort is not taken in account for the time measurement – Sander Visser May 12 '16 at 15:43
  • The point is that the loop is way slower after a `sort()` call – Sander Visser May 12 '16 at 15:46
  • 2
    There is no earthly reason why it should be slower processing a sorted array compared with an unsorted array; as far as PHP is concerned, they are simply arrays – Mark Baker May 12 '16 at 15:54
  • 1
    @SanderVisser How many times did you run the script? Maybe when you sorted the array, there were more digits above 127, so it took longer to sum it to the total? – Chin Leung May 12 '16 at 15:55
  • 3
    In PHP an array is an ordered map. The likely culprit is that during creation the elements are relatively aligned in memory so that $data[1] is close to $data[2] the sort may just then be reassigning the index key values without changing the memory locations of the values. Now $data[1] may be far from $data[2] as a result there are memory cache misses when accessing the values, and time is lost loading the data from memory. – John Garrard May 12 '16 at 15:57
  • Sorted is quicker here http://sandbox.onlinephpfunctions.com/code/271264e6749ae6720e175f44234816114ad1f037 – Alex Blex May 12 '16 at 15:59
  • Alex, only faster when number of items is lower. When I set it back to the same value the same value (and lower the number of times through the outer loop to compensate so the page doesn't errror) sorted runs slower. So I'd say this is definitely a memory cache issue caused by the implementation choice on the array, and sort method in PHP. – John Garrard May 12 '16 at 16:05
  • 3
    In PHP everything is a Hash table, including arrays. Sorted or not shouldn't make any difference, I think. `rand()` and the `>= 127` condition will surely affect performance. – IROEGBU May 12 '16 at 16:10
  • @ChinLeung I have runned it multiple times, locally and on a VM etc – Sander Visser May 12 '16 at 16:11
  • I guess this post answers this question as well http://stackoverflow.com/questions/12842717/why-is-processing-a-sorted-array-not-faster-than-an-unsorted-array-in-python – Richard May 12 '16 at 16:17
  • @AlexBlex Yes in PHP 7 `sort` is faster, This is because of branch prediction – Sander Visser May 12 '16 at 16:22
  • 1
    I notice that the results are similar in PHP 5 if you replace in the test the `sort` call with a `shuffle` call. So this means tampering with the original order degrades performance in PHP 5. – trincot May 12 '16 at 16:23
  • 1
    guessing, I suspect you running into 'chaining issues'. A PHP `array` has to maintain two lists - 1) The 'hash key' pointing to data (constant time) and 2) - the sequential numeric index (entry order of items into the `array`. It would be interesting to time an iterator on a 'sorted array' rather than accessing by 'index'. imo, a 'PHP array' is a rather complex structure. The iterator will return them in sorted order. – Ryan Vincent May 12 '16 at 17:20
  • @RyanVincent No difference between iterator and accessing the the array by index – Sander Visser May 12 '16 at 18:42
  • @RyanVincent Hahaha me too :) that's why have asked it. Currently reading https://nikic.github.io/2012/03/28/Understanding-PHPs-internal-array-implementation.html – Sander Visser May 12 '16 at 18:58
  • @RyanVincent Except that it's those sequential numeric indices that are adjusted by the sort, shuffle, etc.... they still exists before and after the sort, it's simply that the entries they poit to are different... and it still shouldn't affect direct access time in the for loop – Mark Baker May 12 '16 at 22:17
  • @MarkBaker, I am as puzzled as others here as to what is happening. It really does seem strange that the performance i so much slower. I have no idea why. ;-/ – Ryan Vincent May 13 '16 at 03:05
  • @RyanVincent I'm stull puzzled too, not sure if it's really memory cache misses because the "references" to the items in memory are reordered and not the items them self. But that also would mean if you sort an array, then clone that array your script will run faster. which I have tested and is true..... so `$testData = $data` and then use `$testData` in the loop will improve performance. – Sander Visser May 13 '16 at 12:03
  • Sorry I was wrong! after the sort the array is always slower. – Sander Visser May 13 '16 at 12:18

2 Answers2

2

Based on my comments above the solution is to either find or implement a sort function that moves the values so that memory remains contiguous and gives you the speedup, or push the values from the sorted array into a second array so that the new array has contiguous memory.

John Garrard
  • 167
  • 4
1

Assuming you MEANT to not time the actual sort, since your code doesn't time that action, it's difficult to assess any true performance difference because you've filled the array with random data. This means that one pass might have MANY more values greater than or equal to 127 (and thus running an additional command) then another pass. To really compare the two, fill your array with an identical set of fixed data. Otherwise, you'll never know if the random fill is causing the time differences you're seeing.

Sgt AJ
  • 790
  • 5
  • 12