1

I was looking for a quick way to calculate the median of a list of numbers and came across this:

function array_median($array) {
  // perhaps all non numeric values should filtered out of $array here?
  $iCount = count($array);
  if ($iCount == 0) {
    return null;
  }
  // if we're down here it must mean $array
  // has at least 1 item in the array.
  $middle_index = floor($iCount / 2);
  sort($array, SORT_NUMERIC);
  $median = $array[$middle_index]; // assume an odd # of items
  // Handle the even case by averaging the middle 2 items
  if ($iCount % 2 == 0) {
    $median = ($median + $array[$middle_index - 1]) / 2;
  }
  return $median;
}

This approach using sort() makes sense and is certainly the obvious approach. However, I was curious if a median heap would be faster. What was surprising was that when I implemented a simple median heap it is consistently significantly slower than the above method.

My simple MedianHeap class:

class MedianHeap{
private $lowerHeap;
private $higherHeap;

private $numbers = [];

public function __construct($numbers = null)
{
    $this->lowerHeap = new SplMaxHeap();
    $this->higherHeap = new SplMinHeap();

    if (count($numbers)) {
        $this->insertArray($numbers);   
    }
}
public function insertArray ($numbers) {
    foreach($numbers as $number) {
        $this->insert($number);
    }
}
public function insert($number)
{
    $this->numbers[] = $number;
    if ($this->lowerHeap->count() == 0 || $number < $this->lowerHeap->top()) {
        $this->lowerHeap->insert($number);
    } else {
        $this->higherHeap->insert($number);
    }
    $this->balance();
}
protected function balance()
{
    $biggerHeap = $this->lowerHeap->count() > $this->higherHeap->count() ? $this->lowerHeap : $this->higherHeap;
    $smallerHeap = $this->lowerHeap->count() > $this->higherHeap->count() ? $this->higherHeap : $this->lowerHeap;

    if ($biggerHeap->count() - $smallerHeap->count() >= 2) {
        $smallerHeap->insert($biggerHeap->extract());
    }
}
public function getMedian()
{
    if (!count($this->numbers)) {
        return null;
    }
    $biggerHeap = $this->lowerHeap->count() > $this->higherHeap->count() ? $this->lowerHeap : $this->higherHeap;
    $smallerHeap = $this->lowerHeap->count() > $this->higherHeap->count() ? $this->higherHeap : $this->lowerHeap;

    if ($biggerHeap->count() == $smallerHeap->count()) {
        return ($biggerHeap->top() + $smallerHeap->top())/2;
    } else {
        return $biggerHeap->top();
    }
}
}

And then the code to benchmark:

$array = [];

for($i=0; $i<100000; $i++) {
     $array[] = mt_rand(1,100000) / mt_rand(1,10000);
}


$t = microtime(true);
echo array_median($array);
echo PHP_EOL . 'Sort Median: ' . (microtime(true) - $t) . ' seconds';

echo PHP_EOL;

$t = microtime(true);
$list = new MedianHeap($array);
echo $list->getMedian();
echo PHP_EOL . 'Heap Median: '. (microtime(true) - $t) . ' seconds';

Is there something in PHP that makes using heaps for this inefficient somehow or is there something wrong with my implemenation?

TheGentleman
  • 2,324
  • 13
  • 17
  • The performance issue in 2nd solution, that you using array_push (`$this->numbers[] = $number;`) for every element which is slow. Try to allocate arrays once and then use indexes. – Oleg Imanilov Feb 19 '18 at 21:47
  • @Nosyara, are you suggesting that `$this->numbers[count($this->numbers)] = $number` would be faster? Can you point to some documentation that supports that? Regardless, `$this->numbers[] = $number` is not (according to the docs) equivalent to `array_push($this->numbers, $number)`. Also, it is trivial to alter the second solution to not use `$this->numbers` at all. It does not have any effect on the benchmarks. – TheGentleman Feb 19 '18 at 21:56
  • When you push new member into array (any language) you doing few things => new allocation/copy/update of size e.t.c. If you will initially create big enough array - you will only play with references to allocated memory. – Oleg Imanilov Feb 19 '18 at 22:02
  • That is fine if you aren't dealing with a running stream. But regardless, as I said, I can completely remove the `$numbers` array from this code and it has no affect on the benchmark numbers. – TheGentleman Feb 19 '18 at 22:04
  • Your benchmarking code doesn't take into account time for interpretation (pre-compiling), and probably a whole lot of other things. The sort is optimized, compiled library code, so it has a lot of advantages. I suggest you read up on benchmarking PHP. A good place to start is https://stackoverflow.com/q/8291366/56778. I'm not saying that it will show that your code is faster than the sort, but doing the benchmark correctly will give you a much more realistic picture. – Jim Mischel Feb 21 '18 at 21:29

0 Answers0