I was looking for a quick way to calculate the median of a list of numbers and came across this:
function array_median($array) {
// perhaps all non numeric values should filtered out of $array here?
$iCount = count($array);
if ($iCount == 0) {
return null;
}
// if we're down here it must mean $array
// has at least 1 item in the array.
$middle_index = floor($iCount / 2);
sort($array, SORT_NUMERIC);
$median = $array[$middle_index]; // assume an odd # of items
// Handle the even case by averaging the middle 2 items
if ($iCount % 2 == 0) {
$median = ($median + $array[$middle_index - 1]) / 2;
}
return $median;
}
This approach using sort()
makes sense and is certainly the obvious approach. However, I was curious if a median heap would be faster. What was surprising was that when I implemented a simple median heap it is consistently significantly slower than the above method.
My simple MedianHeap class:
class MedianHeap{
private $lowerHeap;
private $higherHeap;
private $numbers = [];
public function __construct($numbers = null)
{
$this->lowerHeap = new SplMaxHeap();
$this->higherHeap = new SplMinHeap();
if (count($numbers)) {
$this->insertArray($numbers);
}
}
public function insertArray ($numbers) {
foreach($numbers as $number) {
$this->insert($number);
}
}
public function insert($number)
{
$this->numbers[] = $number;
if ($this->lowerHeap->count() == 0 || $number < $this->lowerHeap->top()) {
$this->lowerHeap->insert($number);
} else {
$this->higherHeap->insert($number);
}
$this->balance();
}
protected function balance()
{
$biggerHeap = $this->lowerHeap->count() > $this->higherHeap->count() ? $this->lowerHeap : $this->higherHeap;
$smallerHeap = $this->lowerHeap->count() > $this->higherHeap->count() ? $this->higherHeap : $this->lowerHeap;
if ($biggerHeap->count() - $smallerHeap->count() >= 2) {
$smallerHeap->insert($biggerHeap->extract());
}
}
public function getMedian()
{
if (!count($this->numbers)) {
return null;
}
$biggerHeap = $this->lowerHeap->count() > $this->higherHeap->count() ? $this->lowerHeap : $this->higherHeap;
$smallerHeap = $this->lowerHeap->count() > $this->higherHeap->count() ? $this->higherHeap : $this->lowerHeap;
if ($biggerHeap->count() == $smallerHeap->count()) {
return ($biggerHeap->top() + $smallerHeap->top())/2;
} else {
return $biggerHeap->top();
}
}
}
And then the code to benchmark:
$array = [];
for($i=0; $i<100000; $i++) {
$array[] = mt_rand(1,100000) / mt_rand(1,10000);
}
$t = microtime(true);
echo array_median($array);
echo PHP_EOL . 'Sort Median: ' . (microtime(true) - $t) . ' seconds';
echo PHP_EOL;
$t = microtime(true);
$list = new MedianHeap($array);
echo $list->getMedian();
echo PHP_EOL . 'Heap Median: '. (microtime(true) - $t) . ' seconds';
Is there something in PHP that makes using heaps for this inefficient somehow or is there something wrong with my implemenation?