Background
I was looking into the statistics.median()
function (link) in the Standard Library and decided to see how it was implemented in the source code. To my surprise, the median is calculated by sorting the entire data set and returning the "middle value".
Example:
```python
def median(data):
    data = sorted(data)
    n = len(data)
    if n == 0:
        raise StatisticsError("no median for empty data")
    if n % 2 == 1:
        return data[n // 2]
    i = n // 2
    return (data[i - 1] + data[i]) / 2
```
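A quick sanity check of the sorted-based approach against the stdlib itself, showing both the even- and odd-length cases:

```python
import statistics

# Even-length input: the median is the mean of the two middle values.
print(statistics.median([7, 1, 3, 5]))     # -> 4.0
# Odd-length input: the median is the single middle value.
print(statistics.median([7, 1, 3, 5, 2]))  # -> 3
```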
This is a fine implementation for smaller data sets, but it can be costly for large ones: sorting the whole list takes O(n log n) time, while a selection algorithm can find the median in O(n) expected time.
So, I went through numerous sources and decided that the algorithm designed by Floyd and Rivest (link) would be the best for finding the median. Some of the other algorithms I saw are:
- Quickselect
- Introselect
I chose the Floyd-Rivest algorithm because it has an excellent expected number of comparisons and seems resistant to adversarial inputs such as the median-of-3 killer sequence.
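For comparison, here is a minimal quickselect sketch (not part of the post's code; the random pivot is one common way to make adversarial inputs like the killer sequence unlikely to hurt, at the cost of only an expected O(n) bound):

```python
import random

def quickselect(data, k):
    """Return the k-th smallest element (0-indexed) of data.

    Expected O(n) time; the random pivot makes adversarial
    inputs (e.g. median-of-3 killers) unlikely to degrade it.
    """
    seq = list(data)
    left, right = 0, len(seq) - 1
    while left < right:
        pivot = seq[random.randint(left, right)]
        i, j = left, right
        # Hoare partition: afterwards seq[left:j+1] <= pivot <= seq[i:right+1].
        while i <= j:
            while seq[i] < pivot:
                i += 1
            while seq[j] > pivot:
                j -= 1
            if i <= j:
                seq[i], seq[j] = seq[j], seq[i]
                i += 1
                j -= 1
        if k <= j:
            right = j
        elif k >= i:
            left = i
        else:
            return seq[k]  # j < k < i: seq[k] equals the pivot
    return seq[left]
```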
Floyd-Rivest Algorithm
Python 3.10 with type hints
```python
from collections.abc import Iterable
from math import exp, log, sqrt


def sign(value: int | float) -> int:
    return bool(value > 0) - bool(value < 0)


def swap(sequence: list[int | float], x: int, y: int) -> None:
    sequence[x], sequence[y] = sequence[y], sequence[x]


def floyd_rivest(sequence: list[int | float], left: int, right: int, k: int) -> int | float:
    while right > left:
        if right - left > 600:
            # Narrow [left, right] to a sample-based interval around k.
            n: int = right - left + 1
            i: int = k - left + 1
            z: float = log(n)
            s: float = 0.5 * exp(2 * z / 3)
            sd: float = 0.5 * sqrt(z * s * (n - s) / n) * sign(i - n / 2)
            new_left: int = max(left, int(k - i * s / n + sd))
            new_right: int = min(right, int(k + (n - i) * s / n + sd))
            floyd_rivest(sequence, new_left, new_right, k)
        t: int | float = sequence[k]
        sliding_left: int = left
        sliding_right: int = right
        swap(sequence, left, k)
        if sequence[right] > t:
            swap(sequence, left, right)
        while sliding_left < sliding_right:
            swap(sequence, sliding_left, sliding_right)
            sliding_left += 1
            sliding_right -= 1
            while sequence[sliding_left] < t:
                sliding_left += 1
            while sequence[sliding_right] > t:
                sliding_right -= 1
        if sequence[left] == t:
            swap(sequence, left, sliding_right)
        else:
            sliding_right += 1
            swap(sequence, right, sliding_right)
        if sliding_right <= k:
            left = sliding_right + 1
        if k <= sliding_right:
            right = sliding_right - 1
    return sequence[k]


def median(data: Iterable[int | float]) -> int | float:
    sequence: list[int | float] = list(data)
    length: int = len(sequence)
    end: int = length - 1
    midpoint: int = end // 2
    if length % 2 == 1:
        return floyd_rivest(sequence, 0, end, midpoint)
    return (floyd_rivest(sequence, 0, end, midpoint)
            + floyd_rivest(sequence, 0, end, midpoint + 1)) / 2
```
Question
Apparently, the Floyd-Rivest algorithm does not perform as well with nondistinct data, for example, a list containing multiple 1s: [1, 1, 1, 1, 2, 3, 4, 5]. However, this was studied and seemingly solved by Krzysztof C. Kiwiel, who wrote a paper titled "On Floyd and Rivest's SELECT algorithm" and modified the algorithm to perform better with nondistinct data.
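I have not implemented Kiwiel's version, but the core change, as I understand it, is replacing the binary partition with a ternary (three-way) one, so the whole block of elements equal to the pivot is handled in a single pass. A minimal sketch of that partitioning idea (the function and names are my own, not Kiwiel's):

```python
def partition3(seq, left, right, pivot):
    """Dutch-national-flag partition of seq[left:right + 1].

    Rearranges the slice into  < pivot | == pivot | > pivot  and
    returns (a, b): the first and last index of the == pivot block.
    """
    i, lt, gt = left, left, right
    while i <= gt:
        if seq[i] < pivot:
            seq[i], seq[lt] = seq[lt], seq[i]
            i += 1
            lt += 1
        elif seq[i] > pivot:
            seq[i], seq[gt] = seq[gt], seq[i]
            gt -= 1
        else:
            i += 1
    return lt, gt
```

On [1, 1, 1, 1, 2, 3, 4, 5] with pivot 1, the equal block spans indices 0 through 3, so a selection for any k in that range can stop immediately instead of sliding one duplicate at a time.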
My question is how would I implement/code Kiwiel's modified Floyd-Rivest algorithm?
Extra Considerations
In Kiwiel's paper, they also mention a non-recursive version of the algorithm. If you feel inclined, an iterative version would be nice, to avoid overflowing the stack with deep recursion. I am aware that a stack can be mimicked explicitly, but if you can find a way to rewrite the algorithm so that it is elegantly iterative, that would be preferable.
Lastly, any input on speeding up the algorithm or on alternative ("better") algorithms is welcome! (I know NumPy has a median function, and I know languages such as C would be more performant, but I am looking for the best algorithm logic, not just generic speedups.)