Background
I was looking into the statistics.median()
function (link) in the Standard Library and decided to see how it was implemented in the source code. To my surprise, the median is calculated by sorting the entire data set and returning the "middle value".
Example:
```python
def median(data):
    data = sorted(data)
    n = len(data)
    if n == 0:
        raise StatisticsError("no median for empty data")
    if n % 2 == 1:
        return data[n // 2]
    i = n // 2
    return (data[i - 1] + data[i]) / 2
```
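A quick sanity check of the sorted-based approach against the stdlib itself, showing both the even- and odd-length cases:

```python
import statistics

# Even-length input: the median is the mean of the two middle values.
print(statistics.median([7, 1, 3, 5]))     # -> 4.0
# Odd-length input: the median is the single middle value.
print(statistics.median([7, 1, 3, 5, 2]))  # -> 3
```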
This is a fine implementation for smaller data sets, but it can be costly for large ones: sorting the whole list takes O(n log n) time, while a selection algorithm can find the median in O(n) expected time.
So, I went through numerous sources and decided that the algorithm designed by Floyd and Rivest (link) would be the best for finding the median. Some of the other algorithms I saw are:
- Quickselect
- Introselect
I chose the Floyd-Rivest algorithm because it has an excellent expected number of comparisons and seems resistant to adversarial inputs such as the median-of-3 killer sequence.
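For comparison, here is a minimal quickselect sketch (not part of the post's code; the random pivot is one common way to make adversarial inputs like the killer sequence unlikely to hurt, at the cost of only an expected O(n) bound):

```python
import random

def quickselect(data, k):
    """Return the k-th smallest element (0-indexed) of data.

    Expected O(n) time; the random pivot makes adversarial
    inputs (e.g. median-of-3 killers) unlikely to degrade it.
    """
    seq = list(data)
    left, right = 0, len(seq) - 1
    while left < right:
        pivot = seq[random.randint(left, right)]
        i, j = left, right
        # Hoare partition: afterwards seq[left:j+1] <= pivot <= seq[i:right+1].
        while i <= j:
            while seq[i] < pivot:
                i += 1
            while seq[j] > pivot:
                j -= 1
            if i <= j:
                seq[i], seq[j] = seq[j], seq[i]
                i += 1
                j -= 1
        if k <= j:
            right = j
        elif k >= i:
            left = i
        else:
            return seq[k]  # j < k < i: seq[k] equals the pivot
    return seq[left]
```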
Floyd-Rivest Algorithm
Python 3.10 with type hints
```python
from collections.abc import Iterable
from math import exp, log, sqrt


def sign(value: int | float) -> int:
    return bool(value > 0) - bool(value < 0)


def swap(sequence: list[int | float], x: int, y: int) -> None:
    sequence[x], sequence[y] = sequence[y], sequence[x]


def floyd_rivest(sequence: list[int | float], left: int, right: int, k: int) -> int | float:
    while right > left:
        if right - left > 600:
            # Narrow [left, right] to a sample-based interval around k.
            n: int = right - left + 1
            i: int = k - left + 1
            z: float = log(n)
            s: float = 0.5 * exp(2 * z / 3)
            sd: float = 0.5 * sqrt(z * s * (n - s) / n) * sign(i - n / 2)
            new_left: int = max(left, int(k - i * s / n + sd))
            new_right: int = min(right, int(k + (n - i) * s / n + sd))
            floyd_rivest(sequence, new_left, new_right, k)
        t: int | float = sequence[k]
        sliding_left: int = left
        sliding_right: int = right
        swap(sequence, left, k)
        if sequence[right] > t:
            swap(sequence, left, right)
        while sliding_left < sliding_right:
            swap(sequence, sliding_left, sliding_right)
            sliding_left += 1
            sliding_right -= 1
            while sequence[sliding_left] < t:
                sliding_left += 1
            while sequence[sliding_right] > t:
                sliding_right -= 1
        if sequence[left] == t:
            swap(sequence, left, sliding_right)
        else:
            sliding_right += 1
            swap(sequence, right, sliding_right)
        if sliding_right <= k:
            left = sliding_right + 1
        if k <= sliding_right:
            right = sliding_right - 1
    return sequence[k]


def median(data: Iterable[int | float]) -> int | float:
    sequence: list[int | float] = list(data)
    length: int = len(sequence)
    end: int = length - 1
    midpoint: int = end // 2
    if length % 2 == 1:
        return floyd_rivest(sequence, 0, end, midpoint)
    return (floyd_rivest(sequence, 0, end, midpoint)
            + floyd_rivest(sequence, 0, end, midpoint + 1)) / 2
```
Question
Apparently, the Floyd-Rivest algorithm does not perform as well with nondistinct data, for example, a list containing multiple 1s: [1, 1, 1, 1, 2, 3, 4, 5]. However, this was studied and seemingly solved by Krzysztof C. Kiwiel, who wrote a paper titled "On Floyd and Rivest's SELECT algorithm" and modified the algorithm to perform better with nondistinct data.
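I have not implemented Kiwiel's version, but the core change, as I understand it, is replacing the binary partition with a ternary (three-way) one, so the whole block of elements equal to the pivot is handled in a single pass. A minimal sketch of that partitioning idea (the function and names are my own, not Kiwiel's):

```python
def partition3(seq, left, right, pivot):
    """Dutch-national-flag partition of seq[left:right + 1].

    Rearranges the slice into  < pivot | == pivot | > pivot  and
    returns (a, b): the first and last index of the == pivot block.
    """
    i, lt, gt = left, left, right
    while i <= gt:
        if seq[i] < pivot:
            seq[i], seq[lt] = seq[lt], seq[i]
            i += 1
            lt += 1
        elif seq[i] > pivot:
            seq[i], seq[gt] = seq[gt], seq[i]
            gt -= 1
        else:
            i += 1
    return lt, gt
```

On [1, 1, 1, 1, 2, 3, 4, 5] with pivot 1, the equal block spans indices 0 through 3, so a selection for any k in that range can stop immediately instead of sliding one duplicate at a time.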
My question is how would I implement/code Kiwiel's modified Floyd-Rivest algorithm?
Extra Considerations
In Kiwiel's paper, they also mention a non-recursive version of the algorithm. If you feel inclined, an iterative version would be nice, to avoid overflowing the stack with deep recursion. I am aware that a stack can be mimicked explicitly, but if you can find a way to rewrite the algorithm so that it is elegantly iterative, that would be preferable.
Lastly, any input on speeding up the algorithm or on alternative ("better") algorithms is welcome! (I know NumPy has a median function, and I know languages such as C would be more performant, but I am looking for the best algorithm logic, not just generic speedups.)