Efficient median calculation for small dataset in C++

Question

I have many (hundreds of thousands, m) sets of doubles d, ~5-10 (n, constant small) long. These doubles are essentially randomly distributed. I need to get the median of each set: because m is very large, we need to calculate the median pretty quickly...these sets are pretty small though, so I think that is going to play a significant role in choosing how to do the median. I know I can use nth_element to get the median in O(n) with the selection algorithm, which I know I'm not going to beat in complexity. However, because of the small constant n, I am probably looking for the method that simply has the smallest overhead.

I have found a bunch of different ways to do the median (below) but am just curios if anyone knows the "correct" method to use here.

Min max heaps (O(n) build time, constant access, probably too much overhead)

This question from 2010 Which may be out of date (new STL/Boost code may already implement this stuff), also focuses more on time complexity than overhead.

"Hundreds of thousands" is not "many" by any reasonable standard these days. If you have a couple hundred thousands of 5-10 tuples of doubles, the algorithm won't matter. A crude Python script iterating over the lines, doing a full sort and printing the middle element should do it in a couple of seconds. — TC1, Mar 27 '13 at 16:22
Depends on the latency you are looking for, we are shooting for a "real time" solution, sub 30-60 ms, so the "crude" techniques are not what we're shooting for...if time wasn't an option, I would just go with the n_th element cuz its easy. — IdeaHat, Mar 27 '13 at 16:29
With `m` data sets (and thereby `m` medians by-circumstance) you will be hard-pressed to find the medians in under O(m*N(m)) complexity, where N(m) is the average number of nodes per set. — WhozCraig, Mar 27 '13 at 16:29
Do you know how fast `nth_element` performs with your setup? How many times faster do you need to be? — Oliver Charlesworth, Mar 27 '13 at 16:32
@WhozCraig, Yup, I'm not trying to get smaller complexity, there are a bunch of O(N(m)) median techniques. Since N(m) is small and constant, I really need the median function with the lowest constant time overhead, not (necessarily) the smallest growth complexity. — IdeaHat, Mar 27 '13 at 16:34
Well `nth_element` will probably do the QSort based k-th statistic algorithm underneath. Anything using heaps etc means allocating additional space, which might in and of itself already add more overhead than you can afford. I'd say benchmark `nth_element`, you won't get much above that. — TC1, Mar 27 '13 at 16:35
@OliCharlesworth Good point, my current metric is "as fast as possible", though I may be prematurely optimizing. — IdeaHat, Mar 27 '13 at 16:35
@MadScienceDreams The added info about `N(m)` being a constant `n` (you have lottsa-lists, but they're all `n` elements long; this was not evident in the question as-stated) is important. In that you can choose a potentially unrolled algorithm based on that size. — WhozCraig, Mar 27 '13 at 16:38
@WhozCraig Unless it's constant for all invocations, the best option will probably be to create unrolled versions for each `n` and then invoke the appropriate one when needed. Judging by the requirements, any size checking might introduce branching / mess with CPU branch prediction. — TC1, Mar 27 '13 at 16:47
@TC1 I concur. It appears it *is* a small **fixed** constant by the comments and updated question info from the OP. So I think going that route (crafting an optimal unrolled algorithm) is probably the best he'll do. In that it is not significantly different than many fixed-block encryption or hashing schemes, which pretty-much *all* use this technique (albeit their reasoning is backed by *much* more complicated code). — WhozCraig, Mar 27 '13 at 16:56
@BoBTFish, naw the array of doubles that we need to get the median are populated on a "per m" iteration. — IdeaHat, Mar 27 '13 at 18:48

score 1 · Accepted Answer · edited May 23 '17 at 12:04

1

This may not scale well to your data sizes, but it's a code snippet I found (can't remember where) and use in my image processing functions to get the median of 9 unsigned char pixels.

// optimised median search on 9 values
#define PIX_SWAP(a, b) { unsigned char uTemp = (a); (a) = (b); (b) = uTemp; }
#define PIX_SORT(a, b) { if ((a) > (b)) PIX_SWAP((a), (b)); }

unsigned char GetMedian9(unsigned char *pNine)
{
    // nb - this is theoretically the fastest way to get the median of 9 values
    PIX_SORT(pNine[1], pNine[2]); PIX_SORT(pNine[4], pNine[5]); PIX_SORT(pNine[7], pNine[8]); 
    PIX_SORT(pNine[0], pNine[1]); PIX_SORT(pNine[3], pNine[4]); PIX_SORT(pNine[6], pNine[7]); 
    PIX_SORT(pNine[1], pNine[2]); PIX_SORT(pNine[4], pNine[5]); PIX_SORT(pNine[7], pNine[8]); 
    PIX_SORT(pNine[0], pNine[3]); PIX_SORT(pNine[5], pNine[8]); PIX_SORT(pNine[4], pNine[7]); 
    PIX_SORT(pNine[3], pNine[6]); PIX_SORT(pNine[1], pNine[4]); PIX_SORT(pNine[2], pNine[5]); 
    PIX_SORT(pNine[4], pNine[7]); PIX_SORT(pNine[4], pNine[2]); PIX_SORT(pNine[6], pNine[4]); 
    PIX_SORT(pNine[4], pNine[2]); return(pNine[4]);
}

#undef PIX_SWAP
#undef PIX_SORT

EDIT - Ok, it's also referenced in this answer too

edited May 23 '17 at 12:04

Community

1
1

answered Mar 27 '13 at 16:26

Roger Rowland

25,885
11
72
113

Isn't this what nth_element does? – IdeaHat Mar 27 '13 at 16:31
1

@MadScienceDreams - if so, then I can't see how you can do better. Might be worth some profiling though. Do you need the *exact* median? There are some algorithms for giving an approximation that *might* be faster. – Roger Rowland Mar 27 '13 at 16:33
Approx may be ok, especially if I have some sense of the probability of correct median...I'll start the search now for "approximate median", but do you have any quick search seeds? – IdeaHat Mar 27 '13 at 16:42
No, I stumbled across many "approximate median" algorithms but I needed exact. Also found a nice constant-time median filter for images - i.e. independent of window size, but we're off-topic with that. – Roger Rowland Mar 27 '13 at 16:44

score 0 · Answer 2 · answered Mar 27 '13 at 17:49

0

if it is std::set (you didnt answer to BoBTFish) then it is already sorted. Hence, you will get the median by iterating to n/2 which is always better or equal O(n), typically is should be O(ld n). n-th element will not help here.

answered Mar 27 '13 at 17:49

ogni42

1,232
7
11

Efficient median calculation for small dataset in C++

2 Answers2