0

Maybe I'm not looking/searching for the right keywords (I couldn't find a solution).

I am trying to calculate the median for a list of numbers (which is continually updated) in a space efficient way.

For calculating the mean there is a nice method by memorizing the number of elements in a list and weighting the old mean. For example (pseudocode):

// Initialize values
noList   = [8,10,4,6]
mean     = 0
noItems  = 0

// Now we want to update the mean continually with further values.
for (value : noList) {
  mean    = (noItems / (noItems + 1)) * mean + (1 / (noItems + 1)) * value
  noItems = noItems + 1
}

// After iteration 1: wholeList = [8]       ; mean = 8   ; noItems = 1
// After iteration 2: wholeList = [8,10]    ; mean = 9   ; noItems = 2
// After iteration 3: wholeList = [8,10,4]  ; mean = 7.33; noItems = 3
// After iteration 4: wholeList = [8,10,4,6]; mean = 7   ; noItems = 4

Question: Is there a similar (space-efficient) method to calculate the median?

UPDATED I updated the question (thanks to @WillemVanOnsem). I' not only looking for continually updating the median, but also a space-efficient way. According to his hint, we can keep two datastructures.

Example:

// 1) We have a list for which we want to find the median.
noList   = [9,10,4,6,13,12]

// 2) We devide it into two list or datastructures (additionally we sort it).
smallerList = [4,6,9]
biggerList  = [10,12,13]

// 3) Both list have the same length, so the median is between the last element of smallerList und the first element of biggerList.
median = (9 + 10) / 2 = 9.5

// 4) Next, we add a further element and want to update our median.
// We add the number 5 to our datastructures. So the new list is:
noList   = [9,10,4,6,13,12,5]

// 5) Obviously 5 is smaller than our current median of 9.5. So we insert it in a sorted way into smallerList:
smallerList = [4,5,6,9]
biggerList  = [10,12,13]

// 6) Now length(smallerList) > length(biggerList), So, we know, that the updated median should be the last element of smallerList.
median = 9

// 7) Next, we add a further element and want to update our median.
// We add the number 2 to our datastructures. So the new list is:
noList   = [9,10,4,6,13,12,5,2]

// 8) Obviously 2 is smaller than our current median of 9. So we insert it again in a sorted way into smallerList:
smallerList = [2,4,5,6,9]
biggerList  = [10,12,13]

// 9) Now the length of smallerList is much bigger than the length of biggerList and we need to "balance" our list by taking one element from one list and inserting it into the other list.
// We remove the element 9 from smallerList and insert it into biggerList.
smallerList = [2,4,5,6]
biggerList  = [9,10,12,13]

// 10) Both list have the same length, so the median is between the last element of smallerList und the first element of biggerList.
median = (6 + 9) / 2 = 7.5

Hope, this makes it clear. I guess, this was your hint (@WillemVanOnsem).

Yeah, this might answer my initial question... but the problem with this solution is, that both lists (smallerList and biggerList) might grow to considerable size. Let's say we have a stream of 10^18 numbers and we want to find the median for all numbers without getting out of memory. How to solve this problem in a space-efficient way?

road42
  • 77
  • 5
  • 1
    Yes, you keep two datastructures, elements that are smaller, and elements that are bigger, those lists should be in harmony (same number of elements). If one gets bigger, you shift the current median to the other list, and pick one from the former. – Willem Van Onsem Jun 16 '19 at 18:05
  • Thanks, great hint (@WillemVanOnsem)... I updated my initial question. I am also looking for a space-efficient solution. – road42 Jun 16 '19 at 19:04
  • Possible duplicate of [Find median value from a growing set](https://stackoverflow.com/questions/1387497/find-median-value-from-a-growing-set) – josejuan Jun 16 '19 at 19:27
  • My first impression is you have to store all the elements to do an on-line median. As opposed to the mean, it can be highly non-linear. If you bound the size, I guess you could forget a few. – Neil Jun 16 '19 at 19:49
  • I guess, I cannot get around storing all the elements (I hope I'm wrong). Larry Denenberg in the thread posted by @josejuan explains some ways how to find the median in his draft paper (http://denenberg.com/omf.pdf). Unfortunately it's only an approximation. I'm cheking it out now. – road42 Jun 16 '19 at 20:14
  • 1
    Are the numbers integers? What are the maximum and minimum possible values? – samgak Jun 16 '19 at 21:04

1 Answers1

3

There is no way to do this without remembering all the numbers you've seen, because at any point, any of the numbers you've seen in the past could become the median in the future.

If you've seen n numbers so far, then for any i, the ith smallest one of them could become the median:

  • If i > n/2, then it will happen if the next 2i - n numbers are larger.

  • If i <= n/2, then it will happen if the next n - 2i + 1 numbers are smaller.

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87