11

Given the following problem:

"Store the largest 5000 numbers from a stream of numbers"

The solution which springs to mind is a binary search tree maintaining a count of the number of nodes in the tree and a reference to the smallest node once the count reaches 5000. When the count reaches 5000, each new number to add can be compared to the smallest item in the tree. If greater, the new number can be added then the smallest removed and the new smallest calculated (which should be very simple already having the previous smallest).

My concern with this solution is that the binary tree is naturally going to get skewed (as I'm only deleting on one side).

Is there a way to solve this problem which won't create a terribly skewed tree?

In case anyone wants it, I've included pseudo-code for my solution so far below:

process(number)
{
  if (count == 5000 && number > smallest.Value)
  {
    addNode( root, number)
    smallest = deleteNodeAndGetNewSmallest ( root, smallest)
  }
}

deleteNodeAndGetNewSmallest( lastSmallest)
{
  if ( lastSmallest has parent)
  {
    if ( lastSmallest has right child)
    {
      smallest = getMin(lastSmallest.right)
      lastSmallest.parent.right = lastSmallest.right
    }
    else
    {
      smallest = lastSmallest.parent
    }
  }
  else 
  {
    smallest = getMin(lastSmallest.right)
    root = lastSmallest.right
  }
  count--
  return smallest
}

getMin( node)
{
  if (node has left)
    return getMin(node.left)
  else
    return node
}

add(number)
{
  //standard implementation of add for BST
  count++
}
Rich
  • 3,781
  • 5
  • 34
  • 56
  • You can use a balanced search tree such as AVL or similar (https://en.wikipedia.org/wiki/AVL_tree). But a heap-based solution is more natural. –  Sep 13 '15 at 16:41

2 Answers2

38

The simplest solution for this is maintaining a min heap of max size 5000.

  • Every time a new number arrives - check if the heap is smaller then 5000, if it is - add it.
  • If it is not - check if the minimum is smaller then the new element, and if it is, pop it out and insert the new element instead.
  • When you are done - you have a heap containing 5000 largest elements.

This solution is O(nlogk) complexity, where n is the number of elements and k is the number of elements you need (5000 in your case).

It can be done also in O(n) using selection algorithm - store all the elements, and then find the 5001th largest element, and return everything bigger than it. But it is harder to implement and for reasonable size input - might not be better. Also, if stream contains duplicates, more processing is needed.

Marco A.
  • 43,032
  • 26
  • 132
  • 246
amit
  • 175,853
  • 27
  • 231
  • 333
  • 3
    +1 for selection algorithm. Just want to add: STL C++ has implementation for this. Not sure about Java, though. This method of offline computation, however, need O(n) space complexity. – nhahtdh May 25 '12 at 09:42
  • Excellent point about selection algorithm. OTOH this demands O(n) memory. – valdo Jul 14 '12 at 06:48
  • 5
    You can use the selection algorithm to find the top k elements but use only O(k) memory by storing an array of 2k elements. Every time you fill the array, use the selection algorithm to drop off the lowest k. This works out to taking O(n) time regardless of the value of k, since you're doing O(n / k) selection algorithms, each of which takes time O(k). It also only uses O(k) memory. – templatetypedef Sep 23 '12 at 21:28
  • @amit in this approach lets say we get 1 from the stream towards the beginning (let's assume the stream is of Natural Numbers only) then it goes to the top of the min-heap and stays there and once the heap size(5000) is reached there will be no more replacements for the rest of the stream, so the answer might be incorrect in the sense that it is not the top 5000 numbers . – redzedi Jan 30 '15 at 06:38
  • @redzedi It will be replaced when the next number comes, Assume you have a heap of size 5000, with 1 at the top, then a 5 comes. It checks if 5 is larger than the smallest element (1) - it does, so 1 is popped out, and 5 is inserted instead of it. Remember that it is a min heap, so 1 is the "head". – amit Jan 30 '15 at 08:34
1

Use a (minimum) priority queue. Add each incoming item to the queue and when the size reaches 5,000 remove the minimum (top) element every time you add an incoming element. The queue will contain the 5,000 largest elements and when the input stops, just remove the contents. This MinPQ is also called a heap but that is an overloaded term. Insertions and deletions take about log2(N). Where N maxes out at 5,000 this would be just over 12 [log2(4096) = 12] times the number of items you are processing.

An excellent source of info is Algorithms, (4th Edition) by Robert Sedgewick and Kevin Wayne. There is an excellent MOOC on coursera.org that is based on this text.