
Trying to find an efficient way to obtain the top N items in a very large list, possibly containing duplicates.

I first tried sorting & slicing, which works. But this seems unnecessary: you shouldn't need to sort a very large list if you just want the top 20 members. So I wrote a recursive routine which builds the top-n list. This also works, but is much slower than the non-recursive one!

Question: Why is my second routine (elite2) so much slower than elite, and how do I make it faster? My code is attached below. Thanks.

import scala.collection.SeqView
import scala.math.min
object X {

    def elite(s: SeqView[Int, List[Int]], k: Int): List[Int] = {
        s.sorted.reverse.force.slice(0,min(k,s.size))
    }

    def elite2(s: SeqView[Int, List[Int]], k:Int, s2:List[Int]=Nil):List[Int] = {
        if( k == 0 || s.size == 0) s2.reverse
        else {
            val m = s.max
            val parts = s.force.partition(_==m)
            val whole = if( parts._1.size > 1) parts._1.tail:::parts._2 else parts._2
            elite2( whole.view, k-1, m::s2 )
        }
    }

    def main(args:Array[String]) = {
        val N = 1000000/3
        val x = List(N to 1 by -1).flatten.map(x=>List(x,x,x)).flatten.view
        println(elite2(x,20))
        println(elite(x,20))
    }
}
Don Roby
k r
  • Note that if you want to take the first n elements it is better to use `.take(n)` – om-nom-nom Nov 25 '11 at 22:23
  • Am not interested in the first n elements. Interested in the top n. So { 4,3,2,5,7,1}.take(2) gives {4,3}. I want { 7,5} which are the top 2. – k r Nov 25 '11 at 22:27
  • I talked about `.slice(0,min(k,s.size))` in `elite` – om-nom-nom Nov 25 '11 at 22:30
  • Oh, you want me to use take as opposed to slice. Fine. Sorry about the confusion. – k r Nov 25 '11 at 22:31
  • Here http://stackoverflow.com/q/5674741/312172 is a similar thread – user unknown Nov 26 '11 at 02:26
  • I'm not sure how to go about doing this in Scala, but you could perform a lazy Mergesort, and only evaluate the portion of the sorted list that you need. – Dan Burton Nov 26 '11 at 08:09
  • Regarding what I said before, I actually meant "quicksort". Something like this: http://stackoverflow.com/q/2692084/208257 – Dan Burton Nov 26 '11 at 08:21
  • You can also do this fairly concisely (but still efficiently) with a priority queue: http://stackoverflow.com/q/7792837/334519 – Travis Brown Nov 26 '11 at 13:15
  • possible duplicate of [Optimal algorithm for returning top k values from an array of length N](http://stackoverflow.com/questions/4956593/optimal-algorithm-for-returning-top-k-values-from-an-array-of-length-n) – Daniel C. Sobral Dec 05 '11 at 16:32

5 Answers


The classic algorithm is called QuickSelect. It is like QuickSort, except you only recurse into the partition that contains the element you are looking for, so it ends up being O(n) on average.
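A minimal Scala sketch of the idea (my own illustration, not code from the answer; the names `quickSelect` and `topM` are mine, a random pivot is used, and `1 <= k <= xs.length` is assumed):

```scala
import scala.util.Random

// Partition around a random pivot and recurse only into the side that
// contains the k-th largest element; expected O(n) time overall.
def quickSelect(xs: Vector[Int], k: Int): Int = {
  val pivot = xs(Random.nextInt(xs.length))
  val (larger, rest) = xs.partition(_ > pivot)
  val equal = rest.filter(_ == pivot)
  if (k <= larger.length) quickSelect(larger, k)
  else if (k <= larger.length + equal.length) pivot
  else quickSelect(rest.filter(_ < pivot), k - larger.length - equal.length)
}

// Top m: select the m-th largest once, then one O(n) pass to collect,
// taking only as many pivot-equal elements as still needed.
def topM(xs: Vector[Int], m: Int): Vector[Int] = {
  val threshold = quickSelect(xs, m)
  val above = xs.filter(_ > threshold)
  above ++ xs.filter(_ == threshold).take(m - above.length)
}
```

As the comments below note, one QuickSelect call for the m-th element suffices; the final collection pass is then linear.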

Has QUIT--Anony-Mousse
  • So to find the `m` largest elements, you would perform QuickSelect once for each one, rendering a `O(mn)` algorithm. You could squeeze some extra performance out of it by selecting multiple elements at the same time. – Dan Burton Nov 26 '11 at 08:02
  • No, actually you would run QuickSelect for the `m`th one only. Then the first `m` Elements are your result. If you want them to be ordered fully, you can then sort the first `m-1` Elements completely. This takes expected `O(n + m log m)` total. – Has QUIT--Anony-Mousse Nov 26 '11 at 08:55
  • Ah yes quite right. Once you have the `m`th item, you can partition the original list with that item as the pivot, which is only O(n) extra time, assuming it hasn't been implicitly done already, which it probably has. – Dan Burton Nov 26 '11 at 09:24
  • Actually QuickSelect does the partitioning already (at least when implemented the Hoare way, as a partial QuickSort). It ensures the invariant `A<=m<=B`, so when having found the final `m`th element, I also know that the elements before must all be smaller, the ones behind all larger. QuickSelect is like QuickSort, except it only makes sure the `m`th largest element is at the `m`th position and this invariant holds. It doesn't provide guarantees about the order within the sublists A and B. – Has QUIT--Anony-Mousse Nov 26 '11 at 09:31

Don't overestimate how big log(M) is for a large list of length M. For a list containing a billion items, log(M) is only 30, so sorting and taking is not such an unreasonable method after all. In fact, sorting an array of integers is far faster than sorting a list (and the array takes less memory too), so I would say your best (brief) bet is the following (which is safe for short or empty lists thanks to takeRight):

val arr = s.toArray
java.util.Arrays.sort(arr)
arr.takeRight(N).toList

There are various other approaches one could take, but the implementations are less straightforward. You could use a partial quicksort, but you have the same problems with worst-case scenarios that quicksort does (e.g. if your list is already sorted, a naive algorithm might be O(n^2)!). You could save the top N in a ring buffer (array), but that would require O(log N) binary search every step as well as O(N/4) sliding of elements--only good if N is quite small. More complex methods (like something based upon dual pivot quicksort) are, well, more complex.

So I recommend that you try array sorting and see if that's fast enough.

(Answers differ if you're sorting objects instead of numbers, of course. But if your comparison can always be reduced to a number, you can s.map(x => /* convert element to corresponding number */).toArray, take the winning scores, and run through the list again, counting off how many of each score you still need as you find them; it's a bit of bookkeeping, but it doesn't slow things down much except for the map.)
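That bookkeeping might look something like this sketch (`Item`, its fields, and `topByScore` are hypothetical names of mine, not from the answer):

```scala
import java.util.Arrays
import scala.collection.mutable

case class Item(name: String, score: Int)

def topByScore(items: List[Item], n: Int): List[Item] = {
  // Sort the numeric scores as a primitive array, keep the top n.
  val scores = items.map(_.score).toArray
  Arrays.sort(scores)
  val winning = scores.takeRight(n)

  // Count how many items of each winning score we need.
  val needed = mutable.Map[Int, Int]().withDefaultValue(0)
  winning.foreach(s => needed(s) += 1)

  // Second pass over the original list: collect items whose score
  // is still owed, decrementing the count as we go.
  val buf = mutable.ListBuffer[Item]()
  items.foreach { it =>
    if (needed(it.score) > 0) { needed(it.score) -= 1; buf += it }
  }
  buf.toList
}
```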

Rex Kerr

Unless I'm missing something, why not just traverse the list and pick the top 20 as you go? So long as you keep track of the smallest element of the top 20 there should be no overhead except when adding to the top 20, which should be relatively rare for a long list. Here's an implementation:

  def topNs(xs: TraversableOnce[Int], n: Int) = {
    var ss = List[Int]()
    var min = Int.MaxValue
    var len = 0
    xs foreach { e =>
      if (len < n || e > min) {
        ss = (e :: ss).sorted
        min = ss.head
        len += 1
      }
      if (len > n) {
        ss = ss.tail
        min = ss.head
        len -= 1
      }                    
    }
    ss
  }  

(edited because I originally used a SortedSet not realising you wanted to keep duplicates.)

I benchmarked this for a list of 100k random Ints, and it took on average 40 ms. Your elite method takes about 850 ms and your elite2 method takes about 4100 ms. So this is over 20× quicker than your fastest.

Luigi Plinge
  • I like your solution and will use it instead of elite2. But I am still curious why elite2 is so slow. The algorithm itself is a very simple recursive routine – k r Nov 26 '11 at 02:25
  • @Krishnan Well, with your `elite2` method, its running-time will increase linearly with the number of items you want to extract, and on each iteration it traverses the whole list to assess its size (so try a different data structure that remembers its size), then traverses it again to partition it, creating two new large lists, then creates another list with the duplicates mixed back in. Why not take as many of them as you need, instead of just one? Anyway, basically it's just not a very efficient way of doing it. – Luigi Plinge Nov 26 '11 at 02:43
  • On the benchmark times: these are for 10 executions under a 32-bit JVM, taking 10 from a 100k-long list. Also for comparison, sorting the whole 100k list as per Rex's answer takes 190 ms. – Luigi Plinge Nov 26 '11 at 03:05
  • Instead of `ss = (e :: ss).sorted`, isn't there something like `ss = ss insert e` that can take advantage of the fact that `ss` is already a sorted list? (I'm thinking of Haskell's [Data.List.insert](http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-List.html#v:insert)) – Dan Burton Nov 26 '11 at 08:15
  • @DanBurton That would certainly be a good optimisation, but no collection type that supports this came to mind. Maybe a `DoubleLinkedList`? I might give that a try later. – Luigi Plinge Nov 26 '11 at 08:40
  • @LuigiPlinge: I was thinking just regular old (immutable, singly-linked) List, similar to the Haskell. Something like `insert(x: A, xs: List[A]){ if (compare(x, xs.head) <= 0) (x :: xs) else (xs.head :: insert(x, xs.tail)) }` (throw in the implicit ordering and handling the empty list) – Dan Burton Nov 26 '11 at 08:48
  • The performance of this solution is unimpressive for already-sorted or nearly-sorted lists. If the input is known to be random, then it's a good strategy. – Rex Kerr Nov 26 '11 at 16:09

I wanted a version that was polymorphic, and that also allowed composition over a single iterator. For instance, what if you wanted both the top largest and the top smallest elements while reading from a file? Here is what I came up with:

    import util.Sorting.quickSort

    class TopNSet[T](n: Int)(implicit ev: Ordering[T], ev2: ClassManifest[T]) {
      val ss = new Array[T](n)
      var len = 0

      def tryElement(el: T) = {
        if (len < n - 1) {             // still filling the buffer
          ss(len) = el
          len += 1
        } else if (len == n - 1) {     // buffer now full: sort it once
          ss(len) = el
          len = n
          quickSort(ss)
        } else if (ev.gt(el, ss(0))) { // beats the current minimum
          ss(0) = el
          quickSort(ss)
        }
      }
      def getTop() = ss.slice(0, len)
    }

Evaluating compared to the accepted answer:

val myInts = Array.fill(100000000)(util.Random.nextInt)
time(topNs(myInts, 100))
//Elapsed time 3006.05485 msecs
val myTopSet = new TopNSet[Int](100)
time(myInts.foreach(myTopSet.tryElement(_)))
//Elapsed time 4334.888546 msecs

So: not much slower, and certainly a lot more flexible.

Tom Vacek

Here's pseudocode for the algorithm I'd use:

selectLargest(n: Int, xs: List): List
  if size(xs) <= n
     return xs
  pivot <- selectPivot(xs)
  (lt, gt) <- partition(xs, pivot)
  if size(gt) == n
     return gt
  if size(gt) < n
     return append(gt, selectLargest(n - size(gt), lt))
  if size(gt) > n
     return selectLargest(n, gt)

selectPivot would use some technique to select a "pivot" value for partitioning the list. partition would split the list into two: lt (elements smaller than the pivot) and gt (elements greater than the pivot). Of course, you'd need to throw elements equal to the pivot into one of those groups, or else handle that group separately. It doesn't make a big difference, as long as you remember to handle that case somehow.

Feel free to edit this answer, or post your own answer, with a Scala implementation of this algorithm.
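Here is one possible Scala rendering of the pseudocode (a sketch of my own; it uses a random pivot and handles the pivot-equal group separately, as suggested above, which also guards against looping forever on a list of all-equal elements):

```scala
import scala.util.Random

def selectLargest(n: Int, xs: List[Int]): List[Int] = {
  if (xs.size <= n) xs
  else {
    val pivot = xs(Random.nextInt(xs.size))
    val gt = xs.filter(_ > pivot) // strictly greater: never contains the pivot
    if (gt.size >= n) selectLargest(n, gt)
    else {
      val eq = xs.filter(_ == pivot)
      if (gt.size + eq.size >= n)
        gt ::: eq.take(n - gt.size) // pivot-equal elements fill the remainder
      else
        gt ::: eq ::: selectLargest(n - gt.size - eq.size, xs.filter(_ < pivot))
    }
  }
}
```

Each recursive call operates on a strictly smaller list (the pivot itself is excluded from both `gt` and the `_ < pivot` remainder), so termination is guaranteed even with duplicates.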

Dan Burton