Scala: How to get the Top N elements of an Iterable with Grouping (or Binning)

Question

I have used the solution mentioned here to get the top n elements of a Scala Iterable, efficiently.

End example:

scala> val li = List (4, 3, 6, 7, 1, 2, 9, 5)
li: List[Int] = List(4, 3, 6, 7, 1, 2, 9, 5)

scala> top (2, li)
res0: List[Int] = List(2, 1)

Now, suppose I want to get the top n elements with a lower resolution. The range of integers may somehow be divided/binned/grouped to sub-ranges such as modulo 2: {0-1, 2-3, 4-5, ...}, and in each sub-range I do not differentiate between integers, e.g. 0 and 1 are all the same to me. Therefore, the top element in the above example would still be 1, but the next element would either be 2 or 3. More clearly these results are equivalent:

scala> top (2, li)
res0: List[Int] = List(2, 1)

scala> top (2, li)
res0: List[Int] = List(3, 1)

How do I change this nice function to fit these needs?
Is my intuition correct and this sort should be faster? Since the sort is on the bins/groups, then taking all or some of the elements of the bins with no specific order until we get to n elements.

Comments:

The binning/grouping is something simple and fixed like modulo k, doesn't have to be generic like allowing different lengths of sub-ranges
Inside each bin, assuming we need only some of the elements, we can just take first elements, or even some random elements, doesn't have to be some specific system.

So you can scan the list and keep only the first element for any particular bin? In which case it will be faster if there are multiple original elements in at least some bins, and the test for bin-membership (and testing whether we already have an entry for a particular bin) doesn't exceed the time saved from doing topN of a smaller array. If you can enumerate the bins beforehand, you can use radix sort too — The Archetypal Paul, Jan 26 '16 at 19:42
No, the requirement is for n elements, not necessarily 1 element from every bin. If n = 5 for example, and the first bin in order has 2 elements in the list, both of these elements should be in the top 5, only then we go to the next bin. If it has 3 elements in the list we're done. — Giora Simchoni, Jan 26 '16 at 19:54
Ah, OK. That wasn't clear. So you can't discard any elements due to "binning", so it won't be much faster. It might be slightly faster, because you are comparing less than N items sometimes to know if the current one needs to be kept or not, but then there's the complexity of assigning to bins.... You have to be prepared to handle N bins in your current "top N" array, just in case there's only one element in each of the final result bins, so certainly the worst case is slightly slower, too. — The Archetypal Paul, Jan 26 '16 at 20:52
Basically, you're just changing the comparison function to not compare values directly, but compare the values after mapping an element to one value that's the same for all elements in the bin. — The Archetypal Paul, Jan 26 '16 at 20:58

som-snytt · Accepted Answer · 2016-01-27T10:32:08.777

1

Per the comment, you're just changing the comparison.

In this version, 4 and 3 compare equal and 4 is taken first.

object Firstly extends App {

  def firstly(taking: Int, vs: List[Int]) = {
    import collection.mutable.{ SortedSet => S }

    def bucketed(i: Int) = (i + 1) / 2

    vs.foldLeft(S.empty[Int]) { (s, i) =>
      if (s.size < taking) s += i
      else if (bucketed(i) >= bucketed(s.last)) s
      else {
        s += i
        s -= s.last
      }
    }
  }

  assert(firstly(taking = 2, List(4, 6, 7, 1, 9, 3, 5)) == Set(4, 1))
}

Edit: example of sorting buckets instead of keeping sorted "top N":

scala> List(4, 6, 7, 1, 9, 3, 5).groupBy(bucketed).toList.sortBy {
     | case (i, vs) => i }.flatMap {
     | case (i, vs) => vs }.take(5)
res10: List[Int] = List(1, 4, 3, 6, 5)

scala> List(4, 6, 7, 1, 9, 3, 5).groupBy(bucketed).toList.sortBy {
     | case (i, vs) => i }.map {
     | case (i, vs) => vs.head }.take(5)
res11: List[Int] = List(1, 4, 6, 7, 9)

Not sure which result you prefer, of the last two.

As to whether sorting buckets is better, it depends how many buckets.

edited Jan 27 '16 at 10:32

answered Jan 26 '16 at 22:02

som-snytt

39,429
2
47
129

Works, thanks. However I'm concerned there is no usage here of the fact that sorting needs only occur on buckets, we iterate over the entire `vs`. Imagine the list length to be 1M, and no. of buckets is 100, and we want the top 5 elements. Surely we just need to bucket the 1M elements into 100 buckets, sort the *buckets*, get all elements in bucket1, then bucket2, until we get to 100 elements (we do not iterate over elements inside a given bucket). – Giora Simchoni Jan 27 '16 at 08:37
Correction: until we get to 5 elements – Giora Simchoni Jan 27 '16 at 09:19
1

It's not obvious that sorting 100 buckets is cheaper than keeping a sorted set of five elements. It's especially not obvious if the cost is multiple traversals. – som-snytt Jan 27 '16 at 10:36

score 0 · Answer 2 · answered Jan 26 '16 at 18:34

0

How about mapping with integer division before using the original algorithm?

def top(n: Int, li: List[Int]) = li.sorted.distinct.take(n)

val li = List (4, 3, 6, 7, 1, 2, 9, 5)

top(2, li) // List(1, 2)

def topBin(n: Int, bin: Int, li: List[Int]) =
  top(n, li.map(_ / bin))  // e.g. List(0, 1)
  .map(i => (i * bin) until ((i + 1) * bin))

topBin(2, 2, li)  // List(0 to 1, 2 to 3)

answered Jan 26 '16 at 18:34

0__

66,707
21
171
266

Needs a bit work, the `topBin` function should not return a list of Ranges, but a list of n actual Ints from the `li` List, like in the original `top` function. Thanks! – Giora Simchoni Jan 26 '16 at 19:24

Scala: How to get the Top N elements of an Iterable with Grouping (or Binning)

2 Answers2