0

For a sequence of things where the first element constitutes the key:

val things = Seq(("key_1", ("first", 1)),("key_1", ("first_second", 11)), ("key_2", ("second", 2)))

I want to count how often a key occurs and then only keep the top-k elements.

In pandas or a database I would:

  1. count
  2. join the result to the original and filter

In Scala, the first part can be handled by:

things.groupBy(identity).mapValues(_.size)

The first bit here is:

things.groupBy(_._1).mapValues(_.map( _._2 ))

But I am not sure about the second step. In the case of the example above when looking at the top-1 keys key_1 occurs twice and is selected, therefore. The desired outputted results are second elements of the top-k key tuples:

Seq(("first", 1),("first_second", 11))

edit

I need a solution which works for 2.11.x.

Georg Heiler
  • 16,916
  • 36
  • 162
  • 292

2 Answers2

1

This approach first groups by the keys to get a map of the keys to original items.

You can also use an OrderedMap or PriorityQueue for more efficient top-N calculation, but if there aren't many elements, then a simple sortBy would work, too, as shown.

def valuesOfNMostFrequentKeys(things: Seq[(String, (String, Int))], N: Int = 1) = {
    val grouped: Map[String,Seq[(String, (String, Int))]] = things.groupBy(_._1)

    // "map" array of counts per keys to KV Tuples 
    val countToTuples:Array[(Int, Seq[(String, (String, Int))])]  = grouped.map((kv: (String, Seq[(String, (String, Int))])) => (kv._2.size, kv._2)).toArray
    // sort by count (first item in tuple) descending and take top N
    val sortByCount:Array[(Int, Seq[(String, (String, Int))])] = countToTuples.sortBy(-_._1)
    val topN:Array[(Int, Seq[(String, (String, Int))])] = sortByCount.take(N)

    // extract inner (String, Int) item from list of keys and values, and flatten
    topN.flatMap((kvList: (Int, Seq[(String, (String, Int))])) => kvList._2.map(_._2))
}

valuesOfNMostFrequentKeys(things)

output:

valuesOfNMostFrequentKeys: (things: Seq[(String, (String, Int))], N: Int)Array[(String, Int)]
res44: Array[(String, Int)] = Array((first,1), (first_second,11))

Note above is an Array and you may want to do toSeq -- but this works in Scala 2.11.

ELinda
  • 2,658
  • 1
  • 10
  • 9
  • The thought of using `PriorityQueue` sounds really good. Do you want to add it to the example? – Georg Heiler Feb 23 '20 at 06:54
  • In fact https://github.com/apache/spark/blob/4193d2f4cc2bb100625b073e3a2e8599c3b4cb7c/core/src/main/scala-2.12/org/apache/spark/util/BoundedPriorityQueue.scala might be even better to bound the priority queue – Georg Heiler Feb 26 '20 at 06:18
  • Using the priority queue or quicksort would replace the "sortBy" line. It is a bit involved, and I wouldn't do it myself unless there were many items to sort, otherwise, and (top) N is much less than this number. Here is an example https://stackoverflow.com/questions/5674741/simplest-way-to-get-the-top-n-elements-of-a-scala-iterable and it relies on that the initialization of a PQ from an unordered collection is still linear time complexity (see proof https://www.geeksforgeeks.org/time-complexity-of-building-a-heap/) – ELinda Feb 26 '20 at 19:06
  • You mean a: https://www.javadoc.io/doc/org.scala-lang/scala-library/2.11.8/scala/util/Sorting$.html ? What is the default implementation of the `sortBy`? – Georg Heiler Feb 27 '20 at 13:50
0

It looks like:

things.groupBy(_._1)
  .mapValues(e => (e.map(_._2).size, e.map(_._2))).toSeq.map(_._2)
  .sortBy(_._1).reverse.take(2).flatMap(_._2)

computes the desired outputs

Georg Heiler
  • 16,916
  • 36
  • 162
  • 292