For a sequence of things where the first element constitutes the key:
val things = Seq(("key_1", ("first", 1)),("key_1", ("first_second", 11)), ("key_2", ("second", 2)))
I want to count how often each key occurs and then keep only the elements whose key is among the top-k most frequent keys.
In pandas or a database I would:
- count
- join the result to the original and filter
In Scala, the first part can be handled by:
things.groupBy(_._1).mapValues(_.size)
Collecting the values for each key works like this:
things.groupBy(_._1).mapValues(_.map( _._2 ))
But I am not sure about the second step, i.e. joining the counts back to the original and filtering.
In the example above, with k = 1, key_1 occurs twice and is therefore the selected key.
The desired output is the second elements of the tuples that belong to the top-k keys:
Seq(("first", 1),("first_second", 11))
Edit:
I need a solution that works for Scala 2.11.x.
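For reference, one possible way to express the count, join and filter steps with plain collection methods is sketched below. It uses only methods that should be available in Scala 2.11.x. The helper name topKValues is hypothetical, and the sketch assumes that ties between equally frequent keys may be broken arbitrarily.

```scala
// Sample data from the question: the first tuple element is the key.
val things = Seq(
  ("key_1", ("first", 1)),
  ("key_1", ("first_second", 11)),
  ("key_2", ("second", 2))
)

// Hypothetical helper: keep the values whose key is among the k most frequent keys.
def topKValues[K, V](xs: Seq[(K, V)], k: Int): Seq[V] = {
  // Step 1 (count): number of occurrences per key.
  val counts: Map[K, Int] =
    xs.groupBy(_._1).map { case (key, group) => key -> group.size }

  // Step 2 (top-k): the k keys with the highest counts.
  // Assumption: ties are broken arbitrarily, since Map iteration order is unspecified.
  val topKeys: Set[K] =
    counts.toSeq.sortBy { case (_, n) => -n }.take(k).map(_._1).toSet

  // Step 3 (join and filter): keep values of the selected keys, in original order.
  xs.collect { case (key, value) if topKeys(key) => value }
}

val result = topKValues(things, 1)
```

Using a Set of the selected keys plays the role of the join: each element is checked against the set instead of being merged with a table of counts.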