
I have a Hive table with account numbers and their most recent update dates. Not every account is updated each day, so I can't simply select all records from a certain day. I need to group by account number and then sort in descending order to take the most recent 2 days for each account. My script so far:

sc.setLogLevel("ERROR")
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val df1 = sqlContext.sql("FROM mydb.mytable SELECT account_num, last_updated")
val DFGrouped = df1.groupBy("account_num").orderBy(desc("last_updated"))

I'm getting an error on the orderBy:

value orderBy is not a member of org.apache.spark.sql.GroupedData

Any idea on what I should be doing here?

AJD
  • It looks like you want to order the elements within the sub-groups. If so, you might try something like: `groupBy(...).map(_.orderBy(...))` – jwvh Jul 26 '16 at 20:45
  • @jwvh It won't work. `groupBy` is just a logical operation. – zero323 Jul 26 '16 at 22:14

1 Answer


Grouping will not work here because this is a form of the top-N-per-group problem.

You need to use Spark SQL window functions: rank() partitioned by account ID and ordered by date descending, then keep only the rows with rank <= 2.
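For example, a minimal sketch using the DataFrame window API, assuming the account_num and last_updated columns from the question (adjust the names if your table differs):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// partition by account, most recent date first
val w = Window.partitionBy("account_num").orderBy(desc("last_updated"))

// rank each row within its account and keep the top 2 dates
val top2 = df1
  .withColumn("rk", rank().over(w))
  .where(col("rk") <= 2)
  .drop("rk")

Note that in Spark 1.x window functions require a HiveContext, which your script already uses.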

Sim