
How can I get the top-n (let's say top 10 or top 3) per group in Spark SQL?

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ provides a tutorial for general SQL. However, Spark does not implement subqueries in the WHERE clause.
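For concreteness, the tutorial's pattern expressed as a Spark SQL string would look roughly like the sketch below (the table and column names are hypothetical, and sqlContext is assumed to exist); the correlated subquery in the WHERE clause is exactly the construct Spark rejects:

    // Hypothetical "sales" table with columns grp and value; top 3 rows per grp.
    // The correlated subquery in WHERE is what Spark SQL cannot parse here.
    val top3 = sqlContext.sql("""
      SELECT grp, value
      FROM sales s
      WHERE (SELECT COUNT(*)
             FROM sales t
             WHERE t.grp = s.grp
               AND t.value > s.value) < 3
    """)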


1 Answer


You can use the window function feature that was added in Spark 1.4. Suppose we have a productRevenue table with the columns product, category, and revenue, as shown below.

[image: the example productRevenue table]

The answer to "What are the best-selling and second best-selling products in every category?" is as follows:

SELECT product, category, revenue
FROM (
  SELECT product, category, revenue,
         dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank
  FROM productRevenue
) tmp
WHERE rank <= 2

This will give you the desired result.
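For reference, here is a minimal sketch of the same query via the DataFrame API (assuming productRevenue is available as a DataFrame; dense_rank is the function name from Spark 1.6 on, older releases call it denseRank):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, dense_rank}

    // Rank rows within each category by descending revenue
    val byCategory = Window.partitionBy("category").orderBy(col("revenue").desc)

    // Keep the two top-ranked products per category;
    // change the literal 2 for a different top-n
    val top2 = productRevenue
      .withColumn("rank", dense_rank().over(byCategory))
      .where(col("rank") <= 2)
      .select("product", "category", "revenue")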

  • This works great in Scala. However, as a SQL string this fails with a strange error, as described here: https://gist.github.com/geoHeil/3dff11860ae042792cea6970447c4592 (failure: ``union'' expected but `(' found) – Georg Heiler Apr 16 '16 at 14:24
  • The solution is: http://stackoverflow.com/questions/31786912/spark-failure-union-expected-but-found – Georg Heiler Apr 16 '16 at 14:58
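For anyone hitting the same parser error: as far as I can tell, the linked fix boils down to running the SQL string through a HiveContext, whose parser supports window functions on Spark 1.4/1.5, while the plain SQLContext parser does not. A hedged sketch (sc is an existing SparkContext, and productRevenue is assumed to be registered as a temp table in this context):

    import org.apache.spark.sql.hive.HiveContext

    // The plain SQLContext parser rejects OVER (...) with the
    // "union expected but ( found" error; HiveContext's parser does not.
    val hiveContext = new HiveContext(sc)

    val top2 = hiveContext.sql("""
      SELECT product, category, revenue FROM
        (SELECT product, category, revenue,
                dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank
         FROM productRevenue) tmp
      WHERE rank <= 2
    """)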