
How can I get the top-n (let's say top 10 or top 3) per group in Spark SQL?

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ provides a tutorial for general SQL. However, Spark does not implement subqueries in the WHERE clause.
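For concreteness, the tutorial's pattern expressed as a Spark SQL string would look roughly like the sketch below (the table and column names are hypothetical, and sqlContext is assumed to exist); the correlated subquery in the WHERE clause is exactly the construct Spark rejects:

    // Hypothetical "sales" table with columns grp and value; top 3 rows per grp.
    // The correlated subquery in WHERE is what Spark SQL cannot parse here.
    val top3 = sqlContext.sql("""
      SELECT grp, value
      FROM sales s
      WHERE (SELECT COUNT(*)
             FROM sales t
             WHERE t.grp = s.grp
               AND t.value > s.value) < 3
    """)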


1 Answer


You can use the window function feature that was added in Spark 1.4. Suppose we have a productRevenue table with the columns product, category, and revenue, as shown below.

[image: the example productRevenue table]

The answer to "What are the best-selling and second best-selling products in every category?" is as follows:

SELECT product, category, revenue
FROM (
  SELECT product, category, revenue,
         dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank
  FROM productRevenue
) tmp
WHERE rank <= 2

This will give you the desired result.
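For reference, here is a minimal sketch of the same query via the DataFrame API (assuming productRevenue is available as a DataFrame; dense_rank is the function name from Spark 1.6 on, older releases call it denseRank):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, dense_rank}

    // Rank rows within each category by descending revenue
    val byCategory = Window.partitionBy("category").orderBy(col("revenue").desc)

    // Keep the two top-ranked products per category;
    // change the literal 2 for a different top-n
    val top2 = productRevenue
      .withColumn("rank", dense_rank().over(byCategory))
      .where(col("rank") <= 2)
      .select("product", "category", "revenue")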

  • This works great in Scala. However, as a SQL string this fails with a strange error, as described here: https://gist.github.com/geoHeil/3dff11860ae042792cea6970447c4592 (failure: ``union'' expected but `(' found) – Georg Heiler Apr 16 '16 at 14:24
  • The solution is: http://stackoverflow.com/questions/31786912/spark-failure-union-expected-but-found – Georg Heiler Apr 16 '16 at 14:58
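For anyone hitting the same parser error: as far as I can tell, the linked fix boils down to running the SQL string through a HiveContext, whose parser supports window functions on Spark 1.4/1.5, while the plain SQLContext parser does not. A hedged sketch (sc is an existing SparkContext, and productRevenue is assumed to be registered as a temp table in this context):

    import org.apache.spark.sql.hive.HiveContext

    // The plain SQLContext parser rejects OVER (...) with the
    // "union expected but ( found" error; HiveContext's parser does not.
    val hiveContext = new HiveContext(sc)

    val top2 = hiveContext.sql("""
      SELECT product, category, revenue FROM
        (SELECT product, category, revenue,
                dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank
         FROM productRevenue) tmp
      WHERE rank <= 2
    """)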