Hive QL - Limiting number of rows per each item

Question

If I have multiple items listed in a where clause How would one go about limiting the results to N for each item in the list?

EX:

select a_id,b,c, count(*), as sumrequests
from table_name
where
a_id in (1,2,3)
group by a_id,b,c
limit 10000

In worst case scenario you could do it with Apache PIG see example http://stackoverflow.com/questions/11534041/removing-duplicates-using-piglatin/11556857#11556857. As far as I am aware PIG can read data in Hive, see http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/HiveColumnarLoader.html — alexeipab, Aug 02 '12 at 12:16
Take a look on http://stackoverflow.com/questions/9390698/hive-getting-top-n-records-in-group-by-query. Your problem is kinda special case of that. — Ivan Klass, Jun 18 '13 at 11:00

score 6 · Answer 1 · answered Feb 06 '14 at 05:05

Sounds like your question is to get the top N per a_id. You can do this with a window function, introduced in Hive 11. Something like:

SELECT a_id, b, c, count(*) as sumrequests
FROM (
    SELECT a_id, b, c, row_number() over (Partition BY a_id) as row
    FROM table_name
    ) rs
WHERE row <= 10000
AND a_id in (1, 2, 3)
GROUP BY a_id, b, c;

This will output up to 10,000 randomly-chosen rows per a_id. You can partition it further if you're looking to group by more than just a_id. You can also use order by in the window functions, there are a lot of examples out there to show additional options.

Hive now supports a native rank() function. See https://blogs.oracle.com/taylor22/entry/hive_0_11_may_15 for a great explanation. — irhetoric, May 21 '15 at 16:52

Hive QL - Limiting number of rows per each item

1 Answers1

Linked