how can HQL limit the number of records in every group?

Question

Here's the scenario: table record is as follows:

A | B | C

1 | 1 | 1
2 | 1 | 1
3 | 1 | 1
4 | 1 | 2
5 | 1 | 2
6 | 1 | 3

the result of HQL: select * from record where B = 1 and C < 3 limit 2 would be:

A | B | C

1 | 1 | 1
2 | 1 | 1

But what I want is:

A | B | C

1 | 1 | 1
2 | 1 | 1
4 | 1 | 2
5 | 1 | 2

That is: To limit the number of record in every condition, not limit the final number of records returned.

I really need this to be done just in hive. Could anyone give me an idea? Thanks a lot!

To Summary Here's a nice way to resolve this problem: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

libjack · Answer 1 · 2013-10-17T14:54:53.877

2

You should be able to do this with UNION of queries.

SELECT A,B,C FROM
(
  SELECT A, B, C FROM record WHERE B = 1 limit 2
  UNION AL L
  SELECT A, B, C FROM record WHERE C < 3 limit 2
)

You don't say how rows which could pass both conditions should appear; can add DISTINCT to outer SELECT if required, or you might need to add to the WHERE clauses to handle the overlap conditions.

(yes, there should not be a space in ALL but, filtering protocols reject as it looks like sql injection)

Take 2

Based on comments, it seems it might be better to solve with analytics function as the goal is to get the top 2 from each grouping of B & C, with the constraint of B=1 & C < 3.

SELECT S.A, S.B, S.C
FROM
 (SELECT A, B, C, row_number() over (partition by B,C) as r FROM record where B=1 AND C < 3) S
WHERE S.r < 3

This will give the results as desired. In this case, because B=1, B is really not needed in the partition by clause. But if the limits for B or C change, then will be needed. Additionally, if there is some other relationship between B, C or their values, can modify with a if or case. For example, if C could be 0, but wanted to group that with C=1, you could do

(partition by B, if (C < 2,0,1))

edited Oct 17 '13 at 14:54

answered Oct 14 '13 at 15:54

libjack

6,403
2
28
36

Maybe I didn't convey that right. After testing, I don't think this can achieve what I want. You said **how rows which could pass both conditions should appear**, that is just like some kind of **`group by`, in every group, the max number of records should be limited**. In that example, when (b=1 and c=1), there should be at most 2 record, and this is the same to (b=1 and c=2) – Judking Oct 15 '13 at 00:52
**P.S.** If I didn't describe that clear, please refer to http://stackoverflow.com/questions/10421807/how-do-i-limit-the-number-of-rows-per-field-value-in-sql?rq=1 ,this is a solution in MySQL, but I don't know how to do the same thing in hive.Thanks! – Judking Oct 15 '13 at 00:55
I think that reference is a different problem, it wants the first 2 rows for every distinct value in column 1; and I just posted a solution for that. – libjack Oct 16 '13 at 16:11
It's somewhat the same problem. I just intend to get top n records for every group in one hql. – Judking Oct 17 '13 at 02:08

how can HQL limit the number of records in every group?

1 Answers1