12

For example, I have a table like this in Hive:

1 1
1 4
1 8
2 1
2 5
3 1
3 2

and I want to only return the first two rows of each unique value of the first column. I want this to be able to limit the amount of data that I transfer from Hive into MySQL for reporting purposes. I'd like a single HiveQL query that gives me this:

1 1
1 4
2 1
2 5
3 1
3 2
Andriy M
  • 76,112
  • 17
  • 94
  • 154
repalviglator
  • 2,196
  • 4
  • 16
  • 15
  • Don't these tables and columsn have names? – ypercubeᵀᴹ May 02 '12 at 22:12
  • 1
    Try searching this site by the [`greatest-n-per-group` + `mysql`](http://stackoverflow.com/questions/tagged/greatest-n-per-group+mysql?sort=votes&pagesize=50) tag combination and see if you can find a solution that fits your situation. – Andriy M May 02 '12 at 22:21
  • @repalviglator Hi, This solution in MySQL is great! But I want to do that in Hive. I see your tag includes hive, so could you tell me how to? Thanks a lot! – Judking Oct 15 '13 at 01:00

2 Answers2

8

Unluckily mysql doesn't have Analytical Functions. So you have to play with variables. Supposing you have an autoincrement field:

mysql> create table mytab (
    -> id int not null auto_increment primary key,
    -> first_column int,
    -> second_column int
    -> ) engine = myisam;
Query OK, 0 rows affected (0.05 sec)

mysql> insert into mytab (first_column,second_column)
    -> values
    -> (1,1),(1,4),(2,10),(3,4),(1,4),(2,5),(1,6);
Query OK, 7 rows affected (0.00 sec)
Records: 7  Duplicates: 0  Warnings: 0

mysql> select * from mytab order by id;
+----+--------------+---------------+
| id | first_column | second_column |
+----+--------------+---------------+
|  1 |            1 |             1 |
|  2 |            1 |             4 |
|  3 |            2 |            10 |
|  4 |            3 |             4 |
|  5 |            1 |             4 |
|  6 |            2 |             5 |
|  7 |            1 |             6 |
+----+--------------+---------------+
7 rows in set (0.00 sec)

mysql> select
    -> id,
    -> first_column,
    -> second_column,
    -> row_num
    -> from (
    -> select *,
    -> @num := if(@first_column = first_column, @num:= @num + 1, 1) as row_num,
    -> @first_column:=first_column as c
    -> from mytab order by first_column,id) as t,(select @first_column:='',@num:
=0) as r;
+----+--------------+---------------+---------+
| id | first_column | second_column | row_num |
+----+--------------+---------------+---------+
|  1 |            1 |             1 |       1 |
|  2 |            1 |             4 |       2 |
|  5 |            1 |             4 |       3 |
|  7 |            1 |             6 |       4 |
|  3 |            2 |            10 |       1 |
|  6 |            2 |             5 |       2 |
|  4 |            3 |             4 |       1 |
+----+--------------+---------------+---------+
7 rows in set (0.00 sec)

mysql> select
    -> id,
    -> first_column,
    -> second_column,
    -> row_num
    -> from (
    -> select *,
    -> @num := if(@first_column = first_column, @num:= @num + 1, 1) as row_num,
    -> @first_column:=first_column as c
    -> from mytab order by first_column,id) as t,(select @first_column:='',@num:
=0) as r
    -> having row_num<=2;
+----+--------------+---------------+---------+
| id | first_column | second_column | row_num |
+----+--------------+---------------+---------+
|  1 |            1 |             1 |       1 |
|  2 |            1 |             4 |       2 |
|  3 |            2 |            10 |       1 |
|  6 |            2 |             5 |       2 |
|  4 |            3 |             4 |       1 |
+----+--------------+---------------+---------+
5 rows in set (0.02 sec)
Nicola Cossu
  • 54,599
  • 15
  • 92
  • 98
  • 1) Doesn't work without `ORDER BY` clause. 2) `ORDER BY` column has to be a column you count. Otherwise this doesn't work. – Green Sep 25 '17 at 05:42
8

A Hive solution would be

SELECT S.col1, S.col2
FROM
 (SELECT col1, col2, row_number() over (partition by col1) as r FROM mytable) S
WHERE S.r < 3
libjack
  • 6,403
  • 2
  • 28
  • 36