3

This is the below table

CREATE TABLE IF NOT EXISTS TestingTable1 
( 
BUYER_ID BIGINT,
ITEM_ID BIGINT, 
CREATED_TIME STRING
)

And this is the below data in the above table-

BUYER_ID    |    ITEM_ID       |    CREATED_TIME
------------+------------------+-----------------------
1015826235      220003038067        2012-07-09 19:40:21,
1015826235      300003861266        2012-07-09 18:19:59,
1015826235      140002997245        2012-07-09 09:23:17,
1015826235      210002448035        2012-07-09 22:21:11,
1015826235      260003553381        2012-07-09 07:09:56,
1015826235      260003553382        2012-07-09 19:40:39,
1015826235      260003553383        2012-07-09 06:58:47,
1015826235      260003553384        2012-07-09 07:28:47,
1015826235      260003553385        2012-07-09 08:48:47,
1015826235      260003553386        2012-07-09 06:38:47,
1015826235      260003553387        2012-07-09 05:38:47,
1015826235      260003553388        2012-07-09 04:55:47,
1015826235      260003553389        2012-07-09 06:54:37,
34512201        597245693           2012-07-09 16:20:21,
34512201        8071787728          2012-07-09 15:19:59,
34512201        5868222883          2012-07-09 08:23:17,
34512201        2412180494          2012-07-09 22:21:11,
34512201        2422054205          2012-07-09 06:09:56,
34512201        1875744030          2012-07-09 19:40:39,
34512201        5639158173          2012-07-09 06:58:47,
34512201        5656232360          2012-07-09 07:28:47, 
34512201        959188449           2012-07-09 08:48:47,
34512201        4645350592          2012-07-09 06:38:47,
34512201        5657320532          2012-07-09 05:38:47,
34512201        290419656539        2012-07-09 04:55:47,

If you see the above data in the table, there are only two UNIQUE BUYER_ID and corresponding to those I have ITEM_ID AND CREATED_TIME. I need only 10 latest record basis on the time for the day before today's date whenever I will be firing this query (meaning yesterday's date) for each BUYER_ID.

So for this BUYER_ID - 34512201 I need 10 latest record for each BUYER_ID basis on CREATED_TIME for yesterday's date only.

And each BUYER_ID can have any day's data. But I am specifically interested for day before today's data(means yesterday's date always) by checking at the CREATED_TIME

Find TOP 10 latest data for each BUYER_ID for yesterday's date. Below is the sample output I should be getting corresponding to each BUYER_ID.

Sample Output.

BUYER_ID    |    ITEM_ID       |    CREATED_TIME
------------+------------------+-----------------------
34512201        2412180494          2012-07-09 22:21:11
34512201        1875744030          2012-07-09 19:40:39
34512201        597245693           2012-07-09 16:20:21
34512201        8071787728          2012-07-09 15:19:59
34512201        959188449           2012-07-09 08:48:47
34512201        5868222883          2012-07-09 08:23:17
34512201        5656232360          2012-07-09 07:28:47 
34512201        5639158173          2012-07-09 06:58:47
34512201        4645350592          2012-07-09 06:38:47
34512201        2422054205          2012-07-09 06:09:56
1015826235      210002448035        2012-07-09 22:21:11
1015826235      260003553382        2012-07-09 19:40:39
1015826235      220003038067        2012-07-09 19:40:21
1015826235      300003861266        2012-07-09 18:19:59
1015826235      140002997245        2012-07-09 09:23:17
1015826235      260003553385        2012-07-09 08:48:47
1015826235      260003553384        2012-07-09 07:28:47
1015826235      260003553381        2012-07-09 07:09:56
1015826235      260003553383        2012-07-09 06:58:47
1015826235      260003553389        2012-07-09 06:54:37

I am working with Hive and Hive supports SQL like syntax. So I need to make sure the SQL should work in Hive environment too.

Can anyone help me with this?

Update:-

I am using the below query and I need to get top 10 latest from the below query and need to add one more qualifier for date check, means in where clause for yesterday's date- I cannot use TOP 10 here as Hive doesn't support TOP 10 sql syntax. I need some other way to do this problem.

SELECT * FROM TestingTable1 WHERE ORDER BY buyer_id, created_time DESC;

One More UPDATE:-

I wrote this below query with the use of RANK UserDefinedFunction.

SELECT buyer_id, item_id, created_time, rk
FROM (
    SELECT buyer_id, item_id, rank(item_id) as rk, created_time
    FROM testingtable1
    DISTRIBUTE BY buyer_id, item_id
    SORT BY buyer_id, item_id, created_time desc
) a 
WHERE rk < 10
ORDER BY buyer_id, created_time, rk;

And this is the RANK UDF functions in Java-

package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;

    public final class Rank extends UDF{
        private int  counter;
        private String last_key;
        public int evaluate(final String key){
          if ( !key.equalsIgnoreCase(this.last_key) ) {
             this.counter = 0;
             this.last_key = key;
          }
          return this.counter++;
        }
    }

And above query is not work the way I wanted to, some sort of twist has to be made I guess in that query.

Is there any way to do this without using any UDF in HiveQL?

arsenal
  • 23,366
  • 85
  • 225
  • 331
  • 2
    You should say specifically which SQL Language you're using. – mawburn Jul 10 '12 at 01:33
  • 1
    I am working with Hive and Hive supports SQL Like syntax, so that is the reason I tag this question as `SQL`. I think oracle syntax will work fine here. – arsenal Jul 10 '12 at 01:37
  • @Bohemian, Any idea how this can be done? – arsenal Jul 10 '12 at 01:53
  • I don't know how the Rank function works in this DBMS, but the `DISTRIBUTE BY` clause looks like `PARTITION BY` in SQL Server, so I would guess you need to remove the `item_id` column from that clause. – ErikE Jul 10 '12 at 22:52
  • RANK is not a function, it's a UDF that I have created, you can see my edit that I just did in my question. By looking at the question, you will understand what I am talking about. Can you provide me the exact query that you just said to me. As that way I will understand more. – arsenal Jul 10 '12 at 22:53
  • Note that this answer has a **BUG**, you must use another subquery in which you use DISTRIBUTE/SORT BY and use rank() from outer query. See http://stackoverflow.com/a/15805396/191069 – Hai-Anh Trinh Apr 04 '13 at 07:50

3 Answers3

2
  SELECT FIRST 10 *
    FROM TestingTable1
   WHERE buyer_id = 34512201
ORDER BY created_time DESC;
Joe R.
  • 2,032
  • 4
  • 36
  • 72
  • Thanks Frank for commenting out. I am not specific to that `buyer_id`. It can be many. And I don't need First 10. I need latest for that particular day. By latest means last 10 time for that particular `BUYER_ID` in desc order – arsenal Jul 10 '12 at 01:53
  • Since you gave that example, I used it for my example :) .. well, by ORDERing created_time in DESCending, the FIRST 10 would be the latest 10 rows? ..ahh, why is created_time not defined as a DATETIME datatype?.. and if you want a previous days results, you can add a qualifier to the WHERE clause. – Joe R. Jul 10 '12 at 01:58
  • I have no idea why they didn't defined as `DATETIME` datatype. I am working on someone else design. They have defined that as a string. So I have to use that I guess. I have also posted sample output. – arsenal Jul 10 '12 at 02:01
  • then you can cast that string to a DATETIME. The STRING is in proper DATETIME YEAR TO SECOND format! – Joe R. Jul 10 '12 at 02:02
  • And how can I enable this to all the `BUYER_ID` as currently it has enabled for only one. Can you provide me the example which can work in my scenario. it will be of great help to me. And `FIRST 10.*` will not work with Hive, as I am working in Hive and Hive supports SQL like syntax. – arsenal Jul 10 '12 at 02:05
  • I'm not familiar with Hive, this example works for Informix, however not much different for other DB's.. remove buyer_id qualifier from the WHERE clause and ORDER BY buyer_id, created_time DESC. – Joe R. Jul 10 '12 at 02:08
  • I tried using that and I am working with Hive and `FIRST 10.*` doesn't working in that. Is there any other way apart from that? – arsenal Jul 10 '12 at 02:10
  • not sure.. there must be some equivalent for obtaining the FIRST/LAST 10 rows of the result set in hive?.. perhaps some other ORM, like Hibernate, provides examples of how to obtain FIRST 10 rows? – Joe R. Jul 10 '12 at 02:12
  • I just did the searching, but couldn't found out anything. – arsenal Jul 10 '12 at 02:13
  • I think using TOP 10 will do the trick! http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/ – Joe R. Jul 10 '12 at 02:15
  • I tried using this `SELECT TOP 10 FROM TestingTable1 WHERE ORDER BY buyer_id, created_time DESC;` and it doesn't works. – arsenal Jul 10 '12 at 02:17
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/13642/discussion-between-frank-computer-and-rjchar) – Joe R. Jul 10 '12 at 02:31
  • Could HQL LIMIT 10 or using COUNT work?.. see https://karmasphere.com/20101004185/Karmasphere-Analyst/hive-queries-on-table-data#syntax_of_sort_by – Joe R. Jul 10 '12 at 03:18
  • LIMIT will restrict the whole query to 10 but I need TOP 10 for each BUYER_ID. That's is causing more problem. – arsenal Jul 10 '12 at 04:01
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/13644/discussion-between-frank-computer-and-rjchar) – Joe R. Jul 10 '12 at 05:23
  • @Frank I think you're missing something here.. The asker wants the top 10 for EACH GROUP. Not just a single group. – ErikE Jul 10 '12 at 22:53
0

I am late answering this and I am sure you must be knowing the use of row_number function with Hive. Just an addition as a reference to previously good discussion.

select * from
(select buyer_id,item_id,created_time, row_number() over(partition by buyer_id over
created_time asc) row_num from yourtable)tab
where tab.row_num<=5;
vikrant rana
  • 4,509
  • 6
  • 32
  • 72
0
select * 
from (select buyer_id,item_id,created_time,row_num() over (partition by buyer_id order by created_time DESC)) a 
where a.row_num<=10
Dharman
  • 30,962
  • 25
  • 85
  • 135
Sanskar Suman
  • 63
  • 1
  • 12
  • 1
    Please add some commentary to your answer. *Why* will this work, and how is it different than the other answers. Also, maybe you didn't realize, but this question is 7 years old. – Gabriel Luci Sep 04 '19 at 17:36