Finding top 10 trending tweets in Hive

Question

I am finding top 10 trending tweets in hive on basis of retweet_count ie The tweet which has highest retweet_count will be 1st and so-on....

Here is election Table details

id                      bigint                  from deserializer   
created_at              string                  from deserializer   
source                  string                  from deserializer   
favorited               boolean                 from deserializer   
retweeted_status        struct<text:string,user:struct<screen_name:string,name:string>,retweet_count:int>   from deserializer   
entities                struct<urls:array<struct<expanded_url:string>>,user_mentions:array<struct<screen_name:string,name:string>>,hashtags:array<struct<text:string>>> from deserializer   
text                    string                  from deserializer   
user                    struct<screen_name:string,name:string,friends_count:int,followers_count:int,statuses_count:int,verified:boolean,utc_offset:int,time_zone:string,location:string>    from deserializer   
in_reply_to_screen_name string                  from deserializer

my query

select text 
from election 
where retweeted_status.retweet_count IN  
     (select  retweeted_status.retweet_count as zz 
      from election  
      order by zz desc  
      limit 10);

It is returning me the same tweet 10 times. (TWEET-ABC, TWEET-ABC, TWEET-ABC, . . . TWEET-ABC)

So what I did is break the nested query , when I run inner query

select  retweeted_status.retweet_count as zz 
from election  
order by zz desc  
limit 10

It return 10 different values (1210,1209,1208,1207,1206,....1201)

Afterwards when I run my outer query

select text 
from election  
where retweeted_status.retweet_count 
      IN  (1210,1209,1208,1207,1206,....1201 );

The results are same 10 tweets (TWEET-ABC, TWEET-ABC, TWEET-ABC, . . . TWEET-ABC)

Whats wrong in my query logic ?

Im not familiar with tweets and re-tweets... but maybe you have most retweet `{tweet-abc, retweet 1209}`, then you retweet again you have `{tweet-abc, retweet 1210}` so you most retweet is the same... just is \growing — Juan Carlos Oropeza, Jul 29 '16 at 04:51

Juan Carlos Oropeza · Answer 1 · 2016-07-29T04:53:21.387

0

Instead of using count you should use id. that is because if you have 100 tweet with same count not matter the LIMIT 10 you will get 100 records.

select text 
from election 
where id  IN  
     (select  id as zz 
      from election  
      order by retweeted_status.retweet_count desc  
      limit 10);

but still not sure why are you getting the wrong result.

EDIT (after my comment):

if my comment is correct then you will have the same id ten times. In that case change to

     (select distinct id as zz 
      from election  
      order by retweeted_status.retweet_count desc  
      limit 10);

edited Jul 29 '16 at 04:53

answered Jul 29 '16 at 04:48

Juan Carlos Oropeza

47,252
12
78
118

select id from election order by retweeted_status.retweet_count desc limit 10; It is giving me an error "SemanticException [Error 10004]: Line 1:38 Invalid table alias or column reference 'retweeted_status': (possible column names are: id)" – rUCHIt Jul 29 '16 at 05:14
select * from election order by retweeted_status.retweet_count desc limit 10; is working but select id from election order by retweeted_status.retweet_count desc limit 10; is not working error is "Invalid table alias or column reference 'retweeted_status': (possible column names are: id)" – rUCHIt Jul 29 '16 at 05:23
The logic is there. I dont know how hive handle complex object or your model. Please check your query. – Juan Carlos Oropeza Jul 29 '16 at 15:53
select * from election order by retweeted_status.retweet_count ; works wel .select id from election order by retweeted_status.retweet_count: gives error ""Invalid table alias or column reference 'retweeted_status': (possible column names are: id)"" – rUCHIt Jul 29 '16 at 17:49
again you know your schema, you say there is one `id` column there. and you understand my comment about why is the same tweet as the most retweeted? – Juan Carlos Oropeza Jul 29 '16 at 18:05
select text from election where id IN ( select distinct a.id from (select id,retweeted_status.retweet_count from election order by retweet_count desc limit 10) as a ) ; This query also returns me all same result – rUCHIt Jul 30 '16 at 01:23

Finding top 10 trending tweets in Hive

1 Answers1