2

I have the below data:

================================================================
session_id                       screen_name  screen_launch_time
================================================================
990004916946605-1404157897784    screen1      1404157898275
990004916946605-1404157897784    screen2      1404157898337
990004947764274-1435162269418    screen1      1435162274044
990004947764274-1435162269418    screen3      1435162274081

I would like to use an array_agg function to get my data in the below format:

=========================================================
session_id                      screen_flow         count
=========================================================
990004916946605-1404157897784   screen1->screen2    1
990004947764274-1435162269418   screen1->screen3    1

Has anyone tried writing a UDAF or python script to implement the logic used in an array_agg function?

Kindly share your thoughts.

o-90
  • 17,045
  • 10
  • 39
  • 63
explorethis
  • 61
  • 2
  • 9
  • 1
    Hive has a built in `collect_set()` and `collect_list()` which aggregates items to an array. There is a udf here that does the same https://github.com/klout/brickhouse/tree/master/src/main/java/brickhouse/udf/collect – o-90 Sep 18 '15 at 14:47
  • Hi, it gives me this error: – explorethis Sep 18 '15 at 15:17
  • FAILED: ParseException line 1:0 character '' not supported here – explorethis Sep 18 '15 at 15:17
  • 1
    I have absolutely no idea what you are trying to do because you asked a very generic question with zero code. If you need help with code, it would be beneficial to post said code and re-state your problem in a more specific manner. – o-90 Sep 18 '15 at 15:24
  • I have a table with columns -[ session_id string, screens string, screen_launch_time bigint] sample values: – explorethis Sep 18 '15 at 16:23
  • I have a table with columns -[ session_id string, screens string, screen_launch_time bigint] sample values: 990004916946605-1404157897784 screen1 1404157898275 990004916946605-1404157897784 screen2 1404157898337 990004947764274-1435162269418 screen1 1435162274044 990004947764274-1435162269418 screen3 1435162274081 – explorethis Sep 18 '15 at 16:29
  • Now, I want to have the above data in below format: session screen_flow count 990004916946605-1404157897784 screen1->screen2 1 (this count is for all the sessions having the same screen flow) 990004947764274-1435162269418 screen1->screen3 1 (this count is for all the sessions having the same screen flow) Do I make sense now :) ? – explorethis Sep 18 '15 at 16:30
  • You need to hit `edit` and type all that into your question. – o-90 Sep 18 '15 at 17:29
  • I added in my question :) – explorethis Sep 18 '15 at 18:39

2 Answers2

3

Just group by the session_id, concat screen_name, and count the records per group. If you don't want to build the brickhouse jar, you can use collect_list() instead of collect() (but I don't recommend it).

Query:

add jar /path/to/jars/brickhouse-0.7.1.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";

select session_id, screen_flow
  , count(*) count
from (
  select session_id
    , concat_ws('->', collect(screen_name)) screen_flow
  from db.table
  group by session_id ) x
group by session_id, screen_flow

Output:

990004916946605-1404157897784   screen1->screen2    1
990004947764274-1435162269418   screen1->screen3    1
o-90
  • 17,045
  • 10
  • 39
  • 63
  • Good one GoBrewers.. I also need to do 2 other things: 1. remove duplicate screens (if any) after grouping by session id 2. rank them based on max counts. Can you help on this? – explorethis Sep 20 '15 at 05:17
  • 1
    If my answer solves your original problem, mark as correct and then if you have additional questions, ask a new question and I will certainly help if I can. – o-90 Sep 20 '15 at 05:21
  • hi GoBrewers, here is the link for new question - http://stackoverflow.com/questions/32681157/how-to-find-the-pathing-flow-and-rank-them-using-pig-or-hive – explorethis Sep 20 '15 at 15:41
  • I would love to hear the reasoning as to why my answer wasn't marked as correct, especially after the effort I put into helping you on this question. – o-90 Sep 20 '15 at 15:41
  • Your answer was correct GoBrewer, I clicked on the tick mark.. Is there a different way to mark it as correct ? – explorethis Sep 20 '15 at 15:54
  • GoBrewer, this is my first attempt to post question, so help me understand how to mark your answer as correct :) – explorethis Sep 20 '15 at 16:39
  • GoBrewer, did you get my reply? – explorethis Sep 20 '15 at 19:16
  • GoBrewer, any thoughts on how we can remove subsequent screen names if they are same and ranking? I'm impressed with your approach and Your hlep is much appreciated. – explorethis Sep 21 '15 at 15:56
1

Input :-

990004916946605-1404157897784,screen1,1404157898275
990004916946605-1404157897784,screen2,1404157898337
990004947764274-1435162269418,screen1,1435162274044
990004947764274-1435162269418,screen3,1435162274081

Below is answer in Pig Style..

records = LOAD '/user/user/inputfiles/session_id.txt' USING PigStorage(',') AS (session_id:chararray,screen_name:chararray,screnn_launch_time:chararray);

rec_grped = GROUP records BY session_id;

rec_each =  FOREACH rec_grped 
                     {
                       rec_inner_each = FOREACH records GENERATE screen_name;

                       GENERATE group as session_id, REPLACE(BagToString(rec_inner_each),'_','-->') as screen_flow, 1 as cnt;
};

dump rec_each; 

output :-

990004916946605-1404157897784   screen1-->screen2    1
990004947764274-1435162269418   screen1-->screen3    1
Surender Raja
  • 3,553
  • 8
  • 44
  • 80
  • Thanks Surender. Good one. I also need to do 2 other things: 1. remove duplicate screens (if any) after grouping by session id 2. rank them based on max counts. Can you help on this? – explorethis Sep 20 '15 at 05:16
  • Ok.. For duplicate scenario, just give me input and expected output.. Also it would be good if you ask it as a separate question..thanks – Surender Raja Sep 20 '15 at 07:55
  • Hi Surender, I have posted it as a separate question - – explorethis Sep 20 '15 at 15:40
  • link for the separate question: http://stackoverflow.com/questions/32681157/how-to-find-the-pathing-flow-and-rank-them-using-pig-or-hive – explorethis Sep 20 '15 at 15:40
  • This answer is incorrect. Grouping to a bag doesn't preserve order. – o-90 Sep 20 '15 at 15:49
  • @GoBrewers I executed that code above,It s been tested and it gives the above pasted output .Could you explain little detail about the issue – Surender Raja Sep 20 '15 at 16:10
  • 1
    When u group and create a bag, its contents may (but will not necessarily) be in order. So this code works on this toy example but could potentially produce incorrect output. As in you could have a scenario where the user went `screen1 -> screen2` but the bag rearranges then and it spits out `screen2 -> screen1`. This is incorrect. My solution uses a UDF that preserves the order of the items being collected. – o-90 Sep 20 '15 at 16:27
  • 1
    Correct, I Will try to fix this in another question posted by same user – Surender Raja Sep 20 '15 at 16:35
  • Good catch GoBrewer :) – explorethis Sep 20 '15 at 16:38
  • @SurenderRaja will you be able to help me with a pig script for the hive query given by go brewer in below link http://stackoverflow.com/questions/32681157/how-to-find-the-pathing-flow-and-rank-them-using-pig-or-hive/32699912#32699912 – explorethis Sep 29 '15 at 03:02
  • @SurenderRaja - Will you be able to help me with a pig script for the hive query given by go brewer in below link pls ? http://stackoverflow.com/questions/32681157/how-to-find-the-pathing-flow-and-rank-them-using-pig-or-hive/32699912#32699912 – explorethis Sep 29 '15 at 03:04
  • @Surender Raja - did you have a chance to see my comment? – explorethis Sep 29 '15 at 13:45
  • just 5 mins.. but I cannot produce the result in pig without JAVA UDF. I will give you solution Pig with JAVA UDF – Surender Raja Sep 29 '15 at 14:05