I have sample data with the following fields:

user_id, date, accessed url, session time

I need each user's top 3 interests (URLs), ranked by session time.

I got the result below using this code:

top3 = FOREACH DataSet {
    sorted = ORDER DataSet BY sessiontime DESC;
    lim    = LIMIT sorted 3;
    GENERATE FLATTEN(group), FLATTEN(lim);
};

Output:

    (1,20,url1,2484)
    (1,20,url2,1863)
    (1,20,url3,1242)
    (2,22,url4,484)
    (2,22,url5,63)
    (2,22,url6,42)
    (3,25,url7,500)
    (3,25,url8,350)
    (3,25,url9,242)

But I want my output to be like this:

(1,20,url1,url2,url3)
(2,22,url4,url5,url6)
(3,25,url7,url8,url9)

Please help.

Albert Laure
sravani malla
Please provide a sample of the input, as well as the code that you used to group the sets. Without that it is difficult to tell what is going on. – Davis Broda Oct 17 '13 at 13:41

1 Answer

You are close. The problem is that you FLATTEN the bag of URLs when you really want to keep them all in one record. So do this instead:

top3 = FOREACH DataSet {
    sorted = ORDER DataSet BY sessiontime DESC;
    lim    = LIMIT sorted 3;
    GENERATE FLATTEN(group), lim.url;
};

Based on the output you got, you will now get

(1,20,{(url1),(url2),(url3)})
(2,22,{(url4),(url5),(url6)})
(3,25,{(url7),(url8),(url9)})

Note that the URLs are contained inside a bag. If you want to have them as three top-level fields, you will need to use a UDF to convert a bag into a tuple, and then FLATTEN that.
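
As a rough sketch of that last step: newer Pig releases ship a built-in `BagToTuple` function that can stand in for a hand-written UDF, assuming your version includes it. The input path `sessions.csv`, the relation names `raw` and `grouped`, and the field types below are assumptions made for illustration, since the question does not show the loading or grouping code.

```pig
-- Hypothetical input file, relation names, and schema; adjust to the real data.
raw = LOAD 'sessions.csv' USING PigStorage(',')
      AS (user_id:int, date:int, url:chararray, sessiontime:int);

-- One group per (user_id, date) pair; each group holds a bag named 'raw'.
grouped = GROUP raw BY (user_id, date);

top3 = FOREACH grouped {
    sorted = ORDER raw BY sessiontime DESC;
    lim    = LIMIT sorted 3;
    -- BagToTuple turns the bag {(url1),(url2),(url3)} into the tuple (url1,url2,url3);
    -- FLATTEN then promotes the tuple's fields to top-level fields.
    GENERATE FLATTEN(group), FLATTEN(BagToTuple(lim.url));
};

DUMP top3;
```

With the sample data from the question, this should yield records shaped like (1,20,url1,url2,url3). Keep in mind that a group with fewer than three URLs would produce a shorter record, so downstream code should not assume a fixed number of fields.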

reo katoa