0

I need to get the first and last record for a user if one of the key fields is different over time using a Hive table:

This is some sample data:

UserID  EntryDate   Activity
a3324   1/1/16  walk
a3324   1/2/16  walk
a3324   1/3/16  walk
a3324   1/4/16  run
a5613   1/1/16  walk
a5613   1/2/16  walk
a5613   1/3/16  walk
a5613   1/4/16  walk

And I'm looking for output preferably like this:

a3324   1/1/16  walk    1/4/16  run

Or at least like this:

a3324   walk    run

I start writing code like this:

SELECT UserID, MINIMUM(EntryDate), MAXIMUM(EntryDate), Activity
FROM
     SELECT UserID, DISTINCT Activity
     GROUP BY UserID
     HAVING Count(Activity) > 1

But I know that's not it.

I'd also like to be able to specify the cases where the original activity was Walk and the second activity was Run perhaps in the Where clause.

Can you help with an approach?

Thanks

Jazzmine
  • 1,837
  • 8
  • 36
  • 54

2 Answers2

0
SELECT
    t.UserId
    ,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.EntryDate END) as MinEntryDate
    ,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.Activity END) as MinActivity
    ,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.EntryDate END) as MaxEntryDate
    ,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.Activity END) as MaxActivity
FROM
    (
       SELECT
          UserId
          ,EntryDate
          ,Activity
          ,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate) as RowNumAsc
          ,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate DESC) as RowNumDesc
       FROM
          Table
    ) t
WHERE
    t.RowNumAsc = 1
    OR t.RowNumDesc = 1
GROUP BY
    t.UserId

Looks like window functions are supported (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics) so using 2 row numbers 1 for EntryDate Ascending and another for Descending with Conditional Aggregation should get you to the answer.

And if you don't want to use Analytic Functions (window functions) you can use self left joins and conditional aggregation:

SELECT
    t.UserId
    ,MIN(CASE WHEN mn.UserId IS NULL THEN t.EntryDate END) as MinEntryDate
    ,MIN(CASE WHEN mn.UserId IS NULL THEN t.Activity END) as MinActivity
    ,MAX(CASE WHEN mx.UserId IS NULL THEN t.EntryDate END) as MaxEntryDate
    ,MAX(CASE WHEN mx.UserId IS NULL THEN t.Activity END) as MaxActivity
FROM
    Table t
    LEFT JOIN Table mn
    ON t.UserId = mn.UserId
    AND t.EntryDate > mn.EntryDate
    LEFT JOIN Table mx
    ON t.UserId = mx.UserId
    AND t.EntryDate < mx.EntryDate
WHERE
    mn.UserId IS NULL
    OR mx.UserId IS NULL
GROUP BY
    t.UserId

Or a correlated Sub Query way:

SELECT
    UserId
    ,MIN(EntryDate) as MinEntryDate
    ,(SELECT
          Activity
       FROM
          Activity a
       WHERE
          u.UserId = a.UserId
          AND a.EntryDate = MIN(u.EntryDate)
       LIMIT 1
    ) as MinActivity
    ,MAX(EntryDate) as MaxEntryDate
    ,(SELECT
          Activity
       FROM
          Activity a
       WHERE
          u.UserId = a.UserId
          AND a.EntryDate = MAX(u.EntryDate)
       LIMIT 1
          ) as MaxActivity
FROM
    Activity u
GROUP BY
    UserId
Matt
  • 13,833
  • 2
  • 16
  • 28
  • Thanks I'm running the first one right now. Thanks for responding. – Jazzmine Dec 07 '16 at 17:16
  • The first one ended with this error: Invalid OperationHandle: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=ebbb29c8-f54e-4de1-b56b-1312b5508dbd] and the second one ended with this error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 11:8 Both left and right aliases encountered in JOIN. This may explain the second error: http://stackoverflow.com/questions/25821166/error-in-hive-query-while-joining-tables – Jazzmine Dec 07 '16 at 17:26
  • @Jazzmine what is the environment? software version etc? I admit I am new to hive but these are ansi standard queries and the latter would work in just about every other system so this is a quirk specific to hive. We can try to move the ineuqal on conditions in the second one to see if it will pass for you. give me a minute to rethink doing that and I will update – Matt Dec 07 '16 at 17:38
  • @Jazzmine I added a correlated subquery way that shouldn't have either of those issues. The only problem you may have is you may need to add an order by statement in the subquery so if the EntryDate could be duplicated per user (e.g. ties) to get the specific activity you want. – Matt Dec 07 '16 at 17:58
0

You can use lag /lead to get a solution

 SELECT * FROM (
    select UserID  ,EntryDate ,  Activityslec, 
    lead(Activityslec, 1) over (UserID  ,EntryDate ) as nextActivityslec 
    from table) as A
 where Activityslec <> nextActivityslec
Ravinder Karra
  • 307
  • 1
  • 3
  • 8
  • Hi Ravinder, were you able to test this in hive, I like the concept but I'm getting an error around line 3. Thanks – Jazzmine Dec 07 '16 at 17:50
  • this is my sample data SQL SELECT * FROM ( select * , lead(order_item_quantity, 1) over (order by order_item_order_id , order_item_product_id ) as next from order_items where order_item_order_id in (68873,68860, 68842) ) as A WHERE order_item_quantity <> next ; Can you send me error message too and Table name – Ravinder Karra Dec 07 '16 at 18:07
  • Version hive/2.4.0.0-169/ – Ravinder Karra Dec 07 '16 at 18:18
  • @Jazzmine any luck or still having issue – Ravinder Karra Dec 14 '16 at 03:51