2

I have this table.

   +------------------------------------------------------------+
   |     ks      |      time     |     val1      |    val2      | 
   +-------------+---------------+---------------+--------------+
   |     A       |       1       |       1       |      1       |
   |     B       |       1       |       3       |      5       |
   |     A       |       2       |       6       |      7       |
   |     B       |       2       |      10       |     12       |
   |     A       |       4       |       6       |      7       |
   |     B       |       4       |      20       |     26       |
   +------------------------------------------------------------+

What I want to get is for each row,

ks |  time |  val1 | val1 of next ts of same ks  |

To be clear, result of above example should be,

   +------------------------------------------------------------+
   |     ks      |      time     |     val1      |   next.val1  | 
   +-------------+---------------+---------------+--------------+
   |     A       |       1       |       1       |       6      |
   |     B       |       1       |       3       |       10     |
   |     A       |       2       |       6       |       6      |
   |     B       |       2       |      10       |       20     |
   |     A       |       4       |       6       |      null    |
   |     B       |       4       |      20       |      null    |
   +------------------------------------------------------------+

(I need the same next for value2 as well)

I tried a lot to come up with a hive query for this, but still no luck. I was able to write a query for this in sql as mentioned here (Quassnoi's answer), but couldn't create the equivalent in hive because hive doesn't support subqueries in select.

Can someone please help me achieve this?

Thanks in advance.

EDIT:

Query I tried was,

SELECT ks, time, val1, next[0] as next.val1 from
(SELECT ks, time, val1
       COALESCE(
       (
       SELECT Val1, time
       FROM myTable mi
       WHERE mi.val1 > m.val1 AND mi.ks = m.ks
       ORDER BY time
       LIMIT 1
       ), CAST(0 AS BIGINT)) AS next
FROM  myTable m
ORDER BY time) t2;
Community
  • 1
  • 1
Bee
  • 12,251
  • 11
  • 46
  • 73

2 Answers2

2

Your query seems quite similar to the "year ago" reporting that is ubiquitous in financial reporting. I think a LEFT OUTER JOIN is what you are looking for.

We join table myTable to itself, naming the two instances of the same table m and n. For every entry in the first table m we will attempt to find a matching record in n with the same ks value but an incremented value of time. If this record does not exist, all column values for n will be NULL.

SELECT 
    m.ks, 
    m.time,
    m.val1, 
    n.val1 as next_val1,
    m.val2, 
    n.val2 as next_val2
FROM 
    myTable m
LEFT OUTER JOIN
    myTable n
ON (
    m.ks = n.ks
AND 
    m.time + 1 = n.time
);

Returns the following.

ks  time  val1  next_val1  val2  next_val2
A   1     1     6          1     7
A   2     6     6          7     7
A   3     6     NULL       7     NULL
B   1     3     10         5     12
B   2     10    20         12    26
B   3     20    NULL       26    NULL

Hope that helps.

Lukas Vermeer
  • 5,920
  • 2
  • 16
  • 19
  • Hi Lukas, thank you for your reply, but my scenario is a little different. I'm sorry my example has mislead you. in this case, time is not necessarily incremented by 1 always. it can have larger gaps too. so we can't use `m.time + 1 = n.time`. Thanks. – Bee May 16 '13 at 18:55
  • 1
    In that case I would suggest you use a subquery to first order myTable "by ks, time" and then add row numbers. You can then use the row numbers in the join above, rather than time. Hope that helps. – Lukas Vermeer May 17 '13 at 07:54
  • I had this idea but I couldn't find a way to add a row number column. can you please tell me how to do that after ordering? – Bee May 18 '13 at 06:31
  • 1
    Row numbering is not natively supported in Hive, [but you could use a UDF to do this](http://stackoverflow.com/questions/9288578/how-can-i-add-row-numbers-for-rows-in-pig-or-hive). – Lukas Vermeer May 20 '13 at 10:50
2

I find that using Hive custom map/reduce functionality works great to solve queries similar to this. It gives you the opportunity to consider a set of input and "reduce" to one (or more) results.

This answer discusses the solution.

The key is that you use CLUSTER BY to send all results with similar key value to the same reducer, hence same reduce script, collect accordingly, and then output the reduced results when the key changes, and start collecting for the new key.

Community
  • 1
  • 1
libjack
  • 6,403
  • 2
  • 28
  • 36