0

In Hive, I have two tables as shown below:

SELECT * FROM p_test; OK p_test.id p_test.age 01 1 02 2 01 10 02 11 Time taken: 0.07 seconds, Fetched: 4 row(s)

SELECT * FROM p_test2; OK p_test2.id p_test2.height 02 172 01 170 Time taken: 0.053 seconds, Fetched: 2 row(s)

I'm supposed to get the age differences between the same user in the p_test table. Hence, I run HiveQL via row_number function as following:

SELECT * FROM (SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t1 LEFT JOIN (SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t2 ON t2.id=t1.id AND t1.rn=(t2.rn+1) LEFT JOIN (SELECT * FROM p_test2) t_2 ON t_2.id = t1.id;

The result of it is :

t1.id t1.age t1.rn t2.id t2.age t2.rn t_2.id t_2.height 01 1 1 NULL NULL NULL 01 170 01 10 2 01 1 1 01 170 02 11 1 NULL NULL NULL 02 172 02 2 2 02 11 1 02 172 Time taken: 60.773 seconds, Fetched: 4 row(s)

It is all ok so far. However, If I move the condition which left joins table t1 and table t2 shown above to the last line as shown below:

SELECT * FROM (SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t1 LEFT JOIN (SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t2 LEFT JOIN (SELECT * FROM p_test2) t_2 ON t_2.id = t1.id AND t2.id=t1.id AND t1.rn=(t2.rn+1);

I got the unexpected result as following:

t1.id t1.age t1.rn t2.id t2.age t2.rn t_2.id t_2.height 01 1 1 01 1 1 NULL NULL 01 1 1 01 10 2 NULL NULL 01 1 1 02 11 1 NULL NULL 01 1 1 02 2 2 NULL NULL 01 10 2 01 1 1 01 170 01 10 2 01 10 2 NULL NULL 01 10 2 02 11 1 NULL NULL 01 10 2 02 2 2 NULL NULL 02 11 1 01 1 1 NULL NULL 02 11 1 01 10 2 NULL NULL 02 11 1 02 11 1 NULL NULL 02 11 1 02 2 2 NULL NULL 02 2 2 01 1 1 NULL NULL 02 2 2 01 10 2 NULL NULL 02 2 2 02 11 1 02 172 02 2 2 02 2 2 NULL NULL

It seems that the condition which I move to the last line doesn't work anymore. It bothers me for a long time. Do hope I can hear any valuable answers, thx for anyone who provides me with answers in advance.

Adam Bacon
  • 55
  • 2
  • 8

1 Answers1

0

In your second query LEFT JOIN with t2 without ON condition is transformed to CROSS JOIN. This is why you have duplication. For id=01 you have two rows in subquery t1 and 2 rows in t2 initially, this CROSS join gives you 2x2=4 rows.

And the ON condition works, but it is applied only to the last LEFT join with t_2 subquery, this condition is being checked only to determine which rows to join in the last join, not all joins, it does not affect first CROSS JOIN (LEFT JOIN without ON condition) at all.

Every join should have it's own ON condition, except cross joins.

See also this answer about joins without ON condition behavior: https://stackoverflow.com/a/46843832/2700344

BTW you can do the same without t2 join at all using lag or lead analytic functions for calculating values ordered by age.

Like this:

lag(height) over(partition by id order by age) -- to get previous height

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thx so much for your valuable answer, it does help me a lot. The last question, in the second query, if I change `LEFT JOIN` to `JOIN`, it turns out that the condition locates in the last line applied to t2 causing `inner join` instead of `cross join`. As it only returns 2 rows instead of 16 rows as shown above. Do you have any ideas about this? – Adam Bacon Jan 10 '19 at 03:53
  • **HQL**: `SELECT * FROM (SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t1 JOIN (SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t2 JOIN (SELECT * FROM p_test2) t_2 ON t_2.id = t1.id AND t2.id=t1.id AND t1.rn=(t2.rn+1);` – Adam Bacon Jan 10 '19 at 04:00
  • **Result**: t1.id t1.age t1.rn t2.id t2.age t2.rn t_2.id t_2.height 01 10 2 01 1 1 01 170 02 2 2 02 11 1 02 172 Time taken: 91.713 seconds, Fetched: 2 row(s) – Adam Bacon Jan 10 '19 at 04:02
  • @AdamBacon JOIN is INNER by default if not specified.Also join without condition is CROSS, if condition is in where it filters rows and finally it works as inner because does not allows nulls. Add OR t2.rn is NULL and it should work as LEFT join – leftjoin Jan 10 '19 at 05:47