2

In Apache Hive I have to tables I would like to left-join keeping all the data from the left data and adding data where possible from the right table. For this I use two joins, because the join is based on two fields (a material_id and a location_id). This works fine with two traditional left joins:

SELECT 
   a.*, 
   b.*
FROM a
INNER JOIN (some more complex select) b
   ON a.material_id=b.material_id 
   AND a.location_id=b.location_id;

For the location_id the database only contains two distinct values, say 1 and 2.

We now have the requirement that if there is no "perfect match", this means that only the material_id can be joined and there is no correct combination of material_id and location_id (e.g. material_id=100 and location_id=1) for the join for the location_id in the b-table, the join should "default" or "fallback" to the other possible value of the location_id e.g. material_id=001 and location_id=2 and vice versa. This should only be the case for the location_id.

We have already looked into all possible answers also with CASE etc. but to no prevail. A setup like

...
ON a.material_id=b.material_id AND a.location_id=
CASE WHEN a.location_id = b.location_id THEN b.location_id ELSE ...;

we tried or did not figure out how really to do in hive query language.

Thank you for your help! Maybe somebody has a smart idea.

Here is some sample data:

Table a
| material_id | location_id | other_column_a |
| 100         | 1           | 45            |
| 101         | 1           | 45            |
| 103         | 1           | 45            |
| 103         | 2           | 45            |



Table b
| material_id | location_id | other_column_b |
| 100         | 1           | 66            |
| 102         | 1           | 76            |
| 103         | 2           | 88            |


Left - Join Table
| material_id | location_id | other_column_a | other_column_b
| 100         | 1           | 45            | 66
| 101         | 1           | 45            | NULL (mat. not in b)
| 103         | 1           | 45            | DEFAULT TO where location_id=2 (88)
| 103         | 2           | 45            | 88

PS: As stated here exists etc. does not work in the sub-query ON.

Community
  • 1
  • 1
alpcoder
  • 61
  • 1
  • 6

2 Answers2

0

The solution is to left join without a.location_id = b.location_id and number all rows in order of preference. Then filter by row_number. In the code below the join will duplicate rows first because all matching material_id will be joined, then row_number() function will assign 1 to rows where a.location_id = b.location_id and 2 to rows where a.location_id <> b.location_id if exist also rows where a.location_id = b.location_id and 1 if there are not exist such. b.location_id added to the order by in the row_number() function so it will "prefer" rows with lower b.location_id in case there are no exact matching. I hope you have caught the idea.

select * from 
(
SELECT 
   a.*, 
   b.*,
   row_number() over(partition by material_id 
                     order by CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END, b.location_id ) as rn
FROM a
LEFT JOIN (some more complex select) b
   ON a.material_id=b.material_id 
)s 
where rn=1
;
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • The approach specified by you indeed worked, this is why I marked it helpful. However, our requirement has now changed for the location not to be hardcoded. I will post our current solution. – alpcoder Sep 29 '16 at 09:31
  • Could you please explain what is hardcoded in this code? Location_id values are not hardcoded and can be any – leftjoin Sep 29 '16 at 09:37
  • From my understanding the THEN 1 ELSE 2 is hardcoded, how would this work with more locations? BUT YES your answer is the answer to my initial question and works fine. Thank you very much! – alpcoder Sep 29 '16 at 09:41
  • Then I missunderstood this part of the solution. It is working then perfectly fine in our case and we will evaluate the performance of both approaches. Thank you again! – alpcoder Sep 29 '16 at 09:44
0

Maybe this is helpful for somebody in the future:

We also came up with a different approach.

First, we create another table to calculate averages from the table b based on material_id over all (!) locations.

Second, In the join table we create three columns: c1 - the value where material_id and location_id are matching (result from a left join of table a with table b). This column is null if there is no perfect match.

c2 - the value from the table where we write the number from the averages (fallback) table for this material_id (regardless of the location)

c3 - the "actual value" column where we use a case statement to decide if when the column 1 is NULL (there is no perfect match of material and location) then we use the value from column 2 (the average over all the other locations for the material) for the further calculations.

alpcoder
  • 61
  • 1
  • 6
  • THEN 1 ELSE 2 - these are not locations IDs. It's just values for ORDER in ROW_NUMBER() to mark rows. Rows marked with 1 will be ordered first, row marked with 2 will be ordered second, then filter at the end will filter rn=1. There is no any hardcoded column values. – leftjoin Sep 29 '16 at 09:42