1

I'm looking for some help understanding this error I'm getting in BigQuery:

LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.

I'm trying to use a case statement to alter the rows selected for joining depending on a value in the left table's row. I am doing something similar in some other places and it works, so part of me thinks I may be making a mistake regarding table alias and column names, but I can't figure it out. Here is a minimal example of what I'm trying to do:

WITH t1 AS (
  SELECT "milk" AS dairy,
   1 AS id,
   2 AS other_id

   UNION ALL

   SELECT "yogurt" AS dairy,
   3 AS id,
   4 AS other_id

   UNION ALL

   SELECT "cheese" AS dairy,
   5 AS id,
   6 AS other_id
),

t2 AS (
  SELECT "blue" AS color,
  1 AS id

  UNION ALL

  SELECT "red" AS color,
  4 AS id
)

SELECT
  t1.*, t2
FROM t1
LEFT JOIN t2 ON
  CASE
    WHEN t1.dairy = 'milk' THEN t1.id = t2.id
    WHEN t1.dairy = 'yogurt' THEN t1.other_id = t2.id
  END

The result I would like to see is:

enter image description here

As you can see in the desired result, when the value for dairy is milk, I want the id from t2 to equal the id column in t1, but when the value for dairy is yogurt, I want the id from t2 to equal the other_id column in t1.

I've been searching around for an explanation but can't figure it out. I also tried the solution offered here, but got the same error, which is why I think I am simply messing something up with tables names or aliases.

Please help!

UPDATE

I was able to get rid of the error by rewriting the case statement this way:

SELECT
  t1.*, t2
FROM t1
LEFT JOIN t2 ON
  CASE
    WHEN t1.dairy = 'milk' THEN t1.id
    WHEN t1.dairy = 'yogurt' THEN t1.other_id
  END = t2.id

However, in my real problem I need to join a third table in a similar fashion. If t2.color is blue, I want to join based on t2.id = t3.id, but if t2.color is red I want to join based on t2.id = t3.other_id. As soon as I do that, the same error occurs. Here is the full example of my attempt:

WITH t1 AS (
  SELECT "milk" AS dairy,
   1 AS id,
   2 AS other_id

   UNION ALL

   SELECT "yogurt" AS dairy,
   3 AS id,
   4 AS other_id

   UNION ALL

   SELECT "cheese" AS dairy,
   5 AS id,
   6 AS other_id
),

t2 AS (
  SELECT "blue" AS color,
  1 AS id

  UNION ALL

  SELECT "red" AS color,
  4 AS id
),

t3 AS (
  SELECT "sunny" AS weather,
  1 AS id,
  10 AS other_id

  UNION ALL

  SELECT "cloudy" AS weather,
  11 AS id,
  4 AS other_id
)

SELECT
  t1.*, t2, t3
FROM t1
LEFT JOIN t2 ON
  CASE
    WHEN t1.dairy = 'milk' THEN t1.id
    WHEN t1.dairy = 'yogurt' THEN t1.other_id
  END = t2.id
LEFT JOIN t3 ON
  CASE
   WHEN t2.color = 'blue' THEN t3.id
   WHEN t2.color = 'red' THEN t3.other_id
  END = t2.id

But now the same error occurs:

LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.

If I remove the joining of t3, it works correctly. Here are some more images of the tables and desired result in case it helps:

enter image description here

flyingL123
  • 7,686
  • 11
  • 66
  • 135
  • Only a guess, but I would assume BQ doesn't like conditional joins because it can't accurately estimate the cost of your query (especially if your right-side table has partitioning/clustering). The solution provided below works, but another solution using basic principles would be to do 2 separate joins on each ID and then use a `case` statement in your `select` statement. – rtenha Feb 05 '20 at 21:09
  • @rtenha but why does it work with one conditional join? Something isn't making sense. – flyingL123 Feb 05 '20 at 22:04
  • Your scenario with 1 conditional join (the updated version) actually satisfies my theory above. The join is saying it will definitely need to scan t1 and t2. Your original join with t2 inside the `case` statement doesn't define how much of t2 will be scanned. – rtenha Feb 05 '20 at 23:41
  • @rtenha but what about the one with two joins? Shouldn’t that work too if your theory was correct? – flyingL123 Feb 05 '20 at 23:50
  • it is the same issue. t3 is inside the case statement, making it conditional. It is unable to estimate how much of t3 it will need to scan. – rtenha Feb 06 '20 at 16:49
  • @rtenha Sorry, I am not following. If it works with one join why would it not work when I add the second one if both joins are using the same type of conditional logic? – flyingL123 Feb 06 '20 at 20:07
  • In your query with both joins, you are definitely scanning `t1` because that is your `from` table. Then in your first join, you are definitely scanning `t2` because it is explicitly calling `t2.id` on the right-hand side of your equality, regardless of `t1` id logic. In the next join, you have conditional logic for `t3`, where you might need to join on `t3.id` or on `t3.other_id` (inside your case statement). BQ can't properly estimate how it will scan `t3` because of this. You would need to end with something 'definite' like `= t3.id` for the equality condition to be satisified. – rtenha Feb 06 '20 at 20:17
  • @rtenha I see. Thanks for the explanation. In other words, in the first query the condition is for a column in the left table. In the second query the condition is for a column in the table being joined (the right table), and that is causing the issue. – flyingL123 Feb 06 '20 at 20:27

2 Answers2

1

Below is for BigQuery Standard SQL

#standardSQL
SELECT *,
  ARRAY(
    SELECT AS STRUCT *  
    FROM t2 b
    WHERE b.id IN (a.id, a.other_id) 
    ORDER BY (
      CASE
        WHEN dairy IN ('milk', 'yogurt') THEN 1
        ELSE 2
      END    
    )
    LIMIT 1
  )[SAFE_OFFSET(0)] AS t2  
FROM t1 a  

If to apply to sample/dummy data from your question - result is

Row dairy   id  other_id    t2.color    t2.id    
1   milk    1   2           blue        1    
2   yogurt  3   4           red         4    
3   cheese  5   6           
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
1

I was able to answer your updated question with 3 tables by breaking the joins and associated logic into separate CTEs.

WITH t1 AS (
  SELECT "milk" AS dairy, 1 AS id, 2 AS other_id UNION ALL
  SELECT "yogurt", 3, 4 UNION ALL
  SELECT "cheese", 5, 6
),
t2 AS (
  SELECT "blue" AS color, 1 AS id UNION ALL
  SELECT "red", 4
),
t3 AS (
  SELECT "sunny" AS weather, 1 as id, 10 as other_id UNION ALL
  SELECT "cloudy", 11, 4
),
join_t1_t2 as (
  select
    t1.*,
    case 
      when t1.dairy = 'milk' then milk.color
      when t1.dairy = 'yogurt' then yogurt.color
      else null
    end as t2_color,
    case 
      when t1.dairy = 'milk' then milk.id
      when t1.dairy = 'yogurt' then yogurt.id
      else null
    end as t2_id
  from t1
  left join t2 milk on t1.id = milk.id
  left join t2 yogurt on t1.other_id = yogurt.id
),
join_t1_t2_t3 as (
  select
    join_t1_t2.*,
    case 
      when t2_color = 'blue' then blue.id
      when t2_color = 'red' then red.id
      else null
    end as t3_id,
    case 
      when t2_color = 'blue' then blue.other_id
      when t2_color = 'red' then red.other_id
      else null
    end as t3_other_id,
    case 
      when t2_color = 'blue' then blue.weather
      when t2_color = 'red' then red.weather
      else null
    end as t3_weather,
  from join_t1_t2
  left join t3 blue on t2_id = blue.id
  left join t3 red on t2_id = red.other_id
)
select * from join_t1_t2_t3
rtenha
  • 3,349
  • 1
  • 6
  • 19