0

I have two queries running in BQ. I expect them having the same results, but it turns out not. I don't understand why it happens.

Query 1:

select a.* from
(
  select *, _PARTITIONTIME as pt from
  `table_A`
  where 
  p > 5 and q != 0
) as a
left join
(select distinct x, pt from `table_b`) as b
on join_clauses

Query 2:

select a.* from
table_A as a
left join
(select distinct x, pt from table_b) as b
on join_clauses
and a.p > 5 and a.q != 0

If I remove the conditions of a.p > 5 and a.q != 0 from both queries, the results are identical.

As far as I know, the conditions under the on statements is evaluated before the join, which should make the two queries equal. Then I don't know why do I see different results here. Is there any special setup in Google's BQ?

user2830451
  • 2,126
  • 5
  • 25
  • 31
  • 1
    What do you mean by "under the on statements"? How could the join condition be evaluated prior to the join...? – Elliott Brossard Sep 16 '19 at 05:27
  • Its because you are adding the `and a.p > 5 and a.q != 0` as a join constraint ( as compared to a `where` clause) in the second query. You should keep this constraint under a `where` clause too. – khan Sep 16 '19 at 06:01
  • Learn what LEFT JOIN ON returns: INNER JOIN ON rows UNION ALL unmatched left table rows extended by NULLs. Always know what INNER JOIN you want as part of an OUTER JOIN. PS A WHERE or INNER JOIN ON that requires a right [sic] table column to be not NULL after an OUTER JOIN ON removes any rows extended by NULLs, ie leaves only INNER JOIN ON rows, ie "turns OUTER JOIN into INNER JOIN". PS Saying why it happens requires giving & using operator definitions. Do that & tell us where you get a subexpression result you don't expect. A zillion presentations of joins are out there including on SO. – philipxy Sep 16 '19 at 07:35
  • Possible duplicate of [SQL join: where clause vs. on clause](https://stackoverflow.com/questions/354070/sql-join-where-clause-vs-on-clause) – philipxy Sep 16 '19 at 07:44
  • Please in code questions give a [mre]--cut & paste & runnable code; example input (as initialization code) with desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. That includes the least code you can give that is code that you show is OK extended by code that you show is not OK. (Debugging fundamental.) For SQL that includes DBMS/product & DDL, which includes constraints & indexes & tabular-formatted base table initialization. PS [What joins are.](https://stackoverflow.com/a/57767013/3404097) – philipxy Sep 16 '19 at 08:07

2 Answers2

1

In your Query#2 - result will contain ALL rows from table_A being left join to second sub-query. So, for each row in table_a output will have as many respective rows as many matches of on join_clauses and a.p > 5 and a.q != 0 exists in second sub-query - BUT at least one (in case if no matches for that row - this is because of LEFT JOIN used)

In Query #1 - first you filter out from table_a all rows which don't match WHERE clause p > 5 and q != 0 and only then you apply on join_clauses so you start with subset of rows in table_a thus the final output has lower rows than for Query #2

Run these two queries which are simplified version of yours to see this

Analogue of Query #2

#standardSQL
WITH table_a AS (
  SELECT 1 id, 1 p, 0 q UNION ALL
  SELECT 1, 6, 1 UNION ALL
  SELECT 2, 7, 2
), table_b AS (
  SELECT 1 id UNION ALL
  SELECT 2
)
SELECT a.*, b.id idb 
FROM table_a a
LEFT JOIN table_b b
ON a.id = b.id
AND a.p > 5 AND a.q != 0  

with result

Row id  p   q   idb  
1   1   1   0   null     
2   1   6   1   1    
3   2   7   2   2      

and analogue of Query #1

#standardSQL
WITH table_a AS (
  SELECT 1 id, 1 p, 0 q UNION ALL
  SELECT 1, 6, 1 UNION ALL
  SELECT 2, 7, 2
), table_b AS (
  SELECT 1 id UNION ALL
  SELECT 2
)
SELECT a.*, b.id idb 
FROM (
  SELECT * FROM table_a 
  WHERE p > 5 AND q != 0
) a
LEFT JOIN table_b b
ON a.id = b.id

with result

Row id  p   q   idb  
1   1   6   1   1    
2   2   7   2   2    

comparing those two - you can easily see the difference in result and hopefully see now why :o)

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
0

This has nothing to do with BigQuery, only with your understanding of LEFT JOIN.

A LEFT JOIN keeps all rows in the first table, regardless of whether the ON clause evaluates to true, false, or NULL.

That means that filters on the first table have no impact in the FROM clause. Well, they do have an impact -- otherwise matching rows from the second table will have NULL values. However, hardly anyone notices or desires that effect.

The rules for outer joins are simple:

  • For LEFT JOIN, filters on the first table go in the WHERE clause.
  • For LEFT JOIN, filters on subsequent tables go in the ON clause (otherwise the outer join becomes an inner join).
  • For RIGHT JOIN . . . replace with LEFT JOIN. (This can usually be done and most people find the query to be easier to follow.)
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786