1

I'm banging my head against the wall.. I have 2 tables, orders and batches. I'm using PostgreSQL 9.

These queries work and return results as I'd expect:

Query #1 (works appropriately, returning order #12, #8, etc.):

SELECT "orders".* 
FROM "orders" 
    INNER JOIN "batches" ON "batches"."order_id" = "orders"."id" 
WHERE (order_id not IN (14,4,13,5,2,9,3,7,11));

Query #2 (works appropriately, returning unique order IDs that have batches with an estimated end date later than 2019-08-06 - the same as the numbers listed above):

SELECT DISTINCT "batches"."order_id" 
FROM "batches" 
WHERE (estimated_end_date >= '2019-08-06');

However, if I put select query #2 in place of the number list, it doesn't return the same results as query #1 as I'd expect:

SELECT "orders".* 
FROM "orders" 
    INNER JOIN "batches" ON "batches"."order_id" = "orders"."id" 
WHERE (order_id not IN (SELECT DISTINCT "batches"."order_id"          
                                FROM "batches" 
                                WHERE (estimated_end_date >= '2019-08-06')))

What am I doing wrong? How do I get the nested subquery to work properly? THANK YOU!

karns
  • 5,391
  • 8
  • 35
  • 57

4 Answers4

3

Add the condition:

"batches"."order_id" IS NOT NULL

in the subquery to avoid comparing order_id against nulls:

SELECT "service_orders".* 
FROM "service_orders" INNER JOIN "batches" 
ON "batches"."order_id" = "orders"."id" 
WHERE order_id not IN (
  SELECT DISTINCT "batches"."order_id"          
  FROM "batches" 
  WHERE estimated_end_date >= '2019-08-06' AND "batches"."order_id" IS NOT NULL
)
forpas
  • 160,666
  • 10
  • 38
  • 76
  • 2
    Thank you, this did the trick!! For some reason I overlooked this since I was thinking only in terms of my exclusive tests which returned just the set of numbers I expected. Also, batches didn't have null order_id in development, only in production since the order table was new.. – karns Aug 06 '19 at 19:00
2

Do not use not in with subqueries. If any of the returned values are null, then you will get no results at all. Use not exists instead:

SELECT so.* 
FROM service_orders so INNER JOIN
     batches b
     ON b.service_order_id = so.id
WHERE NOT EXISTS (SELECT          
                  FROM batches b2
                  WHERE b2.estimated_end_date >= '2019-08-06' AND
                        b2.service_order_id = b.service_order_id
                );

There are simpler ways to express this logic, such as using window functions:

SELECT so.* 
FROM service_orders so INNER JOIN
     (SELECT b.*,
             MAX(b.estimated_end_date) OVER (PARTITION BY b.service_order_id) as max_eed
      FROM batches b
     ) b
     ON b.service_order_id = so.id
WHERE max_eed <= '2019-08-06' ;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Would there be a reason to use not exists instead of using IS NOT NULL as proposed in the other answer? Also, appreciate the alternative solution - is there a reason why this would be preferable? – karns Aug 06 '19 at 18:59
  • 3
    Yes, it can also be optimized better: https://wiki.postgresql.org/wiki/Don't_Do_This#Don.27t_use_NOT_IN – Jeremy Aug 06 '19 at 20:48
1

You can convert this query

SELECT "orders".* 
FROM orders 
    INNER JOIN "batches" ON "batches"."service_order_id" = "service_orders"."id" 
WHERE (service_order_id not IN (SELECT DISTINCT "batches"."service_order_id"          
                                FROM "batches" 
                                WHERE (estimated_end_date >= '2019-08-06')))

to a join

SELECT orders.* 
FROM orders 
INNER JOIN "batches" ON "batches"."service_order_id" = orders.id 
LEFT JOIN batches AS S ON s.service_order_id = orders.id and s.estimated_end_date >= '2019-08-06'
WHERE S.service_order_id is null
Hogan
  • 69,564
  • 10
  • 76
  • 117
0

Your subquery is probably returning a NULL result on at least one of the records. As a rule, when using NOT IN and a subquery, you must prepare for a null result or the whole query will return zero records.

User Brannon explains it perfectly over here: NOT IN clause and NULL values

To fix this, I like to use Coelesce in the subquery's SELECT statement to replace any existing NULL with a value that's irrelevant. This way you don't have to limit the subquery by a statement in the WHERE clause.

SELECT "orders".* 
FROM "orders" 
    INNER JOIN "batches" ON "batches"."order_id" = "orders"."id" 
WHERE (order_id not IN (SELECT DISTINCT coalesce("batches"."order_id",-1)
                                FROM "batches" 
                                WHERE (estimated_end_date >= '2019-08-06')))

You know your data but I'd be willing to bet there's no way a -1 could exist in the order_id field, so now, any null that's returned in the subquery is replaced with a "-1" and your query will complete.