1

I'm trying to create a query to retrieve all customers who purchased one SKU ('Red399') but did not also purchase a second SKU ('Red323'). It seems like the best way to do this is by using NOT EXISTS with a subquery to filter out people who bought 'Red323'.

My query isn't returning any errors, but it also isn't returning any results, and I think it might be because I have too many conditions in the initial WHERE clause but I'm not sure:

SELECT DISTINCT o."FirstName", o."LastName", o."Email", ol."SKU"
FROM flight_export_order o
JOIN flight_export_orderline ol
    ON o."OrderDisplayID" = ol."OrderDisplayID"
WHERE ol."SKU" = 'Red399'
AND o."OrderDate" BETWEEN '07/22/2020' AND '08/03/2020'
AND NOT EXISTS 
(SELECT DISTINCT o."Email" 
    FROM flight_export_order o
    JOIN flight_export_orderline ol
        ON o."OrderDisplayID" = ol."OrderDisplayID"
    WHERE ol."SKU" = 'Red323'
    AND o."OrderDate" BETWEEN '07/22/2020' AND '08/03/2020')
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
ekeckeisen
  • 75
  • 8

2 Answers2

1

You don't need the subquery.
You can group by customer and set the conditions in the having clause:

SELECT o."FirstName", o."LastName", o."Email"
FROM flight_export_order o INNER JOIN flight_export_orderline ol
ON o."OrderDisplayID" = ol."OrderDisplayID"
WHERE ol."SKU" IN ('Red399', 'Red323')
  AND o."OrderDate" BETWEEN '07/22/2020' AND '08/03/2020'
GROUP BY o."FirstName", o."LastName", o."Email"
HAVING SUM((ol."SKU" = 'Red399')::int) > 0
   AND SUM((ol."SKU" = 'Red323')::int) = 0
forpas
  • 160,666
  • 10
  • 38
  • 76
  • Thank you, this works! Just to clarify so I understand why this works- you are getting the numbers of each bottle purchased by casting the SKU as an int, correct? – ekeckeisen Sep 09 '20 at 14:34
  • For example: the boolean expression `ol."SKU" = 'Red399'` is casted to an int: 0 or 1 for false and true. So by summing these 0s and 1s if you get a number > 0 then there was a purchase of `'Red399'`. – forpas Sep 09 '20 at 14:37
1

You were on the right track to begin with. EXISTS / NOT EXISTS are the right tools:

SELECT o.* --  or just the columns you need
FROM   flight_export_order o
WHERE  o."OrderDate" BETWEEN '2020-07-22' AND '2020-08-03'
AND    EXISTS (
   SELECT FROM flight_export_orderline
   WHERE  "OrderDisplayID" = o."OrderDisplayID"
   AND    "SKU" = 'Red399'
   )
AND    NOT EXISTS (
   SELECT FROM flight_export_orderline
   WHERE  "OrderDisplayID" = o."OrderDisplayID"
   AND    "SKU" = 'Red323'
   );

With a multicolumn index on flight_export_orderline ("OrderDisplayID", "SKU"), this is as fast as it gets. An index on just ("OrderDisplayID") (like you probably have) goes a long way, too.

Plus an index on flight_export_order("OrderDate"), obviously.

I see no need for any expensive aggregating or DISTINCT. See:

Aside 1: try to avoid quoted CaMeL-case identifiers in Postgres if you can. See:

Aside 2: It's reommended to use ISO 8601 date format (YYYY-MM-DD), which is always unambiguous and independent of locale and session settings.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228