
I am having some trouble with a count function. The problem comes from a left join that I am not sure I am doing correctly.

Variables are:

  1. Customer_name (buyer)
  2. Product_code (what the customer buys)
  3. Store (where the customer buys)

The datasets are:

  1. Customer_df (list of customers and product codes of their purchases)
  2. Store1_df (list of product codes per week, for Store 1)
  3. Store2_df (list of product codes per day, for Store 2)

Desired final output: a table with:

  1. col1: Customer_name;
  2. col2: Count of items purchased in store 1;
  3. col3: Count of items purchased in store 2;
  4. Filters: date range

My query looks like this:

SELECT DISTINCT
    C.customer_name,
    C.product_code,
    COUNT(S1.product_code) AS s1_sales,
    COUNT(S2.product_code) AS s2_sales
FROM customer_df C
LEFT JOIN store1_df S1 USING (product_code)
LEFT JOIN store2_df S2 USING (product_code)
GROUP BY
    customer_name, product_code
HAVING
    s1_sales > 0
    OR s2_sales > 0

The output I expect is something like this:

Customer_name  Product_code  Store1_weekly_sales  Store2_weekly_sales
Luigi          120012        4                    8
James          100022        6                    10

But instead, I get:

Customer_name  Product_code  Store1_weekly_sales  Store2_weekly_sales
Luigi          120012        290                  60
James          100022        290                  60

It works when I use COUNT(DISTINCT product_code) instead of COUNT(product_code), but I would like to avoid that, because I want to be able to aggregate over different timespans (e.g. if I do a count distinct over more than one week of data, I will not get the right numbers).

My hypotheses are:

  • I am joining the tables in the wrong way
  • There is a problem when joining two datasets with different time aggregations

What am I doing wrong?

Ricco D
  • Your hypotheses are correct, I think. It sounds like this is happening because you have multiple rows in store1_df that match using product_code. Without sample data, any solutions would be guessing. – Andrew Mar 07 '22 at 18:34
  • A `DISTINCT` together with a `GROUP BY` doesn't make a lot of sense, there's no way you can get multiple rows per group. I suppose a customer can order the same product multiple times, so I'm missing a date component in your query. A solution could be to `DISTINCT` `storex_df` in a sub-select. But I'm not really clear what your data looks like, and what you want to show – HoneyBadger Mar 07 '22 at 20:09
  • Also, a table for each store sounds like a design flaw – HoneyBadger Mar 07 '22 at 20:10
  • Please in code questions give a [mre]--cut & paste & runnable code & example input; desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. For SQL include DDL & tabular initialization code. For debug that includes the least code you can give that is code that you show is OK extended by code that you show is not OK. [ask] [Help] When you get a result you don't expect, pause your overall goal, chop to the 1st subexpression with unexpected result & say what you expected & why, justified by documentation. (Debugging fundamental.) – philipxy Mar 07 '22 at 21:42
  • This seems likely to be a common error where people want some joins, each possibly involving a different key, of some subqueries, each possibly involving join and/or aggregation, but they erroneously try to do all the joining then all the aggregating or to aggregate over previous aggregations. Write separate aggregate subqueries over appropriate rows and/or aggregate (maybe distinct) case statements picking rows of (possibly join result) tables; then join the subqueries together. – philipxy Mar 07 '22 at 21:43

1 Answer


The reason, as philipxy indicated, is a common one: you are getting a Cartesian result from your joins, which bloats your numbers. To simplify, let's consider a single customer purchasing one item from two stores. The first store has 3 purchase rows, the second store has 5. Your total count is 3 * 5 = 15, because each of the 3 rows from the first store is joined to each of the 5 rows from the second store. So, by having each store pre-aggregated per customer, there will be AT MOST one record per customer per store (and per product, as per your desired output), and the joins can no longer multiply rows.
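Here is a minimal, self-contained reproduction of that bloat, run in SQLite from Python (the table and column names are illustrative, not your real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE store1_df (product_code INT);
CREATE TABLE store2_df (product_code INT);
-- one product: 3 sales rows in store 1, 5 sales rows in store 2
INSERT INTO store1_df VALUES (120012), (120012), (120012);
INSERT INTO store2_df VALUES (120012), (120012), (120012), (120012), (120012);
""")

row = con.execute("""
SELECT COUNT(s1.product_code), COUNT(s2.product_code)
  FROM store1_df s1
  LEFT JOIN store2_df s2 USING (product_code)
""").fetchone()
print(row)  # (15, 15) -- both counts are 3 * 5, not 3 and 5
```

Each of the 3 store-1 rows matches all 5 store-2 rows, so the join produces 15 rows and both counts come out as 15.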

select
      c.customer_name,
      AllCustProducts.Product_Code,
      coalesce( PQStore1.SalesEntries, 0 ) Store1SalesEntries,
      coalesce( PQStore2.SalesEntries, 0 ) Store2SalesEntries
   from
      customer_df c
         -- now, we need all possible UNIQUE instances of 
         -- a given customer and product to prevent duplicates
         -- for subsequent queries of sales per customer and store
         JOIN
         ( select distinct customerid, product_code
              from store1_df 
           union
           select distinct customerid, product_code
              from store2_df ) AllCustProducts
            on c.customerid = AllCustProducts.customerid
            -- NOW, we can join to a pre-query of sales at store 1
            -- by customer id and product code.  You may also want to
            -- get sum( SalesDollars ) if available, just add respectively
            -- to each sub-query below.
            LEFT JOIN
            ( select
                    s1.customerid,
                    s1.product_code,
                    count(*) as SalesEntries
                 from
                    store1_df s1
                 group by
                    s1.customerid,
                    s1.product_code ) PQStore1
              on AllCustProducts.customerid = PQStore1.customerid
              AND AllCustProducts.product_code = PQStore1.product_code
            -- now, same pre-aggregation to store 2
            LEFT JOIN
            ( select
                    s2.customerid,
                    s2.product_code,
                    count(*) as SalesEntries
                 from
                    store2_df s2
                 group by
                    s2.customerid,
                    s2.product_code ) PQStore2
              on AllCustProducts.customerid = PQStore2.customerid
              AND AllCustProducts.product_code = PQStore2.product_code

No need for a GROUP BY or HAVING, since each pre-aggregate returns at most one record per unique combination. As for your need to filter by date ranges: just add a WHERE clause inside each of the AllCustProducts, PQStore1, and PQStore2 subqueries.
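To show the pattern end-to-end, here is a compact version of the query above run against tiny made-up data in SQLite, with an assumed sale_date column used for the date-range filter (all names here are assumptions, adapt them to your schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_df (customerid INT, customer_name TEXT);
CREATE TABLE store1_df (customerid INT, product_code INT, sale_date TEXT);
CREATE TABLE store2_df (customerid INT, product_code INT, sale_date TEXT);
INSERT INTO customer_df VALUES (1, 'Luigi');
-- 3 sales in store 1, 5 sales in store 2, all within one week
INSERT INTO store1_df VALUES
  (1, 120012, '2022-03-01'), (1, 120012, '2022-03-02'), (1, 120012, '2022-03-03');
INSERT INTO store2_df VALUES
  (1, 120012, '2022-03-01'), (1, 120012, '2022-03-02'),
  (1, 120012, '2022-03-03'), (1, 120012, '2022-03-04'), (1, 120012, '2022-03-05');
""")

row = con.execute("""
SELECT c.customer_name,
       a.product_code,
       COALESCE(p1.SalesEntries, 0),
       COALESCE(p2.SalesEntries, 0)
  FROM customer_df c
  JOIN ( SELECT DISTINCT customerid, product_code FROM store1_df
         UNION
         SELECT DISTINCT customerid, product_code FROM store2_df ) a
    ON c.customerid = a.customerid
  LEFT JOIN ( SELECT customerid, product_code, COUNT(*) AS SalesEntries
                FROM store1_df
               WHERE sale_date BETWEEN '2022-03-01' AND '2022-03-07'
               GROUP BY customerid, product_code ) p1
    ON a.customerid = p1.customerid AND a.product_code = p1.product_code
  LEFT JOIN ( SELECT customerid, product_code, COUNT(*) AS SalesEntries
                FROM store2_df
               WHERE sale_date BETWEEN '2022-03-01' AND '2022-03-07'
               GROUP BY customerid, product_code ) p2
    ON a.customerid = p2.customerid AND a.product_code = p2.product_code
""").fetchone()
print(row)  # ('Luigi', 120012, 3, 5) -- counts are no longer multiplied
```

Because each store is collapsed to one row per (customerid, product_code) before joining, the counts stay 3 and 5 instead of being cross-multiplied.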

DRapp