
I am having some trouble with a count function. The problem comes from a left join that I am not sure I am doing correctly.

Variables are:

  1. Customer_name (buyer)
  2. Product_code (what the customer buys)
  3. Store (where the customer buys)

The datasets are:

  1. Customer_df (list of customers and product codes of their purchases)
  2. Store1_df (list of product codes per week, for Store 1)
  3. Store2_df (list of product codes per day, for Store 2)

Desired final output: a table with:

  1. col1: Customer_name;
  2. col2: Count of items purchased in store 1;
  3. col3: Count of items purchased in store 2;
  4. Filters: date range

My query looks like this:

SELECT DISTINCT
    C.customer_name,
    C.product_code,
    COUNT(S1.product_code) AS s1_sales,
    COUNT(S2.product_code) AS s2_sales
FROM customer_df C
LEFT JOIN store1_df S1 USING (product_code)
LEFT JOIN store2_df S2 USING (product_code)
GROUP BY
    customer_name, product_code
HAVING
    s1_sales > 0
    OR s2_sales > 0

The output I expect is something like this:

Customer_name  Product_code  Store1_weekly_sales  Store2_weekly_sales
Luigi          120012        4                    8
James          100022        6                    10

But instead, I get:

Customer_name  Product_code  Store1_weekly_sales  Store2_weekly_sales
Luigi          120012        290                  60
James          100022        290                  60

It works when I use COUNT(DISTINCT product_code) instead of COUNT(product_code), but I would like to avoid that, because I want to be able to aggregate over different timespans (e.g. if I do a count distinct over more than one week of data, I will not get the right numbers).

My hypotheses are:

  • I am joining the tables in the wrong way
  • There is a problem when joining two datasets with different time aggregations

What am I doing wrong?

Ricco D
  • Your hypotheses are correct, I think. It sounds like this is happening because you have multiple rows in store1_df that match using product_code. Without sample data, any solutions would be guessing. – Andrew Mar 07 '22 at 18:34
  • A `DISTINCT` together with a `GROUP BY` doesn't make a lot of sense, there's no way you can get multiple rows per group. I suppose a customer can order the same product multiple times, so I'm missing a date component in your query. A solution could be to `DISTINCT` `storex_df` in a sub-select. But I'm not really clear what your data looks like, and what you want to show – HoneyBadger Mar 07 '22 at 20:09
  • Also, a table for each store sounds like a design flaw – HoneyBadger Mar 07 '22 at 20:10
  • Please in code questions give a [mre]--cut & paste & runnable code & example input; desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. For SQL include DDL & tabular initialization code. For debug that includes the least code you can give that is code that you show is OK extended by code that you show is not OK. [ask] [Help] When you get a result you don't expect, pause your overall goal, chop to the 1st subexpression with unexpected result & say what you expected & why, justified by documentation. (Debugging fundamental.) – philipxy Mar 07 '22 at 21:42
  • This seems likely to be a common error where people want some joins, each possibly involving a different key, of some subqueries, each possibly involving join and/or aggregation, but they erroneously try to do all the joining then all the aggregating or to aggregate over previous aggregations. Write separate aggregate subqueries over appropriate rows and/or aggregate (maybe distinct) case statements picking rows of (possibly join result) tables; then join the subqueries together. – philipxy Mar 07 '22 at 21:43

1 Answer


The reason, as philipxy indicated, is a common one: you are getting a Cartesian result from your joins, which bloats your numbers. To simplify, let's consider a single customer purchasing one item from two stores. The first store has 3 purchase rows, the second store has 5. Your total count is 3 * 5 = 15, because each of the 3 rows from the first store is joined to each of the 5 rows from the second store. So, by having each store pre-aggregated per customer, there will be AT MOST one record per customer per store (and per product, as per your desired output), and the joins can no longer multiply rows.
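Here is a minimal, self-contained reproduction of that bloat, run in SQLite from Python (the table and column names are illustrative, not your real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE store1_df (product_code INT);
CREATE TABLE store2_df (product_code INT);
-- one product: 3 sales rows in store 1, 5 sales rows in store 2
INSERT INTO store1_df VALUES (120012), (120012), (120012);
INSERT INTO store2_df VALUES (120012), (120012), (120012), (120012), (120012);
""")

row = con.execute("""
SELECT COUNT(s1.product_code), COUNT(s2.product_code)
  FROM store1_df s1
  LEFT JOIN store2_df s2 USING (product_code)
""").fetchone()
print(row)  # (15, 15) -- both counts are 3 * 5, not 3 and 5
```

Each of the 3 store-1 rows matches all 5 store-2 rows, so the join produces 15 rows and both counts come out as 15.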

select
      c.customer_name,
      AllCustProducts.Product_Code,
      coalesce( PQStore1.SalesEntries, 0 ) Store1SalesEntries,
      coalesce( PQStore2.SalesEntries, 0 ) Store2SalesEntries
   from
      customer_df c
         -- now, we need all possible UNIQUE instances of 
         -- a given customer and product to prevent duplicates
         -- for subsequent queries of sales per customer and store
         JOIN
         ( select distinct customerid, product_code
              from store1_df 
           union
           select distinct customerid, product_code
              from store2_df ) AllCustProducts
            on c.customerid = AllCustProducts.customerid
            -- NOW, we can join to a pre-query of sales at store 1
            -- by customer id and product code.  You may also want to
            -- get sum( SalesDollars ) if available, just add respectively
            -- to each sub-query below.
            LEFT JOIN
            ( select
                    s1.customerid,
                    s1.product_code,
                    count(*) as SalesEntries
                 from
                    store1_df s1
                 group by
                    s1.customerid,
                    s1.product_code ) PQStore1
              on AllCustProducts.customerid = PQStore1.customerid
              AND AllCustProducts.product_code = PQStore1.product_code
            -- now, same pre-aggregation to store 2
            LEFT JOIN
            ( select
                    s2.customerid,
                    s2.product_code,
                    count(*) as SalesEntries
                 from
                    store2_df s2
                 group by
                    s2.customerid,
                    s2.product_code ) PQStore2
              on AllCustProducts.customerid = PQStore2.customerid
              AND AllCustProducts.product_code = PQStore2.product_code

No need for a GROUP BY or HAVING, since each pre-aggregate returns at most one record per unique combination. As for your need to filter by date ranges: just add a WHERE clause inside each of the AllCustProducts, PQStore1, and PQStore2 subqueries.
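To show the pattern end-to-end, here is a compact version of the query above run against tiny made-up data in SQLite, with an assumed sale_date column used for the date-range filter (all names here are assumptions, adapt them to your schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_df (customerid INT, customer_name TEXT);
CREATE TABLE store1_df (customerid INT, product_code INT, sale_date TEXT);
CREATE TABLE store2_df (customerid INT, product_code INT, sale_date TEXT);
INSERT INTO customer_df VALUES (1, 'Luigi');
-- 3 sales in store 1, 5 sales in store 2, all within one week
INSERT INTO store1_df VALUES
  (1, 120012, '2022-03-01'), (1, 120012, '2022-03-02'), (1, 120012, '2022-03-03');
INSERT INTO store2_df VALUES
  (1, 120012, '2022-03-01'), (1, 120012, '2022-03-02'),
  (1, 120012, '2022-03-03'), (1, 120012, '2022-03-04'), (1, 120012, '2022-03-05');
""")

row = con.execute("""
SELECT c.customer_name,
       a.product_code,
       COALESCE(p1.SalesEntries, 0),
       COALESCE(p2.SalesEntries, 0)
  FROM customer_df c
  JOIN ( SELECT DISTINCT customerid, product_code FROM store1_df
         UNION
         SELECT DISTINCT customerid, product_code FROM store2_df ) a
    ON c.customerid = a.customerid
  LEFT JOIN ( SELECT customerid, product_code, COUNT(*) AS SalesEntries
                FROM store1_df
               WHERE sale_date BETWEEN '2022-03-01' AND '2022-03-07'
               GROUP BY customerid, product_code ) p1
    ON a.customerid = p1.customerid AND a.product_code = p1.product_code
  LEFT JOIN ( SELECT customerid, product_code, COUNT(*) AS SalesEntries
                FROM store2_df
               WHERE sale_date BETWEEN '2022-03-01' AND '2022-03-07'
               GROUP BY customerid, product_code ) p2
    ON a.customerid = p2.customerid AND a.product_code = p2.product_code
""").fetchone()
print(row)  # ('Luigi', 120012, 3, 5) -- counts are no longer multiplied
```

Because each store is collapsed to one row per (customerid, product_code) before joining, the counts stay 3 and 5 instead of being cross-multiplied.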

DRapp