I am having some troubles with a count function. The problem is given by a left join that I am not sure I am doing correctly.
Variables are:
- Customer_name (buyer)
- Product_code (what the customer buys)
- Store (where the customer buys)
The datasets are:
- Customer_df (list of customers and product codes of their purchases)
- Store1_df (list of product codes per week, for Store 1)
- Store2_df (list of product codes per day, for Store 2)
Final output desired: I would like to have a table with:
- col1: Customer_name;
- col2: Count of items purchased in store 1;
- col3: Count of items purchased in store 2;
- Filters: date range
My query looks like this:
SELECT
DISTINCT
C_customer_name,
C.product_code,
COUNT(S1.product_code) AS s1_sales,
COUNT(S2.product_code) AS s2_sales,
FROM customer_df C
LEFT JOIN store1_df S1 USING(product_code)
LEFT JOIN store2_df S2 USING(product_code)
GROUP BY
customer_name, product_code
HAVING
S1_sales > 0
OR S2_sales > 0
The output I expect is something like this:
Customer_name | Product_code | Store1_weekly_sales | Store2_weekly_sales |
---|---|---|---|
Luigi | 120012 | 4 | 8 |
James | 100022 | 6 | 10 |
But instead, I get:
Customer_name | Product_code | Store1_weekly_sales | Store2_weekly_sales |
---|---|---|---|
Luigi | 120012 | 290 | 60 |
James | 100022 | 290 | 60 |
It works when instead of COUNT(product_code) I do COUNT(DSITINCT product_code) but I would like to avoid that because I would like to be able to aggregate on different timespans (e.g. if I do count distinct and take into account more than 1 week of data I will not get the right numbers)
My hypothesis are:
- I am joining the tables in the wrong way
- There is a problem when joining two datasets with different time aggregations
What am I doing wrong?