In mysql/SparkSQL we have first
function. which doesn't exist in Redshift.
I have to change code
SELECT
product_id,
first(product_code) as product_code,
first(product_name) as product_name,
first(time_date) as time_date,
max(price_max) as price_max,
min(price_min) as price_min,
sum(count_of_sales) as count_of_sales,
SUM(CASE WHEN time_date = 1538323200000 THEN cost_of_stock_start ELSE 0 END) as cost_of_stock_start,
from storeproductincomelogs
WHERE time_date>= 1538323200000
AND time_date<= 1541001600000
group by product_id;
to
SELECT
product_id,
product_code,
product_name,
min(time_date) as time_date, # have to change first to min, this column can't group by
max(price_max) as price_max,
min(price_min) as price_min,
sum(count_of_sales) as count_of_sales,
SUM(CASE WHEN time_date = 1538323200000 THEN cost_of_stock_start ELSE 0 END) as cost_of_stock_start,
from storeproductincomelogs
WHERE time_date>= 1538323200000
AND time_date<= 1541001600000
group by product_id,product_code,product_name;
Without first
, I have to add product_code,product_name
in group by clasuse.
Or would get error:
Invalid operation: column "storeproductincomelogs.product_code" must appear in the GROUP BY clause or be used in an aggregate function;
Note: here product_id,product_code
are unique in each row, and duplicated product_name
is also hardly found (but may be appear in future so I think I can't use group by).
I searched mysql first
equivalent in postgresql and got Select first row in each GROUP BY group? .
First, I tried Distinct on
clause which is not supported in Redshift.
Then, I tried:
WITH summary AS (
SELECT product_id,
product_code,
product_name,
min(time_date) as time_date,
max(price_max) as price_max,
sum(count_of_sales) as count_of_sales,
SUM(CASE WHEN time_date = 1538323200000 THEN cost_of_stock_start ELSE 0 END) as cost_of_stock_start,
ROW_NUMBER() OVER(PARTITION BY product_id ) AS rk
FROM storeproductincomelogs)
SELECT *
FROM summary
WHERE rk = 1;
got error
[42803][500310] Amazon Invalid operation: column "storeproductincomelogs.product_id" must appear in the GROUP BY clause or be used in an aggregate function;
I don't know how to write correct one, so can't test the performance.
How can I do this in Redshift?