
In MySQL/SparkSQL there is a first() function, which doesn't exist in Redshift.

I have to change this code:

SELECT
  product_id,
  first(product_code) as product_code,
  first(product_name) as product_name,
  first(time_date) as time_date, 
  max(price_max) as price_max,
  min(price_min) as price_min,
  sum(count_of_sales) as count_of_sales,
  SUM(CASE WHEN time_date = 1538323200000 THEN cost_of_stock_start ELSE 0 END) as cost_of_stock_start
from storeproductincomelogs 
WHERE time_date>= 1538323200000 
  AND time_date<= 1541001600000 
group by product_id;

to

SELECT
  product_id,
  product_code,
  product_name,
  min(time_date) as time_date,  -- had to change first() to min(); this column can't go in the group by
  max(price_max) as price_max,
  min(price_min) as price_min,
  sum(count_of_sales) as count_of_sales,
  SUM(CASE WHEN time_date = 1538323200000 THEN cost_of_stock_start ELSE 0 END) as cost_of_stock_start
from storeproductincomelogs 
WHERE time_date>= 1538323200000 
  AND time_date<= 1541001600000 
group by product_id,product_code,product_name;

Without first, I have to add product_code and product_name to the group by clause. Otherwise I get this error:

Invalid operation: column "storeproductincomelogs.product_code" must appear in the GROUP BY clause or be used in an aggregate function;

Note: here product_id and product_code are unique per row, and a duplicated product_name is also unlikely (but may appear in the future, so I think I can't rely on group by).
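
A quick way to check that (just a sketch against the same table) is to count how many distinct product_name values each product_id currently has:

SELECT product_id,
       count(DISTINCT product_name) AS name_count
FROM storeproductincomelogs
GROUP BY product_id
HAVING count(DISTINCT product_name) > 1;

Today this should return no rows, but that may not stay true.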


I searched for a MySQL first() equivalent in PostgreSQL and found "Select first row in each GROUP BY group?".

First, I tried the DISTINCT ON clause, which is not supported in Redshift.
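
My DISTINCT ON attempt looked roughly like this (a sketch of the PostgreSQL syntax from the linked answer; Redshift rejects it, and it also doesn't combine with the sum/max aggregates I need anyway):

SELECT DISTINCT ON (product_id)   -- PostgreSQL syntax, not valid in Redshift
       product_id,
       product_code,
       product_name
  FROM storeproductincomelogs
 ORDER BY product_id, time_date;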

Then, I tried:

WITH summary AS (
    SELECT product_id,
           product_code,
           product_name,
            min(time_date) as time_date,
            max(price_max) as price_max,
            sum(count_of_sales) as count_of_sales,
            SUM(CASE WHEN time_date = 1538323200000 THEN cost_of_stock_start ELSE 0 END) as cost_of_stock_start,
           ROW_NUMBER() OVER(PARTITION BY product_id ) AS rk
      FROM  storeproductincomelogs)
SELECT *
  FROM summary
 WHERE rk = 1;

and got this error:

[42803][500310] Amazon Invalid operation: column "storeproductincomelogs.product_id" must appear in the GROUP BY clause or be used in an aggregate function;

I don't know how to write a correct one, so I can't test the performance.

How can I do this in Redshift?

Mithril

2 Answers

1

As I understand it, you don't want to group by product_code and product_name because there is a tiny possibility that they are not always the same for a given product_id.

So I would suggest just taking min (or max) of those two fields as well:

SELECT
  product_id,
  min(product_code) as product_code,
  min(product_name) as product_name,
  min(time_date) as time_date,
  max(price_max) as price_max,
  ... ...
group by product_id;
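
Filling in the remaining columns and filters from the question, the complete query would look roughly like this (a sketch assembled from the question's own column list):

SELECT
  product_id,
  min(product_code) as product_code,
  min(product_name) as product_name,
  min(time_date) as time_date,
  max(price_max) as price_max,
  min(price_min) as price_min,
  sum(count_of_sales) as count_of_sales,
  SUM(CASE WHEN time_date = 1538323200000 THEN cost_of_stock_start ELSE 0 END) as cost_of_stock_start
from storeproductincomelogs
WHERE time_date >= 1538323200000
  AND time_date <= 1541001600000
group by product_id;

Note that min and max work on character columns too (lexicographic comparison), which is what makes this usable for product_code and product_name.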
trincot
  • `first` ordered by product_id is fine. `product_id` would not be duplicated; `product_code` and `product_name` are just there for readability since I need to dump the result to xlsx, and they may have duplicates but nothing serious. – Mithril Jan 21 '19 at 08:34
  • Understood, I would need to add `OVER(PARTITION BY product_id)` for each column that needs aggregation (roughly as in the sketch after these comments). But it becomes extremely slow, much slower than group by. – Mithril Jan 21 '19 at 08:36
  • I suppose you have an index on `storeproductincomelogs.product_id`? – trincot Jan 21 '19 at 08:53
  • I also noted in your question that you have a field `time_date`: that seems a good candidate for the `order by` in the window. – trincot Jan 21 '19 at 08:55
  • Indexes are not supported (and not needed) in Redshift, a column-oriented DBMS. I don't need order by either. Actually there are more than 10 columns that need `sum`; `product_code, product_name, time_date` are only there for readability. – Mithril Jan 21 '19 at 09:16
  • In fact, in your `group by product_id,product_code,product_name;` attempt you should not get that error that you mentioned in the question. Are you sure you don't reference `product_code` elsewhere in the query? – trincot Jan 21 '19 at 09:42
  • `group by product_id,product_code,product_name` does not cause that error. But I only want `group by product_id`, that is the key problem. As I said in the question, a duplicated `product_name` is unlikely right now (but may appear in the future, so I think I can't rely on group by). – Mithril Jan 21 '19 at 09:52
  • What do you mean with "hardly found"? – trincot Jan 21 '19 at 09:54
  • Right now there is no duplicated `product_name`, but there may be in the future. – Mithril Jan 21 '19 at 09:54
  • Oh, you mean that for the same product id, you would have two different names? – trincot Jan 21 '19 at 09:55
  • Yes! That is what I want to avoid. – Mithril Jan 21 '19 at 09:56
  • But in that case, why not do `select ... min(product_name) as product_name, ...`? – trincot Jan 21 '19 at 09:56
  • `product_name` is a string, can I use `min`?? – Mithril Jan 21 '19 at 09:56
  • Wow, I had never heard of this; it works, thank you! – Mithril Jan 21 '19 at 09:59
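
For reference, the window-function variant mentioned in the comments above (adding OVER(PARTITION BY product_id) to every aggregated column) would look roughly like this; it runs on Redshift but, as noted, is much slower than the group by version. Only a few of the 10+ summed columns are shown:

WITH summary AS (
    SELECT product_id,
           product_code,
           product_name,
           -- each aggregate becomes a window aggregate over the product_id partition
           min(time_date)      OVER (PARTITION BY product_id) AS time_date,
           max(price_max)      OVER (PARTITION BY product_id) AS price_max,
           sum(count_of_sales) OVER (PARTITION BY product_id) AS count_of_sales,
           ROW_NUMBER()        OVER (PARTITION BY product_id) AS rk
      FROM storeproductincomelogs
     WHERE time_date >= 1538323200000
       AND time_date <= 1541001600000)
SELECT *
  FROM summary
 WHERE rk = 1;   -- keep a single row per product_id
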
0

You can try the query below - you need to add ROW_NUMBER() OVER(PARTITION BY product_id ORDER BY price_max DESC), which will give you the row with the maximum price for each product:

WITH summary AS (
    SELECT product_id,
           product_code,
           product_name,
           price_max,
           ROW_NUMBER() OVER(PARTITION BY product_id order by price_max desc) AS rk
      FROM  storeproductincomelogs)
SELECT *
  FROM summary
 WHERE rk = 1;
Fahmi