0

I am trying to write a SQL query that returns the name and purchase amount of the five customers in each state who have spent the most money.

Table schemas

customers
|_state
|_customer_id
|_customer_name

transactions
|_customer_id
|_transact_amt

Attempts look something like this

SELECT state, Sum(transact_amt) AS HighestSum
FROM (
    SELECT name, transactions.transact_amt, SUM(transactions.transact_amt) AS HighestSum
    FROM customers
    INNER JOIN customers ON transactions.customer_id = customers.customer_id
    GROUP BY state
) Q
GROUP BY transact_amt
ORDER BY HighestSum 

I'm lost. Thank you.

Expected results are the names of customers with the top 5 highest transactions in each state.

ERROR:  table name "customers" specified more than once
SQL state: 42712
drworm
  • 13
  • 1
  • 1
  • 4

3 Answers3

1

First, you need for your JOIN to be correct. Second, you want to use window functions:

SELECT ct.*
FROM (SELECT c.customer_id, c.name, c.state, SUM(t.transact_amt) AS total,
             ROW_NUMBER() OVER (PARTITION BY c.state ORDER BY SUM(t.transact_amt) DESC) as seqnum
      FROM customers c JOIN
           transaactions t
           ON t.customer_id = c.customer_id
      GROUP BY c.customer_id, c.name, c.state
     ) ct
WHERE seqnum <= 5;

You seem to have several issues with SQL. I would start with understanding aggregation functions. You have a SUM() with the alias HighestSum. It is simply the total per customer.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

You can get them using aggregation and then by using the RANK() window function. For example:

select
  state,
  rk,
  customer_name
from (
  select
    *,
    rank() over(partition by state order by total desc) as rk
  from (
    select
      c.customer_id,
      c.customer_name,
      c.state,
      sum(t.transact_amt) as total
    from customers c
    join transactions t on t.customer_id = c.customer_id
    group by c.customer_id
  ) x
) y
where rk <= 5
order by state, rk
The Impaler
  • 45,731
  • 9
  • 39
  • 76
  • As the question is not clear, if OP wants exactly 5 (i.e if there are 10 customers that have spent the very same amount, get only 5 of them and ignore the rest), he can replace `rank()` with `row_number()`. – Svetlin Zarev Aug 21 '19 at 18:54
1

There are two valid answers already. Here's a third:

SELECT *
FROM  (
   SELECT c.state, c.customer_name, t.*
        , row_number() OVER (PARTITION BY c.state ORDER BY t.transact_sum DESC NULLS LAST, customer_id) AS rn
   FROM  (
      SELECT customer_id, sum(transact_amt) AS transact_sum
      FROM   transactions
      GROUP  BY customer_id
      ) t
   JOIN   customers c USING (customer_id)
   ) sub
WHERE  rn < 6
ORDER  BY state, rn;

Major points

  • When aggregating all or most rows of a big table, it's typically substantially faster to aggregate before the join. Assuming referential integrity (FK constraints), we won't be aggregating rows that would be filtered otherwise. This might change from nice-to-have to a pure necessity when joining to more aggregated tables. Related:

  • Add additional ORDER BY item(s) in the window function to define which rows to pick from ties. In my example, it's simply customer_id. If you have no tiebreaker, results are arbitrary in case of a tie, which may be OK. But every other execution might return different results, which typically is a problem. Or you include all ties in the result. Then we are back to rank() instead of row_number(). See:

  • While transact_amt can be NULL (has not been ruled out) any sum may end up to be NULL as well. With an an unsuspecting ORDER BY t.transact_sum DESC those customers come out on top as NULL comes first in descending order. Use DESC NULLS LAST to avoid this pitfall. (Or define the column transact_amt as NOT NULL.)

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Just want to know. The condition like `WHERE rn < 6` means that the whole dataset (100-1000-1000000 items) should be retrieved/calculated before and only several (6) items from it will be returned by the query. Probably, using pl/pgsql could be more efficient? – Abelisto Aug 21 '19 at 21:29
  • 1
    @Abelisto: Well, how would you identify the top 5 *without* looking at all rows? – Erwin Brandstetter Aug 21 '19 at 21:32
  • Some clarification: is `select * from (select row_number() over () as rn, * from sometable) as t where t.rn < 6` would be more efficient then `rn := 1; for v in select * from sometable loop return next; rn := rn + 1; if rn > 6 then break; end loop` – Abelisto Aug 21 '19 at 21:44
  • 1
    @Abelisto: Ah, you mean: avoid sorting the long tail? Not sure there is potential. Alternatives have expensive overhead. Also not sure whether Postgres is smart enough to optimize that automatically (and fully). A CTE *might* help with it. Might be worth a question on dba.SE ... – Erwin Brandstetter Aug 21 '19 at 22:08
  • Yes. In the first case wee need to 1) retrieve the whole data 2) calculate some value on each row 3) filter the result on some condition related to the calculated values. And in the second case wee need 1) retrieve only the n-s rows... and that's all. It is why I do not like answers related to filters on window-functions. – Abelisto Aug 21 '19 at 22:26