Get the distinct sum of a joined table column

Question

I have a problem here, and I'm hoping there is an easy solution. I'll try to make this as simple as possible:

A ticket belongs to an attendee
Example:

select * from tickets JOIN attendees ON attendee.id = tickets.attendee_id

An attendee has a decimal column called "revenue"

That said, I need to run a query that will return a variety of information about the tickets, including the total revenue. The problem is that if 2 tickets belong to the same attendee, it counts their revenue twice. How can I sum the attendee revenue only once?

I don't want to use subqueries as my ORM makes this difficult. Plus a sub query solution doesn't scale if I want to do this for multiple columns.

Here's what I have:

1 attendees with a revenue of 100
2 tickets that both belong to that attendee

Select count(tickets.*) as tickets_count
     , sum(attendees.revenue) as atendees_revenue
from tickets LEFT OUTER JOIN attendees ON attendees.id = tickets.attendee_id;

=> This tells me that attendees_revenue is 200. I want it to be 100. Since there is one attendee in the database with an existing_revenue of 100. I do NOT want the attendee to be double counted.

Please let me know if this is possible.

With this kind of question it is extremely helpful if you provide a schema and sample data, preferably as an SQLFiddle - see http://sqlfiddle.com/ . That allows people writing answers to easily test with *your* schema and to make sure their queries work correctly without first having to create their own dummy data. Also: You've tagged this `SQL`, not `my-orm`. If you want SQL queries the ORM shouldn't matter; if you don't want SQL we need to know what your ORM is. — Craig Ringer, Nov 01 '12 at 00:33
Please also show your existing subquery solution (preferably running on the SQLFiddle example you create) so we can see what the result you've trying to achieve is. The `explain analyze` of it might be informative; paste that to http://explain.depesz.com/ and link back to it here. — Craig Ringer, Nov 01 '12 at 00:36
I think you explained the problem well, if you don't mind, try testing the following SQL against your data set: SELECT COUNT (tickets.*) AS tickets_count, SUM (DISTINCT attendees.revenue) AS atendees_revenue FROM tickets LEFT OUTER JOIN attendees ON attendees.id = tickets.attendee_id — Roberto Navarro, Nov 01 '12 at 00:39
@RobertoNavarro Indeed, it's well explained (+1), and that's appreciated. I'm just being greedy and suggesting what'd be *ideal*. — Craig Ringer, Nov 01 '12 at 00:42
@RobertoNavarro: I disagree. A table definition and some sample data is the minimum that can be expected from somebody who asks for help. — Erwin Brandstetter, Nov 01 '12 at 01:00
As for `A sub query solution doesn't scale if I want to do this for multiple columns.` Depending on what "do this for multiple columns" is supposed to mean, I would probably challenge this as unfounded. — Erwin Brandstetter, Nov 01 '12 at 01:55

Erwin Brandstetter · Accepted Answer · 2020-04-28T14:21:50.050

To get the result without subquery, you have to resort to advanced window function trickery:

SELECT sum(count(*))       OVER () AS tickets_count
     , sum(min(a.revenue)) OVER () AS atendees_revenue
FROM   tickets   t
JOIN   attendees a ON a.id = t.attendee_id
GROUP  BY t.attendee_id
LIMIT  1;

sqlfiddle

How does it work?

The key to understanding this is the sequence of events in the query:

aggregate functions -> window functions -> DISTINCT -> LIMIT

More details:

Best way to get result count before LIMIT was applied

Step by step:

I GROUP BY t.attendee_id - which you would normally do in a subquery.
Then I sum over the counts to get the total count of tickets. Not very efficient, but forced by your requirement. The aggregate function count(*) is wrapped in the window function sum( ... ) OVER () to arrive at the not-so-common expression: sum(count(*)) OVER ().

And sum the minimum revenue per attendee to get the sum without duplicates.

You could also use max() or avg() instead of min() to the same effect as revenue is guaranteed to be the same for every row per attendee.

This could be simpler if DISTINCT was allowed in window functions, but PostgreSQL has not (yet) implemented this feature. Per documentation:

Aggregate window functions, unlike normal aggregate functions, do not allow DISTINCT or ORDER BY to be used within the function argument list.
Final step is to get a single row. This could be done with DISTINCT (SQL standard) since all rows are the same. LIMIT 1 will be faster, though. Or the SQL-standard form FETCH FIRST 1 ROWS ONLY.

Heriberto Magaña · Answer 2 · 2017-11-21T00:00:20.613

3

What about a simple division:

 Select count(tickets.*) as tickets_count
 , sum(attendees.revenue) / count(attendees.id) as atendees_revenue
 from tickets LEFT OUTER JOIN attendees ON attendees.id = 
 tickets.attendee_id;

This should handle duplicates, triplicates, etcetera.

edited Nov 21 '17 at 00:00

answered Nov 20 '17 at 23:03

Heriberto Magaña

882
10
11

score -1 · Answer 3 · answered Aug 26 '17 at 21:14

-1

Previous answer is nearly correct. You just need to make distinct work well in case identical revenues. You can fix this really simple if your id has numeric type:

SELECT
  COUNT(*) AS ticketsCount,
  SUM(DISTINCT attendees.id + attendees.revenue) - SUM(DISTINCT attendees.id) AS revenueSum
FROM
  tickets
LEFT JOIN attendees ON
  attendees.id = tickets.attendee_id

answered Aug 26 '17 at 21:14

mikezig92

1

Quite late to the party, but this won't work either. An attendee with id `6` and revenue `4` will not be distinct from the attendee with id `4` and revenue `6`. So in this case the result (`10 - 10 = 0`) won't match the expected result (`10`). – Sander van Leeuwen Jun 23 '21 at 11:43

score -3 · Answer 4 · answered Nov 01 '12 at 00:56

-3

You were actually pretty close, there's many way to do this and if I understand your question correctly this should do it :

SELECT
   COUNT(*) AS ticketsCount,
   SUM(DISTINCT attendees.revenue) AS revenueSum
FROM
   tickets
   LEFT JOIN attendees ON
      attendees.id = tickets.attendee_id

answered Nov 01 '12 at 00:56

Kad

639
8
25

1

This is incorrect, just like the deleted answer by @Mahmoud. It would not consider identical revenues from two *different* attendees `DISTINCT` and report a smaller `revenueSum` than is correct. – Erwin Brandstetter Nov 01 '12 at 01:18
2

do NOT do this, it will sum all revenues that aren't the same, so if you have a revenue that is `50` and another that's also `50` it will only add the first one. – arg20 Mar 29 '18 at 23:57

Get the distinct sum of a joined table column

4 Answers4

How does it work?

Linked