5

Given the following 4 tables:

CREATE TABLE events ( id, name )
CREATE TABLE profiles ( id, event_id )
CREATE TABLE donations ( amount, profile_id )
CREATE TABLE event_members( id, event_id, user_id )

I'm attempting to get a list of all events, along with a count of any members and a sum of any donations. The issue is that the sum of donations is coming back wrong (it appears to be a cartesian result: the actual sum multiplied by the number of event_members rows).

Here is the SQL query (Postgres)

SELECT events.name, COUNT(DISTINCT event_members.id), SUM(donations.amount)
FROM            events
LEFT OUTER JOIN profiles      ON events.id = profiles.event_id
LEFT OUTER JOIN donations     ON donations.profile_id = profiles.id
LEFT OUTER JOIN event_members ON event_members.event_id = events.id
GROUP BY events.name

The sum(donations.amount) is coming back equal to the actual sum of donations * number of rows in event_members. If I comment out the count(distinct event_members.id) and the event_members left outer join, the sum is correct.
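The inflation is easy to reproduce. A minimal sketch using Python's sqlite3 as a stand-in for Postgres (sample data invented; the join behaviour is the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events        (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE profiles      (id INTEGER PRIMARY KEY, event_id INTEGER);
    CREATE TABLE donations     (amount INTEGER, profile_id INTEGER);
    CREATE TABLE event_members (id INTEGER PRIMARY KEY, event_id INTEGER, user_id INTEGER);

    INSERT INTO events        VALUES (1, 'Alpha');
    INSERT INTO profiles      VALUES (10, 1);
    INSERT INTO donations     VALUES (100, 10), (50, 10);              -- true sum: 150
    INSERT INTO event_members VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3);  -- 3 members
""")

# The query from the question: the 2 donation rows are each repeated
# once per member row, so the SUM is inflated by a factor of 3.
row = con.execute("""
    SELECT events.name, COUNT(DISTINCT event_members.id), SUM(donations.amount)
    FROM            events
    LEFT OUTER JOIN profiles      ON events.id = profiles.event_id
    LEFT OUTER JOIN donations     ON donations.profile_id = profiles.id
    LEFT OUTER JOIN event_members ON event_members.event_id = events.id
    GROUP BY events.name
""").fetchone()
print(row)  # ('Alpha', 3, 450) -- 150 * 3 members, not 150
```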

philipxy
  • 14,867
  • 6
  • 39
  • 83
John P
  • 1,540
  • 2
  • 22
  • 34
  • incidentally, changing sum(donations.amount) to be count(distinct donations.id) does result in the correct number of donations – John P Feb 05 '13 at 07:34
  • Hi, just reordered the `create table` statements in your question to mirror the `join` order. – biziclop Feb 05 '13 at 07:36
  • 1
    Closely related: http://stackoverflow.com/questions/12464037/two-sql-left-joins-produce-incorrect-result – Erwin Brandstetter Feb 05 '13 at 07:36
  • Hi, you could make your own solution into a new answer (and even accept it, if you like it the best :) – biziclop Feb 05 '13 at 07:57
  • Test performance with `EXPLAIN ANALYZE`. Correlated subqueries (like you have in your solution) are regularly much slower. It may still win if there are only few rows in the result out of many in the base table. – Erwin Brandstetter Feb 05 '13 at 08:00
  • With my fairly small sample result set (2 events, ~200 each of members,profiles and donations), the performance was within a few ms of each other. – John P Feb 05 '13 at 08:10
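The correlated-subquery solution the comments refer to is not shown in the thread; a hypothetical sketch of that shape (Python sqlite3 stand-in, sample data invented), in which each aggregate runs in its own subquery so the two one-to-many chains never multiply each other:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events        (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE profiles      (id INTEGER PRIMARY KEY, event_id INTEGER);
    CREATE TABLE donations     (amount INTEGER, profile_id INTEGER);
    CREATE TABLE event_members (id INTEGER PRIMARY KEY, event_id INTEGER, user_id INTEGER);

    INSERT INTO events        VALUES (1, 'Alpha');
    INSERT INTO profiles      VALUES (10, 1);
    INSERT INTO donations     VALUES (100, 10), (50, 10);
    INSERT INTO event_members VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3);
""")

# One correlated subquery per aggregate; no cross-multiplication.
row = con.execute("""
    SELECT e.name,
           (SELECT COUNT(*)
            FROM   event_members m
            WHERE  m.event_id = e.id)  AS member_count,
           (SELECT COALESCE(SUM(d.amount), 0)
            FROM   profiles  p
            JOIN   donations d ON d.profile_id = p.id
            WHERE  p.event_id = e.id)  AS donation_sum
    FROM events e
""").fetchone()
print(row)  # ('Alpha', 3, 150)
```

As Erwin notes in the comments, correlated subqueries can be slower at scale, so it is worth comparing plans with `EXPLAIN ANALYZE`.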

4 Answers

5

As I explained in an answer to the referenced question, you need to aggregate before joining to avoid a proxy CROSS JOIN. Like:

SELECT e.name, e.sum_donations, m.ct_members
FROM (
    SELECT e.id AS event_id, e.name, SUM(d.amount) AS sum_donations
    FROM   events         e
    LEFT   JOIN profiles  p ON p.event_id = e.id
    LEFT   JOIN donations d ON d.profile_id = p.id
    GROUP  BY 1, 2
    ) e
LEFT   JOIN (
    SELECT m.event_id, count(DISTINCT m.id) AS ct_members
    FROM   event_members m
    GROUP  BY 1
    ) m USING (event_id);

If event_members.id is the primary key, then id is guaranteed to be UNIQUE in the table and you can drop DISTINCT from the count:

count(*) AS ct_members
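The aggregate-before-join query can be checked end to end. A sketch using Python's sqlite3 with invented sample data (the ordinal `GROUP BY 1, 2` is spelled out as explicit columns for SQLite's benefit):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events        (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE profiles      (id INTEGER PRIMARY KEY, event_id INTEGER);
    CREATE TABLE donations     (amount INTEGER, profile_id INTEGER);
    CREATE TABLE event_members (id INTEGER PRIMARY KEY, event_id INTEGER, user_id INTEGER);

    INSERT INTO events        VALUES (1, 'Alpha');
    INSERT INTO profiles      VALUES (10, 1);
    INSERT INTO donations     VALUES (100, 10), (50, 10);              -- true sum: 150
    INSERT INTO event_members VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3);  -- 3 members
""")

# Each one-to-many chain is aggregated to one row per event *before*
# the two derived tables are joined, so nothing gets multiplied.
row = con.execute("""
    SELECT e.name, e.sum_donations, m.ct_members
    FROM (
        SELECT e.id AS event_id, e.name, SUM(d.amount) AS sum_donations
        FROM   events         e
        LEFT   JOIN profiles  p ON p.event_id = e.id
        LEFT   JOIN donations d ON d.profile_id = p.id
        GROUP  BY e.id, e.name
        ) e
    LEFT   JOIN (
        SELECT m.event_id, COUNT(*) AS ct_members
        FROM   event_members m
        GROUP  BY m.event_id
        ) m USING (event_id)
""").fetchone()
print(row)  # ('Alpha', 150, 3)
```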
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Why do you `GROUP BY` `event.id` and `event.name` in the first subselect? `event.name` probably depends on `event.id`. – biziclop Feb 05 '13 at 07:52
  • 1
    @biziclop: Because I may have to. Every `SELECT` item has to be either in the `GROUP BY` list or be used in an aggregate function. Since PostgreSQL 9.1 the primary key of a table covers all columns of that table, but it is not clear from the question that it *is* the primary key and we *are* running Postgres 9.1+. – Erwin Brandstetter Feb 05 '13 at 07:57
  • For my sample result set (2 events, 200 each in the other tables), all the solutions were taking +-3ms of each other. I do like the way this solution is structured though – John P Feb 05 '13 at 08:18
2

You seem to have these two independent structures (-[ means a 1:N association):

events -[ profiles -[ donations
events -[ event members

I wrapped the second one into a subquery:

SELECT events.name,
  member_count.the_member_count,
  SUM(donations.amount)

FROM            events
LEFT OUTER JOIN profiles      ON events.id = profiles.event_id
LEFT OUTER JOIN donations     ON donations.profile_id = profiles.id

LEFT OUTER JOIN (
  SELECT
    event_id,
    COUNT(*) AS the_member_count
  FROM event_members
  GROUP BY event_id
) AS member_count
  ON member_count.event_id = events.id

GROUP BY events.name, member_count.the_member_count
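A runnable version of this subquery-join shape (Python sqlite3 stand-in, sample data invented; the pre-aggregated member count is carried in the GROUP BY):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events        (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE profiles      (id INTEGER PRIMARY KEY, event_id INTEGER);
    CREATE TABLE donations     (amount INTEGER, profile_id INTEGER);
    CREATE TABLE event_members (id INTEGER PRIMARY KEY, event_id INTEGER, user_id INTEGER);

    INSERT INTO events        VALUES (1, 'Alpha');
    INSERT INTO profiles      VALUES (10, 1);
    INSERT INTO donations     VALUES (100, 10), (50, 10);              -- true sum: 150
    INSERT INTO event_members VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3);  -- 3 members
""")

# The members side collapses to one row per event before the join,
# so each donation row appears exactly once in the outer aggregate.
row = con.execute("""
    SELECT events.name,
           member_count.the_member_count,
           SUM(donations.amount)
    FROM            events
    LEFT OUTER JOIN profiles  ON events.id = profiles.event_id
    LEFT OUTER JOIN donations ON donations.profile_id = profiles.id
    LEFT OUTER JOIN (
        SELECT event_id, COUNT(*) AS the_member_count
        FROM event_members
        GROUP BY event_id
    ) AS member_count ON member_count.event_id = events.id
    GROUP BY events.name, member_count.the_member_count
""").fetchone()
print(row)  # ('Alpha', 3, 150)
```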
biziclop
  • 14,466
  • 3
  • 49
  • 65
  • Warning: my query may not be a syntactically correct PostgreSQL query. – biziclop Feb 05 '13 at 07:59
  • This worked - up vote for novel approach putting a select statement in a join. Performance is almost exactly the same as my solution. For posterity, need to add member_count.the_member_count to the group by – John P Feb 05 '13 at 08:03
  • Are you sure that my answer is the better one and not Erwin's one? Feel free to unaccept my answer :) – biziclop Feb 05 '13 at 08:04
  • 1
    @JohnP: Not sure why you accepted this. The basic idea is on the right track, but the query is not valid. – Erwin Brandstetter Feb 05 '13 at 08:21
1

Of course you get a cartesian product between donations and event_members for every event: both are bound only to the event, and there is no join relation between donations and event_members other than the event id, so every member row matches every donation row.

wich
  • 16,709
  • 6
  • 47
  • 72
0

When you do your query, you ask for all events - let's say there are two, event Alpha and event Beta - and then JOIN with the members. Let's say that there is a member Alice that participates on both events.

SELECT events.name, COUNT(DISTINCT event_members.id), SUM(donations.amount)
FROM            events
LEFT OUTER JOIN profiles      ON events.id = profiles.event_id
LEFT OUTER JOIN donations     ON donations.profile_id = profiles.id
LEFT OUTER JOIN event_members ON event_members.event_id = events.id
GROUP BY events.name

Each of those joined rows carries Alice's donation total. If Alice donated 100 USD, then you asked for:

Alpha  Alice  100USD
Beta   Alice  100USD

So it's not surprising that, when asking for the sum total, Alice comes out as having donated 200 USD.

If you wanted the sum of all donations, you'd be better off doing it with two distinct queries. Trying to do everything with a single query, while possible, is a classic SQL antipattern (namely the one in chapter 18, "Spaghetti Query"):

Unintended Products

One common consequence of producing all your results in one query is a Cartesian product. This happens when two of the tables in the query have no condition restricting their relationship. Without such a restriction, the join of two tables pairs each row in the first table to every row in the other table. Each such pairing becomes a row of the result set, and you end up with many more rows than you expect.

LSerni
  • 55,617
  • 10
  • 65
  • 107
  • Unfortunately, the reporting system I'm working with has to get all the results back from a single query. Embedding a sub-select in the column list did the trick – John P Feb 05 '13 at 07:50