
I am trying to write this query with multiple left joins, but it returns duplicate updates and scientists for each charge associated with the project id (e.g. if there are 5 charges, then each update and scientist is returned 5 times). I'm trying to avoid multiple select statements but have been having trouble with this.

SELECT
  projects.*,
  coalesce(json_agg(updates ORDER BY update_date DESC) FILTER (WHERE updates.id IS NOT NULL), '[]') AS updates,
  coalesce(json_agg(scientists) FILTER (WHERE scientists.user_id IS NOT NULL), '[]') AS scientists,
  coalesce(SUM(charges.amount), 0) AS donated,
  coalesce(COUNT(charges), 0) AS num_donations
FROM projects
LEFT JOIN updates
ON updates.project_id = projects.id
LEFT JOIN scientists
ON scientists.project_id = projects.id
LEFT JOIN charges
ON charges.project_id = projects.id
WHERE projects.id = '${id}'
GROUP BY projects.id;

Expected results (changed to only return ids):

                  id                  |                   updates                |             scientists             | donated | num_donations 
--------------------------------------+------------------------------------------+------------------------------------+---------+---------------
 17191850-9a03-482f-9afe-7dc6b69974ea | ["0c29417f-0afb-44df-a8cf-24dc5cc7962c"] | ["auth0|5efcfb5f652e5a0019ce2193"] |     155 |             5

Actual Results:

                  id                  |                                                                                                 updates                                                                                                  |                                                                                 scientists                                                                                 | donated | num_donations 
--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+---------------
 17191850-9a03-482f-9afe-7dc6b69974ea | ["0c29417f-0afb-44df-a8cf-24dc5cc7962c", "0c29417f-0afb-44df-a8cf-24dc5cc7962c", "0c29417f-0afb-44df-a8cf-24dc5cc7962c", "0c29417f-0afb-44df-a8cf-24dc5cc7962c", "0c29417f-0afb-44df-a8cf-24dc5cc7962c"] | ["auth0|5efcfb5f652e5a0019ce2193", "auth0|5efcfb5f652e5a0019ce2193", "auth0|5efcfb5f652e5a0019ce2193", "auth0|5efcfb5f652e5a0019ce2193", "auth0|5efcfb5f652e5a0019ce2193"] |     155 |             5
  • provide sample data and your expected output in tabular format – Fahmi Sep 12 '20 at 05:29
  • Updated it! Thanks for the suggestion @Fahmi – Kaden Badalian Sep 12 '20 at 05:44
  • "If there are 5 charges then each scientist is returned 5 times" - so restrict to only the 1 charge you want. This is called a Cartesian product. The db is giving you what you ask for; you have to be more precise about what you ask for. You didn't tell us which 1 charge you want (latest, highest, group all into a csv) so we can't know what to do any more than the database knows what to do – Caius Jard Sep 12 '20 at 06:02
  • @CaiusJard I don't think I understand correctly. I want all scientists and updates to be returned for the specified project id, but there is only one distinct update and scientist in the DB associated with that project id. When I add the join with charges (with X distinct charges associated with the project id) the scientists and updates are duplicated X times. – Kaden Badalian Sep 12 '20 at 06:20
  • Of course they are. How would you expect a rectangular results grid to look otherwise? You have 1 project, 1 projectscientist, 1 projectupdate and 5 projectcharges. If you join project, scientist and update you will get 1 row. When you add charges in, the database MUST repeat the project/scientist/update data 5 times, one repetition per each of the 5 charges. If only the latest charge is relevant then you make that part of the query, and because only one of the five charge rows applies you will get one row again – Caius Jard Sep 12 '20 at 06:32
  • This last bit is the critical part, really - if the complaint is "I get 5 rows when I only want 1" then you *have to* throw some detail away. That is done either by discarding 4 of the charge rows because they're irrelevant or by picking some factor about the charges that you actually only want to know such as "the price/voltage of highest charge" or "the count of the charges" or "the average time taken to charge up the battery" - some function that can be applied to a set of data and produce a single value, we group on the other stuff and apply the aggregation to the charges – Caius Jard Sep 12 '20 at 06:41
  • @CaiusJard Hmm interesting, I didn't realize it would work that way. Since I want to get a sum and count of all charges, would you suggest just doing nested select statements for the updates, scientists, and charges? – Kaden Badalian Sep 12 '20 at 06:43
  • ps; you can't group by one column and then specify other, non grouped, non aggregated columns in an SQL query; it doesn't make sense to do it. In a group scenario data must either be part of the key that uniquely identifies the group or part of the aggregation applied to the group. You cannot `SELECT a,b,c FROM table GROUP BY a` – Caius Jard Sep 12 '20 at 06:45
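The multiplication described in the comments is easy to reproduce. A minimal sketch using Python's sqlite3 as a stand-in for Postgres (the table names mirror the question, but the data is made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE projects  (id TEXT PRIMARY KEY);
CREATE TABLE updates   (id TEXT, project_id TEXT);
CREATE TABLE scientists(user_id TEXT, project_id TEXT);
CREATE TABLE charges   (amount INTEGER, project_id TEXT);

INSERT INTO projects   VALUES ('p1');
INSERT INTO updates    VALUES ('u1', 'p1');
INSERT INTO scientists VALUES ('s1', 'p1');
INSERT INTO charges    VALUES (10,'p1'),(20,'p1'),(30,'p1'),(40,'p1'),(55,'p1');
""")

# One update and one scientist, but five charges: the three-way join
# repeats the update/scientist pair once per charge row.
rows = db.execute("""
    SELECT u.id, s.user_id, c.amount
    FROM projects p
    LEFT JOIN updates    u ON u.project_id = p.id
    LEFT JOIN scientists s ON s.project_id = p.id
    LEFT JOIN charges    c ON c.project_id = p.id
    WHERE p.id = 'p1'
""").fetchall()
print(len(rows))   # prints 5: each row repeats ('u1', 's1')
```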

3 Answers


If you have this:

SELECT p.column, s.column, u.column
FROM 
  p 
  JOIN s ON ...
  JOIN u ON ...

And it produces one row

p1, s1, u1

And then you join another table in:

SELECT p.column, s.column, u.column, c.column
FROM 
  p 
  JOIN s ON ...
  JOIN u ON ...
  JOIN c ON ...

And it suddenly produces 5 rows..

p1, s1, u1, c1
p1, s1, u1, c2
p1, s1, u1, c3
p1, s1, u1, c4
p1, s1, u1, c5   

And you want it to produce one row again but with another column with a count of 5:

p1, s1, u1, 5

Then you need to group the repeating data and add a count:

SELECT p.column, s.column, u.column, count(*)
FROM 
  p 
  JOIN s ON ...
  JOIN u ON ...
  JOIN c ON ...
GROUP BY p.column, s.column, u.column

You'll note that the GROUP BY section is just an exact repeat of the SELECT section, minus the count (an aggregate column)

The database will group the data up according to the key specified in the GROUP BY. p1, s1, u1 is a unique combination and is associated with 5 different c1 .. c5 values. The aggregation in this case doesn't apply to the cX data (because it's count(*)), but it could; if we were to say:

SELECT p.column, s.column, u.column, min(c.column), max(c.column)

Then the DB makes this data set together with a bucket that contains all the c values:

p1, s1, u1, [c1, c2, c3, c4, c5]

And applies the MIN and MAX functions to the [c1, c2, c3, c4, c5] bucket pulling c1 and c5 respectively

In your mind, get used to seeing grouping operations as preparing the unique combination of columns in the group by, plus having all these other items of data in a big unordered bucket, and the MAX/MIN/AVG etc functions operate on the bucket contents and pull the relevant data (which could come from any row, and naturally MIN and MAX will probably pull from different rows). Grouping loses the notion of "this input row" because it prepares a new set of rows
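The bucket model above can be tried directly. A runnable sketch using Python's sqlite3, with the p/s/u/c names standing in for the pseudo-tables in this answer:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE p (col TEXT);
CREATE TABLE s (col TEXT, p_col TEXT);
CREATE TABLE u (col TEXT, p_col TEXT);
CREATE TABLE c (col INTEGER, p_col TEXT);
INSERT INTO p VALUES ('p1');
INSERT INTO s VALUES ('s1','p1');
INSERT INTO u VALUES ('u1','p1');
INSERT INTO c VALUES (1,'p1'),(2,'p1'),(3,'p1'),(4,'p1'),(5,'p1');
""")

# Group on the repeating columns; MIN/MAX/COUNT operate on the
# "bucket" of five c values and collapse the five joined rows to one.
row = db.execute("""
    SELECT p.col, s.col, u.col, MIN(c.col), MAX(c.col), COUNT(*)
    FROM p
    JOIN s ON s.p_col = p.col
    JOIN u ON u.p_col = p.col
    JOIN c ON c.p_col = p.col
    GROUP BY p.col, s.col, u.col
""").fetchone()
print(row)   # ('p1', 's1', 'u1', 1, 5, 5)
```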


In most typical grouping situations in various DBs you can't use SELECT * if you're grouping - you list out every one of the columns in the SELECT and again in the GROUP BY. This might seem redundant (and indeed some databases let you omit columns from the GROUP BY) but it's possible in advanced scenarios to group by different things than you select, so it's only redundant in the simple case


Now, hopefully you're down with all that above. Some databases have functions that aren't just MIN/MAX etc but will concatenate all the results in the bucket. Something like this pseudoSQL:

SELECT p.column, s.column, u.column, STRING_JOIN(c.column, '|')

Could produce:

p1, s1, u1, c1|c2|c3|c4|c5

The string_join function is designed to concatenate all the things in the bucket together, using the pipe char specified as a delimiter.

But remember that our original data was:

p1, s1, u1, c1
p1, s1, u1, c2
p1, s1, u1, c3
p1, s1, u1, c4
p1, s1, u1, c5  

If we were to GROUP BY just p.column, the DB would use p1 as the key and make more buckets:

p1, [s1,s1,s1,s1,s1], [u1,u1,u1,u1,u1], [c1,c2,c3,c4,c5]

If you were to STRING_JOIN each of these you'd end up with what you asked for:

SELECT p.column, STRING_JOIN(s.column, '|'), STRING_JOIN(u.column, '|'), STRING_JOIN(c.column, '|')

p1, s1|s1|s1|s1|s1, u1|u1|u1|u1|u1, c1|c2|c3|c4|c5
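SQLite's group_concat plays the role of the STRING_JOIN pseudo-function, so the repetition above can be reproduced directly (made-up p/s/u/c tables mirroring the pseudo-tables in this answer, not the asker's schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE p (col TEXT);
CREATE TABLE s (col TEXT, p_col TEXT);
CREATE TABLE u (col TEXT, p_col TEXT);
CREATE TABLE c (col TEXT, p_col TEXT);
INSERT INTO p VALUES ('p1');
INSERT INTO s VALUES ('s1','p1');
INSERT INTO u VALUES ('u1','p1');
INSERT INTO c VALUES ('c1','p1'),('c2','p1'),('c3','p1'),('c4','p1'),('c5','p1');
""")

# Grouping by p.col alone: the s and u buckets each hold five copies,
# so concatenating them repeats s1 and u1 five times.
row = db.execute("""
    SELECT p.col,
           group_concat(s.col, '|'),
           group_concat(u.col, '|'),
           group_concat(c.col, '|')
    FROM p
    JOIN s ON s.p_col = p.col
    JOIN u ON u.p_col = p.col
    JOIN c ON c.p_col = p.col
    GROUP BY p.col
""").fetchone()
print(row[1])   # s1|s1|s1|s1|s1
```

Note that group_concat makes no ordering promise for the c bucket, which is exactly the "order of elements in the set" caveat discussed below.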

There isn't any AI in the DB that will look and say "I'll remove duplicates from the s and u buckets before I join", nor should there be. As I mentioned before, all concept of rows and ordering is lost when data goes into a bucket for aggregation. If your data was:

p1, x1, y1
p1, x2, y2

And you grouped/joined you could end up with

p1, x1|x2, y2|y1

Note the order of elements in the y string is inverted compared to x - don't rely on "the order of elements in the set" to infer anything about e.g. the row they came from originally

So, what's going on with your query? Well, you're grouping by just one column and aggregating others, like above, so you can see how you get repetitions of the non grouped columns.

If you kept on grouping by all the columns then you'd have your single scientists and updates. If you really want them as JSON, then (assuming this really is Postgres) you have to_json and row_to_json, which give a single JSON value, but that doesn't add much that the individual columns don't already give you. Postgres will also let you group by the whole rows so the JSON functions work:

SELECT p.column, row_to_json(s), row_to_json(u), count(*)
...
GROUP BY p.column, s.*, u.*

The presence of s.* and u.* will allow the row_to_json calls to produce the single row of json describing S and U, and the count will count the Cs

Caius Jard

Your joins multiply the rows, since there are multiple matches in several tables, as has been thoroughly explained by Caius Jard.

A typical solution is to pre-aggregate in subqueries. For your use case, where you are filtering on just one project, lateral joins should be the most efficient option:

SELECT p.*, u.*, s.*, c.*
FROM projects p
CROSS JOIN LATERAL (
    SELECT coalesce(json_agg(u ORDER BY u.update_date DESC), '[]') AS updates
    FROM updates u
    WHERE u.project_id = p.id
) u
CROSS JOIN LATERAL (
    SELECT coalesce(json_agg(s), '[]') AS scientists
    FROM scientists s
    WHERE s.project_id = p.id
) s
CROSS JOIN LATERAL (
    SELECT coalesce(SUM(c.amount), 0) AS donated, COUNT(*) AS num_donations
    FROM charges c
    WHERE c.project_id = p.id
) c
WHERE p.id = '${id}'
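SQLite has no LATERAL, but the same idea (aggregate each table down to one row per project before it meets the outer row) can be sketched with correlated subqueries in Python's sqlite3; json_group_array is SQLite's rough analogue of json_agg, and the data here is made up:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE projects  (id TEXT PRIMARY KEY);
CREATE TABLE updates   (id TEXT, project_id TEXT);
CREATE TABLE scientists(user_id TEXT, project_id TEXT);
CREATE TABLE charges   (amount INTEGER, project_id TEXT);
INSERT INTO projects   VALUES ('p1');
INSERT INTO updates    VALUES ('u1','p1');
INSERT INTO scientists VALUES ('s1','p1');
INSERT INTO charges    VALUES (10,'p1'),(20,'p1'),(30,'p1'),(40,'p1'),(55,'p1');
""")

# Each subquery aggregates its table to a single value per project
# before it touches the outer row, so nothing gets multiplied.
row = db.execute("""
    SELECT p.id,
           (SELECT json_group_array(u.id)      FROM updates u    WHERE u.project_id = p.id),
           (SELECT json_group_array(s.user_id) FROM scientists s WHERE s.project_id = p.id),
           (SELECT COALESCE(SUM(c.amount), 0)  FROM charges c    WHERE c.project_id = p.id),
           (SELECT COUNT(*)                    FROM charges c    WHERE c.project_id = p.id)
    FROM projects p
    WHERE p.id = 'p1'
""").fetchone()
print(row)   # ('p1', '["u1"]', '["s1"]', 155, 5)
```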
GMB
  • Thank you for this! I think I confused people by saying there was one distinct scientist and update when this isn't necessarily the case. Using the `CROSS JOINS` worked for me! – Kaden Badalian Sep 12 '20 at 16:21

The basic problem is exactly the same as here:

You later commented:

there is only one distinct update and scientist in the DB associated with that project id

If that's guaranteed to be true, all you need is to aggregate rows from table charges before you join:

SELECT p.*
     , COALESCE(to_json(u), '[]') AS updates
     , COALESCE(to_json(s), '[]') AS scientists
     , c.donated
     , c.num_donations
FROM   projects        p
LEFT   JOIN updates    u ON u.project_id = p.id
LEFT   JOIN scientists s ON s.project_id = p.id
CROSS  JOIN (
   SELECT COALESCE(SUM(amount), 0) AS donated
        , COUNT(*)    AS num_donations
   FROM   charges
   WHERE  project_id = '${id}'
   ) c
WHERE  p.id = '${id}'

The subquery on charges can be that simple because the only filter is the same ID as used in the outer query. We also do not need COALESCE() for the count because ...

  1. ... count() never returns NULL anyway.
  2. ... the subquery (with aggregate functions and no GROUP BY) is guaranteed to return exactly one row, aggregating all qualifying rows - even if 0 rows qualify.
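Point 2 is easy to verify with any SQL engine; a sketch in Python's sqlite3 against an empty charges table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE charges (amount INTEGER, project_id TEXT)")

# No charges at all, yet the aggregate query still yields exactly one
# row: COUNT(*) is 0 (never NULL), while SUM() over zero rows is NULL
# and needs the COALESCE.
row = db.execute("""
    SELECT COALESCE(SUM(amount), 0) AS donated,
           COUNT(*) AS num_donations
    FROM charges
    WHERE project_id = 'p1'
""").fetchone()
print(row)   # (0, 0)
```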

If there can be multiple related rows in the tables updates or scientists after all, aggregate in a similar fashion before you CROSS JOIN.

Erwin Brandstetter