52

For my problem, we have a schema whereby one photo has many tags and also many comments. So if I have a query where I want all the comments and tags, it will multiply the rows together. So if one photo has 2 tags and 13 comments, I get 26 rows for that one photo:

SELECT
        tag.name, 
        comment.comment_id
FROM
        photo
        LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
        LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
        LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id

enter image description here

That's fine for most things, but it means that if I GROUP BY and then json_agg(tag.*), I get 13 copies of the first tag, and 13 copies of the second tag.

SELECT json_agg(tag.name) as tags
FROM
        photo
        LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
        LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
        LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id

enter image description here

Instead I want an array that is only 'suburban' and 'city', like this:

 [
      {"tag_id":1,"name":"suburban"}, 
      {"tag_id":2,"name":"city"}
 ]

I could json_agg(DISTINCT tag.name), but this will only make an array of tag names, when I want the entire row as json. I would like to json_agg(DISTINCT ON(tag.name) tag.*), but that's not valid SQL apparently.

How then can I simulate DISTINCT ON inside an aggregate function in Postgres?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Migwell
  • 18,631
  • 21
  • 91
  • 160
  • 3
    `json_agg(DISTINCT tag.name)` should work for you, have you tried this? – PinnyM May 06 '15 at 13:15
  • Can you just remove the join to `Comments`? You aren't selecting on anything in that table. If you need to keep it because you have something from it in your `WHERE` clause, perhaps then add every `tag` field to your `GROUP BY` and don't use `DISTINCT` – JNevill May 06 '15 at 13:17
  • @PinnyM No because then json_agg just makes an array of tag names. I want an array of objects representing the whole row (my second query is just an example, in reality I want to `json_agg(tag.*)` – Migwell May 06 '15 at 13:17
  • @JNevill No I can't, because I also want to select `json_agg(comment.*) AS comments`. – Migwell May 06 '15 at 13:18
  • According to the docs, [json_agg "aggregates values as a JSON array"](http://www.postgresql.org/docs/9.3/static/functions-aggregate.html). If all you want is to turn the row into json, use the `row_to_json` function. – PinnyM May 06 '15 at 13:22
  • No, I want an array of rows. Which is what json_agg(comment.*) would do. I don't just want one json object. – Migwell May 06 '15 at 13:23
  • Can you please post exactly what you'd like the expected output to look like? Still not clear to me... – PinnyM May 06 '15 at 13:25
  • Updated the question to have an expected output – Migwell May 06 '15 at 13:30
  • You have presented your question clearly, but important information is still missing to determine the best query. 1st, *always* your version of Postgres. 2nd, retrieve the whole table, a buch of photos of just a single photo per query? Also, I doubt you really want `json_agg(comment.*)` that would include `photo_id` redundantly. Typically you'd want a subset without the redundant column, no? – Erwin Brandstetter Mar 18 '16 at 06:49
  • json_agg doesn't work with distinct. It says there is no equality operator in JSON – Joe Love May 19 '21 at 15:26

4 Answers4

46

The most simple thing I discovered is to use DISTINCT over jsonb (not json!). (jsonb_build_object creates jsonb objects)

SELECT 
   JSON_AGG(
       DISTINCT jsonb_build_object('tag_id', photo_tag.tag_id, 
                                  'name', tag.name)) AS tags
FROM photo
    LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
    LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
    LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id
Eugene Kovalev
  • 3,407
  • 1
  • 15
  • 17
  • 2
    Nice. Though is there a downside to using `jsonb` over `json`? – kyw Jan 03 '20 at 07:07
  • 3
    @kyw Please, have a look at [this article](https://www.compose.com/articles/faster-operations-with-the-jsonb-data-type-in-postgresql/). There are pros and cons of `jsonb` over `json` at the very beginning. As for me, the pros are really good :) – Eugene Kovalev Jan 03 '20 at 10:39
  • @EugeneKovalev Any doc or links that explain behaviour on DISTINCT in JSONB objects? DISTINCT seems to me calculating distinct on the entire JSONB object and not just any key in the JSONB object would like to know how it works though. – Surya Jul 01 '20 at 05:08
  • 1
    DISTINCT with jsonb_build_object() work like a champ. Loved it. – Mohammad Ashraful Islam Aug 23 '21 at 05:35
  • 1
    thanks mate, much easier than the answers above – Alveona Oct 06 '21 at 10:18
  • be sure to use `jsonb_build_object` and not `json_build_object` – ihebiheb Aug 05 '23 at 21:49
31

Whenever you have a central table and want to left-join it to many rows in table A and also left-join it to many rows in table B, you get these problems of duplicating rows. It can especially throw off aggregrate functions like COUNT and SUM if you're not careful! So I think you need to construct your tags-for-each-photo and comments-for-each-photo separately, and then join them together:

WITH tags AS (
  SELECT  photo.photo_id, json_agg(row_to_json(tag.*)) AS tags
  FROM    photo
  LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id
  LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
  GROUP BY photo.photo_id
),
comments AS (
  SELECT  photo.photo_id, json_agg(row_to_json(comment.*)) AS comments
  FROM    photo
  LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
  GROUP BY photo.photo_id
)
SELECT  COALESCE(tags.photo_id, comments.photo_id) AS photo_id,
        tags.tags,
        comments.comments
FROM    tags
FULL OUTER JOIN comments
ON      tags.photo_id = comments.photo_id

EDIT: If you really want to join everything together without CTEs, this looks like it gives correct results:

SELECT  photo.photo_id,
        to_json(array_agg(DISTINCT tag.*)) AS tags,
        to_json(array_agg(DISTINCT comment.*)) AS comments
FROM    photo
LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id
LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id
Paul A Jungwirth
  • 23,504
  • 14
  • 74
  • 93
28

The cheapest and simplest DISTINCT operation is ... not to multiply rows in a "proxy cross join" in the first place. Aggregate first, then join. See:

Best for returning few selected rows

Assuming you actually don't want to retrieve the whole table, but just one or few selected photos at a time, with aggregated details. Then LATERAL subqueries are fast and elegant:

SELECT *
FROM   photo p
CROSS  JOIN LATERAL (
   SELECT json_agg(c) AS comments
   FROM   comment c
   WHERE  photo_id = p.photo_id
   ) c1
CROSS  JOIN LATERAL (
   SELECT json_agg(t) AS tags
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)
   WHERE  pt.photo_id = p.photo_id
   ) t
WHERE  p.photo_id = 2;  -- arbitrary selection

This returns whole rows from comment and tag, aggregated into JSON arrays separately. Rows are not multiplies like in your attempt, but they are only as "distinct" as they are in your base tables.

To additionally fold duplicates in the base data, see below.

Notes:

  • LATERAL and json_agg() require Postgres 9.3 or later.

  • json_agg(c) is short for json_agg(c.*).

  • We do not need to LEFT JOIN because an aggregate function like json_agg() always returns a row.

Typically, you'd only want a subset of columns - at least excluding the redundant photo_id:

SELECT *
FROM   photo p
CROSS  JOIN LATERAL (
   SELECT json_agg(json_build_object('comment_id', comment_id
                                   , 'comment', comment)) AS comments
   FROM   comment
   WHERE  photo_id = p.photo_id
   ) c
CROSS  JOIN LATERAL (
   SELECT json_agg(t) AS tags
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)
   WHERE  pt.photo_id = p.photo_id
   ) t
WHERE  p.photo_id = 2;

json_build_object() was introduced with Postgres 9.4. Used to be cumbersome in older versions because a ROW constructor doesn't preserve column names. But there are generic workarounds:

Also allows to choose JSON key names freely, you don't have to stick to column names.

Best for returning all or most rows

SELECT p.*
     , COALESCE(c1.comments, '[]') AS comments
     , COALESCE(t.tags, '[]') AS tags
FROM   photo p
LEFT   JOIN (
   SELECT photo_id
        , json_agg(json_build_object('comment_id', comment_id
                                   , 'comment', comment)) AS comments
   FROM   comment c
   GROUP  BY 1
   ) c1 USING (photo_id)
LEFT  JOIN LATERAL (
   SELECT photo_id , json_agg(t) AS tags
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)
   GROUP  BY 1
   ) t USING (photo_id);

Once we retrieve enough rows, this gets cheaper than LATERAL subqueries. Works for Postgres 9.3+.

Note the USING clause in the join condition. This way we can conveniently use SELECT * in the outer query without getting duplicate columns for photo_id. I didn't use SELECT * here because your deleted answer indicates you want empty JSON arrays instead of NULL for no tags / no comments.

Also remove existing duplicates in base tables

You can't just json_agg(DISTINCT json_build_object(...)) because there is no equality operator for the data type json. See:

There are various better ways:

SELECT *
FROM   photo p
CROSS  JOIN LATERAL (
   SELECT json_agg(to_json(c1.comment)) AS comments1
        , json_agg(json_build_object('comment', c1.comment)) AS comments2
        , json_agg(to_json(c1)) AS comments3
   FROM  (
      SELECT DISTINCT c.comment  -- folding dupes here
      FROM   comment c
      WHERE  c.photo_id = p.photo_id
   -- ORDER  BY comment --  any particular order?
      ) c1
   ) c2
CROSS  JOIN LATERAL (
   SELECT jsonb_agg(DISTINCT t) AS tags  -- demonstrating jsonb_agg
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)
   WHERE  pt.photo_id = p.photo_id
   ) t
WHERE  p.photo_id = 2;

Demonstrating 4 different techniques in comments1, comments2, comments3 (redundantly) and tags.

db<>fiddle here
Old: sqlfiddle backpatched to Postgres 9.3; sqlfiddle for Postgres 9.6

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
4

As stated in comments, json_agg does not serialize a row as an object, but builds a JSON array of the values that you pass it. You'll need row_to_json to turn your row into a JSON object, and then json_agg to perform the aggregation to an array:

SELECT json_agg(DISTINCT row_to_json(comment)) as tags
FROM
    photo
    LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
    LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
    LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id
PinnyM
  • 35,165
  • 3
  • 73
  • 81
  • Actually json_agg does implicitly convert a row into JSON. Have a look at this example I just made http://pastebin.com/8jVSNstd. But the problem with your query is that because there is no JSON comparison operator, it can't find distinct rows because they're all JSON. – Migwell May 06 '15 at 14:04
  • 2
    If I copy your exact code and run it, I get `ERROR: could not identify an equality operator for type json`, which is what I'm referring to. – Migwell May 06 '15 at 14:05
  • @Miguel Hmm, have you tried calling DISTINCT on the row, like this: json_agg(DISTINCT tag)? If that doesn't work, you'll likely need a subquery to accomplish this. – PinnyM May 06 '15 at 17:42
  • 2
    That doesn't work because the row is implicitly converted to json and you get the same error. If you know how to do this with a sub query I'd love to know because I can't think of anything. – Migwell May 06 '15 at 20:52