PostgreSQL find top N rows grouped by date

Question

I'm working on a typical blog app and have a view that returns the following data:

| post_id | title | publish_on | tag_id | tag_name |

| 1 | Why is Postgres awesome                | 2014-09-02 | 1    | tech |
| 1 | Why is Postgres awesome                | 2014-09-02 | 2    | postgres |
| 2 | How to ask a question on stackoverflow | 2014-09-10 | 1    | tech |
| 2 | How to ask a question on stackoverflow | 2014-09-10 | 2    | postgres |
| 2 | How to ask a question on stackoverflow | 2014-09-10 | 3    | guide |
| 3 | This is a draft                        | null       | null | null |
| 4 | This is something else without a tag   | 2014-10-10 | null | null |
| 5 | This question is also published on 9/2 | 2014-09-02 | null | null |
| 6 | And so is this                         | 2014-09-02 | 1    | tech |
| 7 | But this one is on 9/10                | 2014-09-10 | 3    | guide|
| 8 | This is on 10/10                       | 2014-10-10 | null | null |
| 9 | And so is this                         | 2014-10-10 | 2    | postgres |
| 10| This is another draft                  | null       | null | null |

I'm looking to group the posts by publish_on date and then select the top 3 posts for each bucket(this will be presented in the dashboard so that the user can know what posts will be published today, sometime next week and later) Now I've tried these solutions which use something like:

ROW_NUMBER() OVER (PARTITION BY publish_on ORDER BY publish_on DESC)

but since rows can be duplicated due to multiple tags, those queries fail. I've also tried a combination of various PARTION BY criteria, but I guess I don't understand them well enough to get it working.

Any help/pointers appreciated!

Update: Expected Output

For each publish_on date, I want to get N (3) posts expected to be published on that date.

| 1 | Why is Postgres awesome                | 2014-09-02 | 1    | tech |
| 1 | Why is Postgres awesome                | 2014-09-02 | 2    | postgres |
| 5 | This question is also published on 9/2 | 2014-09-02 | null | null |
| 6 | And so is this                         | 2014-09-02 | 1    | tech |

| 2 | How to ask a question on stackoverflow | 2014-09-10 | 1    | tech |
| 2 | How to ask a question on stackoverflow | 2014-09-10 | 2    | postgres |
| 2 | How to ask a question on stackoverflow | 2014-09-10 | 3    | guide |
| 7 | But this one is on 9/10                | 2014-09-10 | 3    | guide|

| 4 | This is something else without a tag   | 2014-10-10 | null | null |
| 8 | This is on 10/10                       | 2014-10-10 | null | null |
| 9 | And so is this                         | 2014-10-10 | 2    | postgres |

| 3 | This is a draft                        | null       | null | null |
| 10| This is another draft                  | null       | null | null |

Hope this makes the question a little more clear to understand.

are tag_id and tag_name related? I assume so, but tag_id has the related tag_name 'postgres' and 'guide'. — simo.3792, Sep 03 '14 at 04:13
@simo.3792095 sorry, they are related. Will update the table to reflect that. Thanks! — Checksum, Sep 03 '14 at 04:14
I assume post_id's 4, 8 & 10 will appear grouped on the 10th of october. You currently show post_id 4 under 10th of September. — simo.3792, Sep 03 '14 at 05:02
If you post scripts to create table and fill data it would be easier to test solutions — Abylay Sabirgaliyev, Sep 03 '14 at 05:46

simo.3792 · Accepted Answer · 2014-09-03T05:27:13.353

Is this what you are looking for? SQL Fiddle

SELECT * 
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY tag_name order by publish_on DESC) AS r,
    t.*
    from blog t ) x
where x.r <= 3

Explanation and questions:

I assume by "each bucket" you mean tag_name (or tag_id). Then you only want the 3 most recent post within "each bucket". If a post is tagged multiple times, then how do you want to treat them - appear once per tag - or appear only once per result set?

EDIT

This now displays the results as you expect. SQL Fiddle for this here.

SELECT DISTINCT x.Post_id, y.title, x.Publish_on, y.tag_id, y.tag_name
FROM blog y
INNER JOIN (SELECT ROW_NUMBER() OVER (PARTITION BY publish_on order by publish_on DESC) AS r,
    t.post_id, t.publish_on
    from (SELECT DISTINCT s.post_id, s.publish_on
          FROM blog s) t 
           ) x ON x.post_id = y.post_id
where x.r <= 3
ORDER BY x.publish_on

The main problem that is adding to your complexity, is that the table structure is not normalised. This should in fact be 3 tables, so that the descriptions and date will not be duplicated in different rows, ie

CREATE TABLE blog
(post_id int not null,
 title varchar(50) not null,
 publish_on date)

CREATE TABLE blog_tag
(post_id int not null,
 tag_ig int not null)

CREATE TABLE tag
(tag_id int not null,
 tag_name varchar(10) not null)

Then the SQL can be replaced with see full SQL Fidle for this here.

SELECT x.Post_id, x.title, x.Publish_on, t.tag_id, t.tag_name
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY publish_on order by publish_on DESC) AS r,
    b.*
    from blog b) x
LEFT JOIN blog_tag bt ON bt.post_id = x.post_id
LEFT JOIN tag t ON t.tag_id = bt.tag_id
WHERE x.r <= 3
ORDER BY x.publish_on, x.post_id, t.tag_id

I've updated the question with expected output. By bucket I meant "grouping of date". Can you take another look at the question? — Checksum, Sep 03 '14 at 04:30
Works great! The tables are actually normalized, and the sample data comes from a view. Both options you posted work, but since I have control over the view, I'm going to modify it accordingly to use option number 2. Thanks! — Checksum, Sep 03 '14 at 06:13

PostgreSQL find top N rows grouped by date

1 Answers1