Get paginated rows and total count in single query

Question

Core requirement:
Find latest entries for a person_id by submission_date for specified filter criteria type, plan, status. There could be more such filters, but the logic to return latest by submission date is the same regardless. Two major uses one for paginated viewing in UI and second for generating reports.

WITH cte AS (
  SELECT * FROM (
    SELECT my_table.*, rank() OVER (PARTITION BY person_id ORDER BY submission_date DESC, last_updated DESC, id DESC) FROM my_table
    )  rank_filter 
      WHERE RANK=1 AND status in ('ACCEPTED','CORRECTED') AND type != 'CR' AND h_plan_id IN (10000, 20000)
)
SELECT
SELECT count(id) FROM cte group by id,
SELECT * FROM cte limit 10 offset 0;

The group by also does not work on a CTE. a union with all null in the count query might work for combining probably, but not sure.

The main reason I want to combine these two into 1 query is because the table is big and the window function is expensive. Currently I use separate queries which both basically run the same query twice.

Postgres version 12.

\d my_table;
                               Table "public.my_table"
                 Column   |            Type             | Collation | Nullable 
--------------------------+-----------------------------+-----------+----------
 id                       | bigint                      |           | not null 
 h_plan_id                | bigint                      |           | not null 
 h_plan_submitter_id      | bigint                      |           |          
 last_updated             | timestamp without time zone |           |          
 date_created             | timestamp without time zone |           |          
 modified_by              | character varying(255)      |           |          
 segment_number           | integer                     |           |          

 -- <bunch of other text columns>

 submission_date          | character varying(255)      |           |          
 person_id                | character varying(255)      |           |          
 status                   | character varying(255)      |           |          
 file_id                  | bigint                      |           | not null 
Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)
    "my_table_file_idx" btree (file_id)
    "my_table_hplansubmitter_idx" btree (h_plan_submitter_id)
    "my_table_key_hash_idx" btree (key_hash)
    "my_table_person_id_idx" btree (person_id)
    "my_table_segment_number_idx" btree (segment_number)
Foreign-key constraints:
    "fk38njesaryvhj7e3p4thqkq7pb" FOREIGN KEY (h_plan_id) REFERENCES health_plan(id) ON UPDATE CASCADE ON DELETE CASCADE
    "fk6by9668sowmdob7433mi3rpsu" FOREIGN KEY (h_plan_submitter_id) REFERENCES h_plan_submitter(id) ON UPDATE CASCADE ON DELETE CASCADE
    "fkb06gpo9ng6eujkhnes0eco7bj" FOREIGN KEY (file_id) REFERENCES x12file(id) ON UPDATE CASCADE ON DELETE CASCADE

Additional information Possible values for type are EN and CR with EN being about 70% of the data. Table column widths select avg_width from pg_stats where tablename='mytable'; is a total of 374 for 41 columns so about 9 per col.

The idea is to show some pages upfront to user, they can then filter by additional parameters like file_name(each file usually has about 5k entries), type(very low cardinality), member_add_id(high cardinality), plan_id(low cardinality, every 500k to a million entries will be associated to a plan id). The business requirement in all cases is to show just the latest record for a certain set of plan id's for a submission_date(for reports it is done per year). The order by id was just defensive coding, the same day can have multiple entries and even if someone edited the second last entry hence touching the last_updated timestamp, we want to show only the very last entry of the same data. This probably never happens and can be removed.

User can use this data to generate csv reports.

Result of explain for the query with right join below:

 Nested Loop Left Join  (cost=554076.32..554076.56 rows=10 width=17092) (actual time=4530.914..4530.922 rows=10 loops=1)
   CTE cte
     ->  Unique  (cost=519813.11..522319.10 rows=495358 width=1922) (actual time=2719.093..3523.029 rows=422638 loops=1)
           ->  Sort  (cost=519813.11..521066.10 rows=501198 width=1922) (actual time=2719.091..3301.622 rows=423211 loops=1)
                 Sort Key: mytable.person_id, mytable.submission_date DESC NULLS LAST, mytable.last_updated DESC NULLS LAST, mytable.id DESC
                 Sort Method: external merge  Disk: 152384kB
                 ->  Seq Scan on mytable  (cost=0.00..54367.63 rows=501198 width=1922) (actual time=293.953..468.554 rows=423211 loops=1)
                       Filter: (((status)::text = ANY ('{ACCEPTED,CORRECTED}'::text[])) AND (h_plan_id = ANY ('{1,2}'::bigint[])) AND ((type)::text <> 'CR'::text))
                       Rows Removed by Filter: 10158
   ->  Aggregate  (cost=11145.56..11145.57 rows=1 width=8) (actual time=4142.116..4142.116 rows=1 loops=1)
         ->  CTE Scan on cte  (cost=0.00..9907.16 rows=495358 width=0) (actual time=2719.095..4071.481 rows=422638 loops=1)
   ->  Limit  (cost=20611.67..20611.69 rows=10 width=17084) (actual time=388.777..388.781 rows=10 loops=1)
         ->  Sort  (cost=20611.67..21850.06 rows=495358 width=17084) (actual time=388.776..388.777 rows=10 loops=1)
               Sort Key: cte_1.person_id
               Sort Method: top-N heapsort  Memory: 30kB
               ->  CTE Scan on cte cte_1  (cost=0.00..9907.16 rows=495358 width=17084) (actual time=0.013..128.314 rows=422638 loops=1)
 Planning Time: 0.369 ms
 JIT:
   Functions: 9
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.947 ms, Inlining 4.983 ms, Optimization 178.469 ms, Emission 110.261 ms, Total 295.660 ms
 Execution Time: 4587.711 ms

Please (always) declare your version of Postgres, and define "total count" more closely. And you mean `rank()`, not `row_number()`? (So possibly multiple rows per `person_id`.) An actual `CREATE TABLE` statement showing data types and constraints might help. — Erwin Brandstetter, Apr 11 '20 at 17:11
`submission_date` and `last_updated` are not defined `NOT NULL`. That can be an issue with your current `ORDER BY`. Should those really be `NOT NULL`? — Erwin Brandstetter, Apr 11 '20 at 17:37
Which rows do you want to return, ordered in which way, and how many of them? Many bits and pieces of info in your question, but no clear definition for the core requirement. If you want the latest row per `person_id` only if that has `status in ('ACCEPTED','CORRECTED')`, then your current approach is incorrect to begin with. You'd have to filter by `status` *after* distilling the latest row for each `person_id` ... — Erwin Brandstetter, Apr 13 '20 at 00:33
I have updated core requirement at top of question now. "If you want the latest row per person_id only if that has status in ('ACCEPTED','CORRECTED')" Yes this is correct. — Abe, Apr 13 '20 at 00:50

Erwin Brandstetter · Accepted Answer · 2020-04-14T17:58:45.263

First things first: you can use results from a CTE multiple times in the same query, that's a main feature of CTEs.) What you have would work like this (while still using the CTE once only):

WITH cte AS (
   SELECT * FROM (
      SELECT *, row_number()  -- see below
                OVER (PARTITION BY person_id
                      ORDER BY submission_date DESC NULLS LAST  -- see below
                             , last_updated DESC NULLS LAST  -- see below
                             , id DESC) AS rn
      FROM  tbl
      ) sub
   WHERE  rn = 1
   AND    status IN ('ACCEPTED', 'CORRECTED')
   )
SELECT *, count(*) OVER () AS total_rows_in_cte
FROM   cte
LIMIT  10
OFFSET 0;  -- see below

Caveat 1: `rank()`

rank() can return multiple rows per person_id with rank = 1. DISTINCT ON (person_id) (like Gordon provided) is an applicable replacement for row_number() - which works for you, as additional info clarified. See:

Select first row in each GROUP BY group?

Caveat 2: `ORDER BY submission_date DESC`

Neither submission_date nor last_updated are defined NOT NULL. Can be an issue with ORDER BY submission_date DESC, last_updated DESC ... See:

PostgreSQL sort by datetime asc, null first?

Should those columns really be NOT NULL?

You replied:

Yes, all those columns should be non-null. I can add that constraint. I put it as nullable since we get data in files which are not always perfect. But this is very rare condition and I can put in empty string instead.

Empty strings are not allowed for type date. Keep the columns nullable. NULL is the proper value for those cases. Use NULLS LAST as demonstrated to avoid NULL being sorted on top.

Caveat 3: `OFFSET`

If OFFSET is equal or greater than the number of rows returned by the CTE, you get no row, so also no total count. See:

Run a query with a LIMIT/OFFSET and also get the total number of rows

Interim solution

Addressing all caveats so far, and based on added information, we might arrive at this query:

WITH cte AS (
   SELECT DISTINCT ON (person_id) *
   FROM   tbl
   WHERE  status IN ('ACCEPTED', 'CORRECTED')
   ORDER  BY person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC
   )
SELECT *
FROM  (
   TABLE  cte
   ORDER  BY person_id  -- ?? see below
   LIMIT  10
   OFFSET 0
   ) sub
RIGHT  JOIN (SELECT count(*) FROM cte) c(total_rows_in_cte) ON true;

Now the CTE is actually used twice. The RIGHT JOIN guarantees we get the total count, no matter the OFFSET. DISTINCT ON should perform OK-ish for the only few rows per (person_id) in the base query.

But you have wide rows. How wide on average? The query will likely result in a sequential scan on the whole table. Indexes won't help (much). All of this will remain hugely inefficient for paging. See:

Optimize query with OFFSET on large table

You cannot involve an index for paging as that is based on the derived table from the CTE. And your actual sort criteria for paging is still unclear (ORDER BY id ?). If paging is the goal, you desperately need a different query style. If you are only interested in the first few pages, you need a different query style, yet. The best solution depends on information still missing in the question ...

Radically faster

For your updated objective:

Find latest entries for a person_id by submission_date

(Ignoring "for specified filter criteria, type, plan, status" for simplicity.)

And:

Find the latest row per person_id only if that has status IN ('ACCEPTED','CORRECTED')

Based on these two specialized indices:

CREATE INDEX ON tbl (submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST)
WHERE  status IN ('ACCEPTED', 'CORRECTED'); -- optional

CREATE INDEX ON tbl (person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST);

Run this query:

WITH RECURSIVE cte AS (
   (
   SELECT t  -- whole row
   FROM   tbl t
   WHERE  status IN ('ACCEPTED', 'CORRECTED')
   AND    NOT EXISTS (SELECT FROM tbl
                      WHERE  person_id = t.person_id 
                      AND   (  submission_date,   last_updated,   id)
                          > (t.submission_date, t.last_updated, t.id)  -- row-wise comparison
                      )
   ORDER  BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
   LIMIT  1
   )

   UNION ALL
   SELECT (SELECT t1  -- whole row
           FROM   tbl t1
           WHERE ( t1.submission_date, t1.last_updated, t1.id)
               < ((t).submission_date,(t).last_updated,(t).id)  -- row-wise comparison
           AND    t1.status IN ('ACCEPTED', 'CORRECTED')
           AND    NOT EXISTS (SELECT FROM tbl
                              WHERE  person_id = t1.person_id 
                              AND   (   submission_date,    last_updated,    id)
                                  > (t1.submission_date, t1.last_updated, t1.id)  -- row-wise comparison
                              )
           ORDER  BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (t).id IS NOT NULL
   )
SELECT (t).*
FROM   cte
LIMIT  10
OFFSET 0;

Every set of parentheses here is required.

This level of sophistication should retrieve a relatively small set of top rows radically faster by using the given indices and no sequential scan. See:

Optimize GROUP BY query to retrieve latest row per user

submission_date should most probably be type timestamptz or date, not ~~character varying(255)~~ - which is an odd type definition in Postgres in any case. See:

Refactor foreign key to fields

Many more details might be optimized, but this is getting out of hands. You might consider professional consulting.

Important Caveat! good to know. I have updated the table creation information also. If you have any additional insights would be glad to hear! — Abe, Apr 11 '20 at 17:28
yes row_number() should work equally well. If used, how would the query change? — Abe, Apr 11 '20 at 17:29
That's helpful already .. if performance optimization is your goal, we'd need some more, yet. See instructions here: https://stackoverflow.com/tags/postgresql-performance/info. Most importantly: cardinality and data distribution. Roughly, how many rows in your table, how many per `person_id`, how many accepted, corrected, ... — Erwin Brandstetter, Apr 11 '20 at 17:35
@Abe: out of time right now, I'll come back to this later tonight to check for updates. I think there is more potential here ... — Erwin Brandstetter, Apr 11 '20 at 17:45
Thanks so much for looking into this. Yes, all those columns should be non-null. I can add that constraint. I put it as nullable since we get data in files which are not always perfect. But this is very rare condition and i can put in empty string instead. — Abe, Apr 11 '20 at 17:49
Total num rows ~500,000 but expected to be in tens of millions. Cardinality of status column is low, 95% will be ACCEPTED, rest can be REJECTED, CORRECTED etc. each person_id repeats usually 2 to 5 times in table, but can be in low tens also. so its cardinality is ok. additional col *not* mentioned is `type` with very low cardinality only 3 distinct values in the whole table and we need to filter by that type also in each query. — Abe, Apr 11 '20 at 18:05
@Abe: We're getting there. But the picture is still incomplete. You mentioned paging, with ~ 5 rows per `person_id` and millions of rows, you'd get around a million of pages. Nobody pages that much. And `LIMIT` / `OFFSET` is hugely inefficient for that. What is the precise purpose / use case of the query? Will there be more filters? You *might* go for a `MATERIALIZED VIEW` with paging based on an additional index there ... Please consolidate your question with all relevant information scattered in comments, incl. `type`. — Erwin Brandstetter, Apr 11 '20 at 22:18
Edited post with more information. Still digesting all the good stuff from your answer..:) highly appreciated. — Abe, Apr 12 '20 at 01:01
Also order by person_id would not result in the latest CORRECTED records by submission date to show up first in the paged data. That is a biz requirement, so probably I will need to use the partition by itself? — Abe, Apr 12 '20 at 02:03

Gordon Linoff · Answer 2 · 2020-04-12T02:06:37.003

You might try simplifying the query:

SELECT DISTINCT ON (person_id) t.*
FROM my_table t
WHERE status in ('ACCEPTED', 'CORRECTED')
ORDER BY person_id, submission_date DESC, last_updated DESC, id DESC

I am not sure if Postgres is smart enough to use an index on (person_id, submission_date DESC, last_updated DESC, id DESC, status) in this case, but it is worth a try.

You could speed this using an index on an expression: (status in ('ACCEPTED', 'CORRECTED'), person_id, submission_date DESC, last_updated DESC, id DESC).

EDIT:

If you want to sort by another column, you can use a subquery:

SELECT t.*
FROM (SELECT DISTINCT ON (person_id) t.*
      FROM my_table t
      WHERE status in ('ACCEPTED', 'CORRECTED')
      ORDER BY person_id, submission_date DESC, last_updated DESC, id DESC
     ) t
ORDER BY submission_date DESC

Is there a way to get around the order by person_id. The latest records by submission date need to be shown first preferably. I tried putting it second, did not work. — Abe, Apr 12 '20 at 02:01

Get paginated rows and total count in single query

2 Answers2

Caveat 1: `rank()`

Caveat 2: `ORDER BY submission_date DESC`

Caveat 3: `OFFSET`

Interim solution

Radically faster

Linked

Get paginated rows and total count in single query

2 Answers2

Caveat 1: rank()

Caveat 2: ORDER BY submission_date DESC

Caveat 3: OFFSET

Interim solution

Radically faster

Linked

Caveat 1: `rank()`

Caveat 2: `ORDER BY submission_date DESC`

Caveat 3: `OFFSET`