
Consider the following table:

create table user_tasks
(
    id                      bigserial            primary key,
    type                    varchar(255)         not null,
    ...
);

I need to select 20 rows per page, but equally spread by type. Say, there are 10 distinct types, so I'm gonna need 2 rows of each. Every type has a different subset of filters that can be applied to it.

Here is how I do it right now, pseudo code:

array types = [...];

Collection typesSql = new Collection();
foreach (type of types) {
    typesSql.push('SELECT * FROM user_tasks WHERE type = :type AND ... LIMIT 10');
}

string sql = typesSql->implode(' UNION ALL ');

array result = DB::query(sql);

I select 10 rows of each type, because I don't know in advance if there are enough rows of this particular type in the database, then I reduce the overall number from 100 to 20 if needed.

The query is quite slow, probably because I union all the 10 different queries. Is it possible to do this more efficiently? For example, is it possible to write a query that selects a new type only if fewer than 20 rows have been selected from all the previous types?
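For reference, the current approach can be sketched end to end in Python, using an in-memory SQLite database as a stand-in for Postgres (the sample data and the interleave-then-trim reduction step are assumptions; SQLite needs each LIMIT-ed branch wrapped in a subquery before UNION ALL):

```python
import sqlite3
from itertools import zip_longest

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_tasks (id INTEGER PRIMARY KEY, type TEXT NOT NULL)")
types = [f"type{i}" for i in range(1, 11)]
# 30 sample rows per type
con.executemany("INSERT INTO user_tasks (type) VALUES (?)",
                [(t,) for t in types for _ in range(30)])

# One branch per type, each over-fetching 10 rows, glued with UNION ALL
sql = " UNION ALL ".join(
    "SELECT * FROM (SELECT id, type FROM user_tasks WHERE type = ? LIMIT 10)"
    for _ in types
)
rows = con.execute(sql, types).fetchall()   # up to 100 rows

# App-side reduction from 100 to 20: interleave by type, then trim
by_type = {}
for row in rows:
    by_type.setdefault(row[1], []).append(row)
interleaved = [r for batch in zip_longest(*by_type.values()) for r in batch if r]
page = interleaved[:20]                     # 2 rows of each of the 10 types
```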

Majesty

6 Answers


I believe this can be done in 2 steps:

  • Step 1: Count the number of groups (type column in your case)
  • Step 2: Rank the observations with relevance to type grouping

I'll try my best to express the idea with SQL queries:

select 
    count(distinct type) as types_num
from user_tasks

We can store the number of types in a variable, say types_num;

Then, we can start the main part of the query:

select
    id,
    ranked_ids
from (
    select
        id,
        rank() over (partition by type order by id) as ranked_ids
    from user_tasks
) ranked
where
    ranked_ids <= DIV(20, :types_num) -- integer division

The DIV part performs integer division, so the number of observations from each group stays representative given the limit of 20 rows per page. (Note that window functions cannot appear directly in WHERE or HAVING in PostgreSQL, so the ranking has to be wrapped in a subquery.)
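The two steps above can be exercised as a runnable sketch in Python, with an in-memory SQLite database standing in for Postgres (sample data is assumed; SQLite's `/` on integers truncates, like Postgres' div()):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_tasks (id INTEGER PRIMARY KEY, type TEXT NOT NULL)")
# 10 types, 5 rows each
con.executemany("INSERT INTO user_tasks (type) VALUES (?)",
                [(f"type{i}",) for i in range(1, 11) for _ in range(5)])

# Step 1: count the number of groups
types_num = con.execute("SELECT count(DISTINCT type) FROM user_tasks").fetchone()[0]

# Step 2: rank rows within each type and keep 20 / types_num per type.
# Window functions are not allowed in WHERE/HAVING, hence the subquery.
rows = con.execute("""
    SELECT id, type
    FROM (
        SELECT id, type,
               rank() OVER (PARTITION BY type ORDER BY id) AS rnk
        FROM user_tasks
    ) ranked
    WHERE rnk <= 20 / ?
""", (types_num,)).fetchall()
```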

Niqua
  • Hi! Thanks for the idea, though it feels like an even worse solution: there are millions of rows in the table, so the count query would be a disaster. – Majesty Aug 09 '23 at 06:33
  • "there are millions of rows in the table, so the count query would be a disaster." Not if it is a columnar database. – Thiago Mata Aug 21 '23 at 00:07
  • @Niqua the count(distinct type) takes 4s 802ms, so it does not seem like an improvement – Majesty Aug 21 '23 at 07:29
  • But despite that, your approach is missing the important part: there are type-specific filters. I am not criticizing you, just trying to figure it out, thanks! – Majesty Aug 21 '23 at 07:55

Pagination on a million-row table will never be fast. You need row_number() over the partition by type, and you need to sort the result on it to be able to paginate. PostgreSQL also provides ntile(), which you can use to assign a number that helps the sorting:

select type, rn
from (
    select type, row_number() over(partition by type order by id) as rn
    -- the parameter of the ntile() should be the count(*) of the table
    -- divided by the number of rows of each type you want on the page
    , ntile(:ntiles) over(order by type, id) as tile
  from user_tasks
) t
order by rn, tile, type
offset (:page_num-1)*:page_size rows fetch next :page_size rows only
;

Here is a fiddle (written for Oracle, just because it is easier to generate test rows with connect by): https://dbfiddle.uk/qTa1M1RM
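The same idea can be checked with a runnable Python sketch against an in-memory SQLite database (sample data is an assumption; SQLite uses LIMIT/OFFSET in place of FETCH NEXT, and the ntile count is interpolated since it is a trusted integer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_tasks (id INTEGER PRIMARY KEY, type TEXT NOT NULL)")
# 10 types, 5 rows each
con.executemany("INSERT INTO user_tasks (type) VALUES (?)",
                [(f"type{i:02d}",) for i in range(1, 11) for _ in range(5)])

page_num, page_size = 1, 20
total = con.execute("SELECT count(*) FROM user_tasks").fetchone()[0]
ntiles = total // 2     # count(*) divided by rows wanted per type per page

page = con.execute(f"""
    SELECT type, rn
    FROM (
        SELECT type,
               row_number() OVER (PARTITION BY type ORDER BY id) AS rn,
               ntile({ntiles}) OVER (ORDER BY type, id) AS tile
        FROM user_tasks
    ) t
    ORDER BY rn, tile, type
    LIMIT ? OFFSET (? - 1) * ?
""", (page_size, page_num, page_size)).fetchall()
```

Page 1 comes back with the first two rows of each of the 10 types, because sorting on rn first puts every type's rank-1 row ahead of every rank-2 row.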

p3consulting
CREATE TABLE user_tasks (
    id                      bigserial            PRIMARY KEY,
    type                    varchar(255)         NOT NULL
);

INSERT INTO user_tasks (type) VALUES ('type1'), ('type1'), ('type1'), ('type1'), ('type1');
INSERT INTO user_tasks (type) VALUES ('type2'), ('type2'), ('type2'), ('type2'), ('type2');
INSERT INTO user_tasks (type) VALUES ('type3'), ('type3'), ('type3'), ('type3'), ('type3');
INSERT INTO user_tasks (type) VALUES ('type4'), ('type4'), ('type4'), ('type4'), ('type4');
INSERT INTO user_tasks (type) VALUES ('type5'), ('type5'), ('type5'), ('type5'), ('type5');
INSERT INTO user_tasks (type) VALUES ('type6'), ('type6'), ('type6'), ('type6'), ('type6');
INSERT INTO user_tasks (type) VALUES ('type7'), ('type7'), ('type7'), ('type7'), ('type7');
INSERT INTO user_tasks (type) VALUES ('type8'), ('type8'), ('type8'), ('type8'), ('type8');
INSERT INTO user_tasks (type) VALUES ('type9'), ('type9'), ('type9'), ('type9'), ('type9');
INSERT INTO user_tasks (type) VALUES ('type10'), ('type10'), ('type10'), ('type10'), ('type10');

WITH RankedTasks AS (
  SELECT
    *,
    row_number() OVER (PARTITION BY type ORDER BY type) AS rn -- note: ORDER BY on the partition column makes the numbering arbitrary; use e.g. ORDER BY id for a deterministic pick
  FROM user_tasks
 -- WHERE -- Your filters here
)

SELECT *
FROM RankedTasks
WHERE rn <= 2
LIMIT 20;

id  type    rn
 5  type1    1
 4  type1    2
50  type10   1
49  type10   2
 6  type2    1
 7  type2    2
11  type3    1
12  type3    2
16  type4    1
20  type4    2
21  type5    1
22  type5    2
26  type6    1
27  type6    2
31  type7    1
32  type7    2
36  type8    1
37  type8    2
45  type9    1
44  type9    2

fiddle

Amira Bedhiafi

Since every type has a different subset of filters, you are bound to union several queries, as you showed in your description.

Then, for the pagination to be efficient, you'll need to trace the page position for each type. You want to select 3 rows per type: display the first 2 rows and start the next page from the position of the 3rd.

Let's display the most recent records, using the id field to order records in descending order. If you need other criteria, use a corresponding unique index.

Starting page per type:

SELECT ...
FROM user_tasks
WHERE ... AND type = :typeN
ORDER BY id DESC
LIMIT 3

Normally this should use the id primary key index to scan from the end of the table and quickly fetch the 3 matching rows, but it may get another execution plan if the filter condition is also indexed and/or the types are unevenly distributed. Use EXPLAIN to find out and address if needed.

Then you UNION ALL for all the types, like you showed above. Don't forget to parenthesize individual queries since you have ORDER BY / LIMIT clauses inside.

The final query for the 1st page looks like this:

(SELECT ... WHERE ... type = :type1 ...)
UNION ALL
(SELECT ... WHERE ... type = :type2 ...)
UNION ALL
...
UNION ALL
(SELECT ... WHERE ... type = :type10 ...)

Then, with 10 types and 3 rows per type, you should get 30 rows back. As mentioned, you display 2 rows of each type and keep the 3rd for the Show more link. This link will carry 10 ids giving the position where the next page continues for each type.

If a specific type has fewer than 3 rows, that type is exhausted and you'll pass NULL as the next position for this type. If you have 10 NULLs to pass, all rows are exhausted and no link should be displayed.

Next page per type:

SELECT ...
FROM user_tasks
WHERE ... AND type = :typeN AND id <= :idN
ORDER BY id DESC
LIMIT 3

Only the AND id <= :idN part is added to each Nth type's request.

This technique is sufficient (and fastest) if a single Show more link at the end of the list is acceptable.

If you need more classical pagination (First-1-2-3-...-Last), adapt accordingly, but building the Last link requires scanning the whole table per type, so a caching layer storing page references may be needed to avoid frequent scans.
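A runnable sketch of this "Show more" keyset scheme, in Python with an in-memory SQLite database standing in for Postgres (the fetch_page helper, the sample data, and the choice to simply drop exhausted types from the cursor map are assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_tasks (id INTEGER PRIMARY KEY, type TEXT NOT NULL)")
types = [f"type{i}" for i in range(1, 11)]
# 8 sample rows per type
con.executemany("INSERT INTO user_tasks (type) VALUES (?)",
                [(t,) for t in types for _ in range(8)])

def fetch_page(cursors):
    """cursors maps type -> id to continue from (None = first page)."""
    branches, params = [], []
    for t, last_id in cursors.items():
        cond = "type = ?" + ("" if last_id is None else " AND id <= ?")
        branches.append(f"SELECT * FROM (SELECT id, type FROM user_tasks "
                        f"WHERE {cond} ORDER BY id DESC LIMIT 3)")
        params += [t] if last_id is None else [t, last_id]
    rows = con.execute(" UNION ALL ".join(branches), params).fetchall()
    by_type = {}
    for row in rows:
        by_type.setdefault(row[1], []).append(row)
    page = [r for g in by_type.values() for r in g[:2]]   # display 2 per type
    # the 3rd row's id is where the next page continues; exhausted types drop out
    nxt = {t: g[2][0] for t, g in by_type.items() if len(g) == 3}
    return page, nxt

page1, cur = fetch_page({t: None for t in types})
page2, cur = fetch_page(cur)
```

Each page fetch touches only 3 index entries per type, regardless of table size, which is what makes keyset pagination fast.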

Stas Trefilov
WITH RECURSIVE RowNumCTE AS (
  SELECT
    id,
    type,
    ROW_NUMBER() OVER (PARTITION BY type ORDER BY id) AS rn
  FROM
    user_tasks
  -- WHERE /* type-specific filters */
)
, RecursiveCTE AS (
  SELECT
    id,
    type,
    rn,
    1 AS page
  FROM
    RowNumCTE
  WHERE
    rn = 1
  UNION ALL
  SELECT
    r.id,
    r.type,
    r.rn,
    CASE WHEN rc.page = 20 THEN 1 ELSE rc.page + 1 END AS page
  FROM
    RowNumCTE r
  JOIN
    RecursiveCTE rc ON r.type = rc.type AND r.rn = rc.rn + 1
  WHERE
    rc.page < 20
)
SELECT
  id,
  type,
  ROW_NUMBER() OVER (ORDER BY rn, type) AS global_rn
FROM
  RecursiveCTE
ORDER BY
  global_rn;

We use 2 CTEs here. The ROW_NUMBER() window function assigns a row number (rn) to each row within its type partition, ordering the rows within each type by a specific criterion (in this example, id). The result is stored in a common table expression (CTE) called RowNumCTE.

RecursiveCTE then iteratively selects rows while distributing them equally across types, walking each type's rows in rn order and stopping after 20 steps per type. A "global row number" (global_rn) maintains a single ordering across all types: the main query outside the recursive CTE selects id and type from RecursiveCTE and orders the results by global_rn, so the rows come out sequentially even though they were selected in a distributed manner. The rn value within each type partition selects rows within each type based on the specified ordering (e.g. id), while global_rn distributes the rows evenly across types in the final output.

Note: recursive CTEs are expensive, so this query will take some time to execute on a large table.
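As a sanity check, the query above can be exercised end to end in Python against an in-memory SQLite database (a sketch under assumptions: sample data of 10 types with 5 rows each, and the recursive join written as a comma join, which SQLite's recursive CTEs accept):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_tasks (id INTEGER PRIMARY KEY, type TEXT NOT NULL)")
con.executemany("INSERT INTO user_tasks (type) VALUES (?)",
                [(f"type{i}",) for i in range(1, 11) for _ in range(5)])

rows = con.execute("""
    WITH RECURSIVE RowNumCTE AS (
      SELECT id, type,
             ROW_NUMBER() OVER (PARTITION BY type ORDER BY id) AS rn
      FROM user_tasks
    ),
    RecursiveCTE AS (
      SELECT id, type, rn, 1 AS page
      FROM RowNumCTE
      WHERE rn = 1
      UNION ALL
      SELECT r.id, r.type, r.rn,
             CASE WHEN rc.page = 20 THEN 1 ELSE rc.page + 1 END
      FROM RowNumCTE r, RecursiveCTE rc
      WHERE r.type = rc.type AND r.rn = rc.rn + 1 AND rc.page < 20
    )
    SELECT id, type,
           ROW_NUMBER() OVER (ORDER BY rn, type) AS global_rn
    FROM RecursiveCTE
    ORDER BY global_rn
""").fetchall()
```

With only 5 rows per type, no type hits the 20-step cap, so all 50 rows come back, ordered rank-by-rank across types.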


The feature you're looking for is called Window Functions in Postgres. In a nutshell, they allow you to create "windows" (that work somewhat like GROUP BY expressions) where you can perform operations like first, last, previous, next, etc. See the examples and documentation in the PostgreSQL manual.

For your example schema (if we assume another column called "value" that you would use to find, say, the 2 rows with the largest value), the following query would do what you want:

  SELECT *
    FROM ( SELECT user_tasks.*
                , RANK() 
                  OVER (PARTITION BY type
                            ORDER BY value DESC
                       ) r
             FROM user_tasks
         ) ranked
   WHERE r <= 2;

Of course, this only gives the first "page" you want. It's unclear to me exactly how you want this to behave as you move through the result set. Do you want every type to be represented on each page? Do you want only the 10 most populous types on the early pages (limited to 2 rows per type), until those types are exhausted at which point you start showing results from the less populated types? This seems like a weird requirement.

How are the results even sorted? Do you want just random values from each type?

Requiring a different subset of filters for each type pretty much guarantees this will be impossible to do with a single SQL statement, unless you can have a separate table that contains the filter conditions for each type and you apply them to the type data by joining to this filter table. As it stands, your query requirements need to be more formally defined.

Steve Broberg