Incremental DISTINCT / GROUP BY operation

Question

I have a simple two-stage SQL query that operators on two tables A and B, where I use a sub-select to retrieve a number of IDs of table A that are stored as foreign keys in B, using a (possibly complex) query on table B (and possibly other joined tables). Then, I want to simply return the first x IDs of A. I tried using a query like this:

SELECT sq.id
FROM  (
    SELECT a_id AS id, created_at
    FROM   B
    WHERE  ...
    ORDER  BY created_at DESC
    ) sq 
GROUP BY sq.id
ORDER BY max(sq.created_at) DESC
LIMIT 10;

which is quite slow as Postgres seems to perform the GROUP BY / DISTINCT operation on the whole result set before limiting it. If I LIMIT the sub-query (e.g. to 100), the performance is just fine (as I'd expect), but of course it's no longer guaranteed that there will be at least 10 distinct a_id values in the resulting rows of sq.

Similarly, the query

SELECT a_id AS id
FROM   B
WHERE  ...
GROUP  BY id
ORDER  BY max(created_at) DESC
LIMIT  10

is quite slow as Postgres seems to perform a sequential scan on B instead of using an (existing) index. If I remove the GROUP BY clause it uses the index just fine.

The data in table B is such that most rows contain different a_ids, hence even without the GROUP BY most of the returned IDs will be different. The goal I pursue with the grouping is to assure that the result set always contains a given number of entries from A.

Is there a way to perform an "incremental DISTINCT / GROUP BY"? In my naive thinking it would suffice for Postgres to produce result rows and group them incrementally until it reaches the number specified by LIMIT, which in most cases should be nearly instantaneous as most a_id values are different. I tried various ways to query the data but so far I didn't find anything that works reliably.

The Postgres version is 9.6, the data schema as follows:

                              Table "public.a"
 Column |       Type        |                   Modifiers                    
--------+-------------------+------------------------------------------------
 id     | bigint            | not null default nextval('a_id_seq'::regclass)
 bar    | character varying | 
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
    "ix_a_bar" btree (bar)
Referenced by:
    TABLE "b" CONSTRAINT "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

                                      Table "public.b"
   Column   |            Type             |                    Modifiers                     
------------+-----------------------------+--------------------------------------------------
 id         | bigint                      | not null default nextval('b_id_seq'::regclass)
 foo        | character varying           | 
 a_id       | bigint                      | not null
 created_at | timestamp without time zone | 
Indexes:
    "b_pkey" PRIMARY KEY, btree (id)
    "ix_b_created_at" btree (created_at)
    "ix_b_foo" btree (foo)
Foreign-key constraints:
    "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

Please include an abbreviated schema and some sample data. Also, show the full query, as these details matter. — Tim Biegeleisen, Oct 24 '16 at 13:57
`distinct` is ***NOT*** a function. Writing `distinct (id)` makes no sense. — , Oct 24 '16 at 13:59
Please read [postgresql-performance](http://stackoverflow.com/tags/postgresql-performance/info) then **[edit]** your question and provide the missing information. — , Oct 24 '16 at 14:00
There may be a way. It's still unclear *which* 10 rows you want to retrieve exactly? Arbitrary / random pick? Or the first 10 sorted by `a_id` or by some other criteria? The two queries you display are not equivalent. Also: *always* declare your version of Postgres. And add the relevant definition of table `B` and relevant indexes you have. — Erwin Brandstetter, Oct 24 '16 at 14:49
Yes, I want to retrieve just the first 10 rows. I suppose you mean that in the first query there is no ORDER BY in the outer statement? I should probably add the order by field to the inner query and then add a `ORDER BY` in the outer one. — ThePhysicist, Oct 24 '16 at 14:55
The first x distinct elements, sorted by e.g. max(order_by_field). — ThePhysicist, Oct 24 '16 at 14:57
.. which can be simplified to *first x distinct elements, sorted by `order_by_field`* — Erwin Brandstetter, Oct 24 '16 at 15:02
I edited the question and added a data schema, hope this clarifies it. My actual data schema is more complicated so I'm not sure if this simplified schema will reproduce the situation accurately though. — ThePhysicist, Oct 24 '16 at 15:04
The best solution also depends on the nature of your `WHERE` clause, most importantly on *how selective* it can be, and whether we can use an index for it ... — Erwin Brandstetter, Oct 24 '16 at 15:25
I understand. It's just surprising that Postgres is not able to find a good query plan for this, as removing the GROUP BY clause (and accepting the possibility of duplicates in the `a_id` field) yields a very good query plan that uses all the correct indexes. — ThePhysicist, Oct 24 '16 at 15:28
And do you really just need `a_id`, or the whole row containing it? — Erwin Brandstetter, Oct 24 '16 at 15:31

score 1 · Answer 1 · answered Oct 24 '16 at 14:07

1

The only way the planner has a chance to avoid sorting the whole table is if you have an index on the complete ORDER BY clause.

Then an index scan can be chosen to get the correct ordering, and the first ten result rows may be found quickly.

answered Oct 24 '16 at 14:07

Laurenz Albe

209,280
17
206
263

score 1 · Accepted Answer · edited May 23 '17 at 10:30

This problem is much more complex than it might seem at a first glance.

If ...

your criteria are not very selective (much more than 10 distinct a_id qualify)
you don't have many duplicate a_id in table B (like you stated)

then there is a very fast way.

To simplify a bit I assume created_at is also defined NOT NULL, or you need to do more.

WITH RECURSIVE top10 AS (
   ( -- extra parentheses required
   SELECT a_id, ARRAY[a_id] AS id_arr, created_at
   FROM   b
   WHERE  ...  -- your other filter conditions here
   ORDER  BY created_at DESC, a_id DESC  -- both NOT NULL
   LIMIT  1
   )
   UNION ALL -- UNION ALL, not UNION, since we exclude dupes a priori
   (
   SELECT b.a_id, id_arr || b.a_id, b.created_at
   FROM   top10 t
   JOIN   b ON (b.created_at, b.a_id)
             < (t.created_at, t.a_id)  -- comparing ROW values
           AND  b.a_id <> ALL (t.id_arr)
   WHERE  ... -- repeat conditions
   ORDER  BY created_at DESC, a_id DESC
   LIMIT  1
   )
   )
SELECT a_id
FROM   top10
LIMIT  10;

Ideally supported by an index on (created_at DESC, a_id DESC) (or just (created_at, a_id)).

Depending on your other WHERE conditions, other (partial?) indexes may serve even better.

This is particularly efficient for a small result set. Else, and depending on various other details, other solutions may be faster.

Related (with much more explanation):

Great, I'll check it out! I also thought about using a recursive CTE to generate the rows but I wondered if there is a simpler way to do it. — ThePhysicist, Oct 24 '16 at 15:47
@ThePhysicist: My first query had a bug, I replaced it with a working version. — Erwin Brandstetter, Oct 24 '16 at 18:36

Incremental DISTINCT / GROUP BY operation

2 Answers2