PostgreSQL distinct rows joined with a count of distinct values in one column

Question

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:

  a  | b | u  | t 
-----+---+----+----
 foo | 1 |  1 | 10
 foo | 1 |  2 | 11
 foo | 1 |  2 | 11
 foo | 2 |  4 | 1
 foo | 3 |  5 | 2
 bar | 1 |  6 | 2
 bar | 2 |  7 | 2
 bar | 2 |  8 | 3
 bar | 3 |  9 | 4
 bar | 4 | 10 | 5
 bar | 5 | 11 | 6
 baz | 1 | 12 | 1
 baz | 1 | 13 | 2
 baz | 1 | 13 | 2
 baz | 1 | 13 | 3

There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.
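For anyone who wants to reproduce the setup, here is a minimal sketch of the table, the sample rows, and the three indexes described above. The column types, the primary key name `id`, and the index names are assumptions; the question does not state them.

```sql
-- Hypothetical setup: types, the "id" column name and index names are assumed.
CREATE TABLE t (
   id serial PRIMARY KEY,  -- the omitted SERIAL primary key
   a  text,                -- may hold values longer than 4k characters
   b  integer,
   u  integer,
   t  integer
);

-- Sample rows from the listing above, duplicates included.
INSERT INTO t (a, b, u, t) VALUES
   ('foo', 1,  1, 10), ('foo', 1,  2, 11), ('foo', 1,  2, 11),
   ('foo', 2,  4,  1), ('foo', 3,  5,  2),
   ('bar', 1,  6,  2), ('bar', 2,  7,  2), ('bar', 2,  8,  3),
   ('bar', 3,  9,  4), ('bar', 4, 10,  5), ('bar', 5, 11,  6),
   ('baz', 1, 12,  1), ('baz', 1, 13,  2), ('baz', 1, 13,  2),
   ('baz', 1, 13,  3);

-- The three indexes mentioned in the question.
CREATE INDEX t_md5_a_idx   ON t (md5(a));
CREATE INDEX t_b_idx       ON t (b);
CREATE INDEX t_md5_a_b_idx ON t (md5(a), b);
```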

I'm trying to build a query which will return the following results:

  a  | b | u  | t  | z 
-----+---+----+----+---
 foo | 1 |  1 | 10 | 3
 foo | 1 |  2 | 11 | 3
 foo | 2 |  4 | 1  | 3
 foo | 3 |  5 | 2  | 3
 bar | 1 |  6 | 2  | 5
 bar | 2 |  7 | 2  | 5
 bar | 2 |  8 | 3  | 5
 bar | 3 |  9 | 4  | 5
 bar | 4 | 10 | 5  | 5
 bar | 5 | 11 | 6  | 5

In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.

I can get just the z filter working as follows:

SELECT a, COUNT(b) AS z FROM (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
  HAVING COUNT(b) > 2;

However, I'm stumped on combining this with the rest of the data in the table.

What's the most efficient way to do this?

Erwin Brandstetter · Accepted Answer · 2020-12-30T10:35:51.360

Your first step can be simpler already:

SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM   t
GROUP  BY 1
HAVING count(DISTINCT b) > 2;

I am working with md5(a) in place of a because a can obviously be very long and you already have indexes on md5(a) and on (md5(a), b).

Since your table is big, you need an efficient query. The queries below should be among the fastest possible solutions, given adequate index support. Your index on (md5(a), b) is instrumental but, assuming b, u, and t are small columns, an index on (md5(a), b, u, t) would be even better for the second step of the query (the join back to t).
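As a sketch, the wider index could be created like this. The index name is made up, and CONCURRENTLY is only a suggestion for a table of this size: it avoids blocking writes during the build, at the cost of a slower build, and it cannot run inside a transaction block.

```sql
-- Hypothetical index name. Drop CONCURRENTLY if blocking writes is acceptable.
CREATE INDEX CONCURRENTLY t_md5_a_b_u_t_idx ON t (md5(a), b, u, t);
```

Whether the planner actually prefers it over the existing (md5(a), b) index depends on your data distribution, so check the query plan afterwards.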

Your desired end result:

SELECT DISTINCT ON (md5(t.a), b, u, t)
       t.a, t.b, t.u, t.t, a.z
FROM  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) a
JOIN   t ON md5(t.a) = md5_a
ORDER  BY md5(t.a), t.b, t.u, t.t;  -- optional; must start with the DISTINCT ON expressions

Or, probably faster still:

SELECT a, b, u, t, z
FROM  (
   SELECT DISTINCT ON (1, 2, 3, 4)
          md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
   FROM   t
   ) t
JOIN  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) z USING (md5_a)
ORDER  BY 1, 2, 3, 4;  -- optional
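Which variant wins depends on the actual data and indexes, so it is worth measuring both. A sketch of how to do that for the second variant, using standard EXPLAIN options (note that ANALYZE really executes the query, so expect it to take as long as the query itself):

```sql
-- Shows the chosen plan, actual row counts and timings, and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT a, b, u, t, z
FROM  (
   SELECT DISTINCT ON (1, 2, 3, 4)
          md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
   FROM   t
   ) t
JOIN  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) z USING (md5_a);
```

Run the same wrapper around the first variant and compare the reported execution times over a few repetitions, since the first run may be skewed by a cold cache.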

Detailed explanation for DISTINCT ON:

Select first row in each GROUP BY group?

Does `DISTINCT ON ()` have any performance benefit over `Lateral Join`, or vice versa? — Abhishek Dalvi, Dec 30 '20 at 10:00
@AbhishekDalvi: Two very different tools. (I removed the parts of my answer now that were voided by question clarification.) Detailed assessment of `DISTINCT ON` performance in the linked answer: https://stackoverflow.com/a/7630564/939860. Details for `LATERAL`: https://stackoverflow.com/a/28557803/939860 — Erwin Brandstetter, Dec 30 '20 at 10:38
