Your query looks good already. Use a plain [INNER] JOIN
instead or LEFT [OUTER] JOIN
, like Gordon suggested. But that won't change much.
You mention that table B
has only ...
a max of hundred rows
while table A
has ...
thousands of rows
If there are many rows per created_by
(which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)
Essential ingredient is this multicolumn index:
CREATE INDEX ON a (org_id, created_by);
It can replace a simple index on just (org_id)
and works for your simple query as well. See:
There are two complications for your case:
DISTINCT
- 0-n
org_id
resulting from org_name like '%myorg%'
So the optimization is harder to implement. But still possible with some fancy SQL:
SELECT count(DISTINCT created_by) -- does not count NULL (as desired)
FROM b
CROSS JOIN LATERAL (
WITH RECURSIVE t AS (
( -- parentheses required
SELECT created_by
FROM a
WHERE org_id = b.org_id
ORDER BY created_by
LIMIT 1
)
UNION ALL
SELECT (SELECT created_by
FROM a
WHERE org_id = b.org_id
AND created_by > t.created_by
ORDER BY created_by
LIMIT 1)
FROM t
WHERE t.created_by IS NOT NULL -- stop recursion
)
TABLE t
) a
WHERE b.org_name LIKE '%myorg%';
db<>fiddle here (Postgres 12, but works in Postgres 9.6 as well.)
That's a recursive CTE in a LATERAL
subquery, using a correlated subquery.
It utilizes the multicolumn index from above to only retrieve a single row for every (org_id, created_by)
. With an index-only scans if the table is vacuumed enough.
The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and only read very few fast index tuples.
Due to the added overhead it can be a bit slower for an unfavorable data distribution (many org_id
and/or only few rows per created_by
) But it's much faster for favorable conditions and is scales excellently, even for millions of rows. You'll have to test to find the sweet spot.
Related: