Best way to get distinct count from a query joining two tables

Question

I have 2 tables, table A & table B.

Table A (has thousands of rows)

id
uuid
name
type
created_by
org_id

Table B (has a max of hundred rows)

org_id
org_name

I am trying to find the best join query to obtain a count with a WHERE clause. I need the count of distinct created_by values from table A whose org_name in table B contains 'myorg'. I currently have the query below (which produces the expected results) and wonder whether it can be optimized further.

select count(distinct a.created_by)
from a left join
     b
     on a.org_id = b.org_id 
where b.org_name like '%myorg%';

Please don't change the question after answers have been given. Start a new question instead. I rolled back your change, you can find everything in the [revisions](https://stackoverflow.com/posts/61028864/revisions). — Erwin Brandstetter, Apr 04 '20 at 21:42
@ErwinBrandstetter, sure! [new q](https://stackoverflow.com/questions/61037164/best-way-to-get-distinct-count-from-a-query-joining-two-tables-multiple-join-po) — Niv, Apr 05 '20 at 02:38

score 1 · Accepted Answer · answered Apr 04 '20 at 13:33

You don't need a left join:

select count(distinct a.created_by)
from a join
     b
     on a.org_id = b.org_id
where b.org_name like '%myorg%'

For this query, you want an index on b.org_id, which I assume you have.
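
A minimal sketch of why the outer join is unnecessary here, using throwaway temp tables inside a transaction that is rolled back. The column types and the sample rows are assumptions made up for illustration, not taken from the question. The WHERE condition on b.org_name can never be true for the NULL-extended rows that a LEFT JOIN adds, so both forms count the same rows.

```sql
BEGIN;

-- Hypothetical sample data; column types are assumptions.
CREATE TEMP TABLE b (org_id int PRIMARY KEY, org_name text);
CREATE TEMP TABLE a (id int, created_by int, org_id int);

INSERT INTO b VALUES (1, 'myorg-east'), (2, 'other');
INSERT INTO a VALUES
  (1, 10, 1),
  (2, 10, 1),
  (3, 11, 1),
  (4, 12, 2),
  (5, 13, NULL);  -- no matching org: LEFT JOIN keeps it with b.* set to NULL

-- LEFT JOIN: the row with id = 5 survives the join, but
-- NULL LIKE '%myorg%' evaluates to NULL, so WHERE discards it again.
SELECT count(DISTINCT a.created_by) AS left_join_count
FROM   a
LEFT   JOIN b ON a.org_id = b.org_id
WHERE  b.org_name LIKE '%myorg%';

-- INNER JOIN: the same rows reach the aggregate.
SELECT count(DISTINCT a.created_by) AS inner_join_count
FROM   a
JOIN   b ON a.org_id = b.org_id
WHERE  b.org_name LIKE '%myorg%';

ROLLBACK;  -- discard the scratch tables
```

With this sample data both queries return 2 (created_by 10 and 11).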

Gordon Linoff


score 1 · Answer 2 · answered Apr 04 '20 at 13:35

I would use exists for this:

select count(distinct a.created_by)
from a
where exists (select 1 from b where b.org_id = a.org_id and b.org_name like '%myorg%')

An index on b(org_id) would help. But in terms of performance, key points are:

searching using like with a wildcard on both sides is not good for performance (this cannot take advantage of an index); it would be far better to search for an exact match, or at least to not have a wildcard on the left side of the string.
count(distinct ...) is more expensive than a regular count(); if you don't really need distinct, then don't use it.

`like` is applied in table `b` with max 100 rows, so hardly relevant to performance in this particular case. In other cases, a trigram index can be used: https://stackoverflow.com/a/13452528/939860 — Erwin Brandstetter, Apr 04 '20 at 17:42
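
For cases where the filtered table is large enough for the LIKE pattern to matter, the trigram index mentioned in the comment could look roughly like this. It is a sketch that assumes the pg_trgm extension is available and that you have the privileges to install it; the index name b_org_name_trgm_idx is made up for illustration.

```sql
-- pg_trgm is one of the additional supplied (contrib) modules of Postgres.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN index over the trigrams of org_name. It can support LIKE / ILIKE
-- patterns with a leading wildcard, such as '%myorg%'.
-- The index name is hypothetical.
CREATE INDEX b_org_name_trgm_idx ON b USING gin (org_name gin_trgm_ops);

-- Check whether the planner actually uses it. On a table with about
-- 100 rows a sequential scan will normally still be chosen.
EXPLAIN
SELECT org_id
FROM   b
WHERE  org_name LIKE '%myorg%';
```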

Erwin Brandstetter · Answer 3 · 2020-04-04T17:46:46.420

Your query looks good already. Use a plain [INNER] JOIN instead of LEFT [OUTER] JOIN, like Gordon suggested. But that won't change much.

You mention that table B has only "a max of hundred rows" while table A has "thousands of rows".

If there are many rows per created_by (which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)

The essential ingredient is this multicolumn index:

CREATE INDEX ON a (org_id, created_by);

It can replace a simple index on just (org_id) and works for your simple query as well. See:

Is a composite index also good for queries on the first field?
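
For orientation, the basic form of the technique for a single, known org_id can be sketched as follows. It assumes table a from the question and the multicolumn index above; the literal 42 is a placeholder value, and created_by is assumed to have a type with a usable sort order (for instance integer or text).

```sql
-- Emulated index skip scan for ONE org_id (42 is a placeholder).
-- Each step asks the index on a (org_id, created_by) for the next
-- larger created_by instead of reading every row of that org.
WITH RECURSIVE t AS (
   (  -- smallest created_by for this org; parentheses required
   SELECT created_by
   FROM   a
   WHERE  org_id = 42
   ORDER  BY created_by
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT created_by          -- next larger value, or NULL when done
           FROM   a
           WHERE  org_id = 42
           AND    created_by > t.created_by
           ORDER  BY created_by
           LIMIT  1)
   FROM   t
   WHERE  t.created_by IS NOT NULL    -- stop once no larger value exists
   )
SELECT count(created_by) AS distinct_created_by  -- count() skips the final NULL
FROM   t;
```

Because every step moves to a strictly larger value, each created_by appears at most once in t, so no DISTINCT is needed in this simple case.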

There are two complications for your case:

DISTINCT
0-n org_id resulting from org_name like '%myorg%'

So the optimization is harder to implement. But still possible with some fancy SQL:

SELECT count(DISTINCT created_by)  -- does not count NULL (as desired)
FROM   b
CROSS  JOIN LATERAL (
   WITH RECURSIVE t AS (
      (  -- parentheses required
      SELECT created_by
      FROM   a
      WHERE  org_id = b.org_id
      ORDER  BY created_by
      LIMIT 1
      )
      UNION ALL
      SELECT (SELECT created_by
              FROM   a
              WHERE  org_id = b.org_id
              AND    created_by > t.created_by
              ORDER  BY created_by
              LIMIT  1)
      FROM   t
      WHERE  t.created_by IS NOT NULL  -- stop recursion
      )
   TABLE t
   ) a
WHERE  b.org_name LIKE '%myorg%';

db<>fiddle here (Postgres 12, but works in Postgres 9.6 as well.)

That's a recursive CTE in a LATERAL subquery, using a correlated subquery.

It utilizes the multicolumn index from above to only retrieve a single row for every (org_id, created_by), with index-only scans if the table is vacuumed enough.

The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and only read very few fast index tuples.

Due to the added overhead, it can be a bit slower for an unfavorable data distribution (many org_id and/or only few rows per created_by). But it's much faster under favorable conditions, and it scales excellently, even for millions of rows. You'll have to test to find the sweet spot.
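
One way to look for that sweet spot is to generate synthetic data with a distribution you control and compare plans. The following is a rough sketch only: the row counts, the 50 organizations, the roughly 200 distinct creators and the integer column types are all arbitrary assumptions, not figures from the question. Run it in a fresh session, since the temp tables shadow any real tables named a and b until the session ends.

```sql
-- Scratch tables; temp tables vanish when the session ends.
CREATE TEMP TABLE b (org_id int PRIMARY KEY, org_name text);
CREATE TEMP TABLE a (id int, created_by int, org_id int);

-- 50 orgs, every fifth one matching '%myorg%' (arbitrary choice).
INSERT INTO b
SELECT s.n,
       CASE WHEN s.n % 5 = 0 THEN 'myorg_' ELSE 'other_' END || s.n::text
FROM   generate_series(1, 50) AS s(n);

-- 1M rows with at most 200 distinct created_by values: many rows per
-- created_by, i.e. the favorable case for the skip scan emulation.
INSERT INTO a
SELECT s.n,
       (random() * 199)::int + 1,   -- created_by in 1..200
       (random() * 49)::int + 1     -- org_id in 1..50
FROM   generate_series(1, 1000000) AS s(n);

CREATE INDEX ON a (org_id, created_by);
VACUUM ANALYZE a;   -- sets the visibility map for index-only scans, refreshes statistics
ANALYZE b;

-- Baseline: the plain join.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(DISTINCT a.created_by)
FROM   a
JOIN   b ON a.org_id = b.org_id
WHERE  b.org_name LIKE '%myorg%';
```

Running the recursive query under the same EXPLAIN (ANALYZE, BUFFERS) and then changing the number of distinct created_by values or organizations shows where the two approaches cross over on your hardware.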
