What can I do to optimize my SQL query that applies to both Postgres and MySQL?

Question

My table DDL is:

CREATE TABLE accounts (
    id serial NOT NULL,
    active bool NULL DEFAULT true,
    created_at timestamp NULL,
    organization_id NULL,
    CONSTRAINT accounts_pkey PRIMARY KEY (id)
);
CREATE INDEX index_accounts_on_active ON accounts USING btree (active);
CREATE INDEX index_accounts_on_created_at ON accounts USING btree (created_at);
CREATE INDEX index_accounts_on_organization_id ON accounts USING btree (organization_id);

My table has 200k records and my query is:

select count(*) from account where active = false and organization_id in (3,2,20,30,99,69) and created_at >= '2016-09-03 15:29:54.541924';

Explain analyze return this query plan:

Finalize Aggregate  (cost=5791.37..5791.38 rows=1 width=8) (actual time=36.504..36.504 rows=1 loops=1)
  ->  Gather  (cost=5791.26..5791.37 rows=1 width=8) (actual time=36.411..38.788 rows=2 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        ->  Partial Aggregate  (cost=4791.26..4791.27 rows=1 width=8) (actual time=31.313..31.313 rows=1 loops=2)
              ->  Parallel Seq Scan on accounts  (cost=0.00..4625.94 rows=66126 width=0) (actual time=0.073..26.518 rows=56057 loops=2)
                    Filter: ((NOT active) AND (created_at >= '2016-09-03 15:29:54.541924'::timestamp without time zone) AND (organization_id = ANY ('{3,2,20,30,99,69}'::integer[])))
                    Rows Removed by Filter: 43943
Planning Time: 0.293 ms
Execution Time: 38.863 ms

Thanks for any suggestions.

You have two answers. Which one provides the best performance? — Strawberry, Sep 06 '20 at 02:23
I am using Postgres at the moment but I'd like a solution that is compatible with both databases. — Hokiedood, Sep 08 '20 at 09:36

Gordon Linoff · Answer 1 · 2020-09-06T02:23:04.003

For this query:

select count(*)
from account
where active = false and
      organization_id in (3,2,20,30,99,69) and
      created_at >= '2016-09-03 15:29:54.541924';

I would recommend an index with three keys: (active, organization_id, created_at).

That said, the most efficient way to write the query across the two databases is probably:

select sum(cnt)
from ((select count(*) as cnt
       from account
       where active = false and
             organization_id = 3 and
             created_at >= '2016-09-03 15:29:54.541924'
      ) union all
      (select count(*) as cnt
       from account
       where active = false and
             organization_id = 2 and
             created_at >= '2016-09-03 15:29:54.541924'
      ) union all
      . . .
     ) a;

This can make full use of an index on (active, organization_id, created_at) or (organization_id, active, created_at).

score 0 · Answer 2 · answered Sep 06 '20 at 02:08

You need a composite index here to cover the entire WHERE clause. The following should work on either database:

CREATE INDEX idx ON account (organization_id, created_at, active);

I placed the more restrictive columns first, followed by least restrictive. That is, I am assuming few records would match your restrictions on organization_id, while significantly more would match active.

Note that while you do have indices on the above three columns, they are in separate indices. Most of the time (but not always), a database will choose to use only a single index to satisfy an execution plan. This means that, e.g., Postgres would probably choose to use neither of your indices, because no single one covers the entire WHERE clause.

No. Place the column(s) tested with `=` first. That will cluster the information better. — Rick James, Sep 06 '20 at 04:30
@RickJames I don't know about mysql, but postgresql uses the leading organization_id just fine, jumping to the correct created_at offset within each in-list value. (But due to the low selectivity, there is probably little point in doing it that way here). — jjanes, Sep 06 '20 at 17:39

score 0 · Answer 3 · answered Sep 06 '20 at 04:37

For MySQL, I say

INDEX(active, organization_id, created_at),
INDEX(active, created_at, organization_id)

The Optimizer will look at the statistics to decide which is probably faster.

Each is "covering". The optimizer will use the first two columns of the INDEX that it picks to do filtering, then finish off the filtering with the third column.

active must be first because it is tested with = and the other two are not.

Argument for = over "cardinality": Higher cardinality column first in an index when involving a range?

score 0 · Answer 4 · answered Sep 06 '20 at 17:40

0

Given that you are visiting over half the rows:

(actual) rows=56057
Rows Removed by Filter: 43943

There is little reason to think any index is going to be extremely useful, compared to just doing the seq scan.

answered Sep 06 '20 at 17:40

jjanes

37,812
5
27
34

What can I do to optimize my SQL query that applies to both Postgres and MySQL?

4 Answers4