
This query ran in reasonable time when the table was small. I'm trying to identify the bottleneck, but I'm not sure how to analyze the EXPLAIN results.

SELECT
  COUNT(*)
FROM performance_analyses
INNER JOIN total_sales ON total_sales.id = performance_analyses.total_sales_id
WHERE
  (size > 0) AND
  total_sales.customer_id IN (
    SELECT customers.id FROM customers WHERE customers.active = 't'
    AND customers.visible = 't' AND customers.organization_id = 3
  ) AND
  total_sales.product_category_id IN (
    SELECT product_categories.id FROM product_categories
    WHERE product_categories.organization_id = 3
  ) AND
  total_sales.period_id = 193;

I've tried both INNER JOINing the customers and product_categories tables and using IN subqueries (as above). Both took the same time.
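
For reference, the JOIN variant I tried looks roughly like this (a sketch with the same filters expressed as explicit joins):

SELECT
  COUNT(*)
FROM performance_analyses
INNER JOIN total_sales ON total_sales.id = performance_analyses.total_sales_id
INNER JOIN customers ON customers.id = total_sales.customer_id
INNER JOIN product_categories ON product_categories.id = total_sales.product_category_id
WHERE
  (size > 0) AND
  customers.active = 't' AND
  customers.visible = 't' AND
  customers.organization_id = 3 AND
  product_categories.organization_id = 3 AND
  total_sales.period_id = 193;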

Here's the link to EXPLAIN: https://explain.depesz.com/s/9lhr

Postgres version:

PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit

Tables and indexes:

CREATE TABLE total_sales (
  id serial NOT NULL,
  value double precision,
  start_date date,
  end_date date,
  product_category_customer_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  processed boolean,
  customer_id integer,
  product_category_id integer,
  period_id integer,
  CONSTRAINT total_sales_pkey PRIMARY KEY (id)
);
CREATE INDEX index_total_sales_on_customer_id ON total_sales (customer_id);
CREATE INDEX index_total_sales_on_period_id ON total_sales (period_id);
CREATE INDEX index_total_sales_on_product_category_customer_id ON total_sales (product_category_customer_id);
CREATE INDEX index_total_sales_on_product_category_id ON total_sales (product_category_id);
CREATE INDEX total_sales_product_category_period ON total_sales (product_category_id, period_id);
CREATE INDEX ts_pid_pcid_cid ON total_sales (period_id, product_category_id, customer_id);


CREATE TABLE performance_analyses (
  id serial NOT NULL,
  total_sales_id integer,
  status_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  size double precision,
  period_size integer,
  nominal_variation double precision,
  percentual_variation double precision,
  relative_performance double precision,
  time_ago_max integer,
  deseasonalized_series text,
  significance character varying,
  relevance character varying,
  original_variation double precision,
  last_level double precision,
  quantiles text,
  range text,
  analysis_method character varying,
  CONSTRAINT performance_analyses_pkey PRIMARY KEY (id)
);
CREATE INDEX index_performance_analyses_on_status_id ON performance_analyses (status_id);
CREATE INDEX index_performance_analyses_on_total_sales_id ON performance_analyses (total_sales_id);


CREATE TABLE product_categories (
  id serial NOT NULL,
  name character varying,
  organization_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  external_id character varying,
  CONSTRAINT product_categories_pkey PRIMARY KEY (id)
);
CREATE INDEX index_product_categories_on_organization_id ON product_categories (organization_id);


CREATE TABLE customers (
  id serial NOT NULL,
  name character varying,
  external_id character varying,
  region_id integer,
  organization_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  active boolean DEFAULT false,
  visible boolean DEFAULT false,
  segment_id integer,
  "group" boolean,
  group_id integer,
  ticket_enabled boolean DEFAULT true,
  CONSTRAINT customers_pkey PRIMARY KEY (id)
);
CREATE INDEX index_customers_on_organization_id ON customers (organization_id);    
CREATE INDEX index_customers_on_region_id ON customers (region_id);
CREATE INDEX index_customers_on_segment_id ON customers (segment_id);

Row counts:

  • customers - 6,970 rows
  • product_categories - 34 rows
  • performance_analyses - 1,012,346 rows
  • total_sales - 7,104,441 rows
  • What happens if you move `AND "total_sales"."period_id" = 193` into the join: `INNER JOIN "total_sales" ON "total_sales"."id" = "performance_analyses"."total_sales_id" AND "total_sales"."period_id" = 193` – Jacob Mulquin Jul 06 '16 at 23:14
  • @mulquin That doesn't change anything either. Strange thing: the first inner SELECT (on the customers table), when run by itself, returns 5k rows and takes 500ms. I think that's too long for a simple query without any joins on a table with 6k rows. – João Daniel Jul 06 '16 at 23:18
  • Definitely too long. Do the active, visible and organization_id columns have indexes? – Jacob Mulquin Jul 06 '16 at 23:21
  • @mulquin organization_id does have one. The columns active and visible are booleans, so the cardinality is very low and I didn't consider an index. Should they have one? – João Daniel Jul 06 '16 at 23:23
  • Try to add an index for total_sales with the columns `total_sales (product_category_id, period_id)` (in a combined index, not separate indexes as you have). You can switch the order, depending on your data. – Solarflare Jul 06 '16 at 23:23
  • @JoãoDaniel Here is a question that relates to indexing char(1) columns: http://stackoverflow.com/questions/18576415/mysql-indexing-char1-columns – Jacob Mulquin Jul 06 '16 at 23:27
  • " I think it's taking too much for a simple query without any joins on a table with 6k rows." You are doing 2 implicit joins with the "in" statement (at least it has the same effect as joins). – Solarflare Jul 06 '16 at 23:31
  • @Solarflare I assumed that he meant this query was taking too long: `SELECT "customers"."id" FROM "customers" WHERE "customers"."active" = 't' AND "customers"."visible" = 't' AND "customers"."organization_id" = 3` – Jacob Mulquin Jul 06 '16 at 23:33
  • @mulquin you're correct. This inner select is the one taking 500ms. Full query takes 2000ms. – João Daniel Jul 06 '16 at 23:42
  • Another correction: the simple inner SELECT isn't taking 500ms. Part of it was network overhead. The query itself is around 10ms, which is reasonable. However, the complete query still takes 2000ms, which is too much :( – João Daniel Jul 07 '16 at 00:02
  • @Solarflare the index you suggested improved a lot! Time went from 2000ms to 500ms. Thanks! I'm not sure if it's a reasonable time though. – João Daniel Jul 07 '16 at 00:08
  • schema please with `show create table xyz` for each xyz table above. Also the output from `explain` would not be a bad idea. Oops, old tag info, you have this on `postgresql` now, nevermind – Drew Jul 07 '16 at 00:49
  • Whether it's reasonable depends on your requirements. You can always create more indexes that will speed up select queries, while slowing down updates/inserts and using resources. But try: `performance_analyses (total_sales_id, size)`; then `total_sales (product_category_id, period_id, customer_id)` or `total_sales (customer_id, period_id, product_category_id)` (your numbers suggest the latter); lastly an index on `customer(organization_id, active, visible)` - but probably with lowest impact (all three are sketched after these comments). But be careful not to micro-optimize too much (e.g. 4 indexes for a query you run once a month). – Solarflare Jul 07 '16 at 02:16
  • In situations like this it often helps to make secondary "cache" tables that store counts, provided that the values aren't updated too frequently. – Alkanshel Oct 12 '17 at 18:30
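
The indexes suggested in the comments above, sketched as SQL (the index names here are illustrative, not from the discussion):

CREATE INDEX pa_total_sales_id_size_idx ON performance_analyses (total_sales_id, size);
CREATE INDEX ts_cid_pid_pcid_idx ON total_sales (customer_id, period_id, product_category_id);
CREATE INDEX customers_org_active_visible_idx ON customers (organization_id, active, visible);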

3 Answers


Your query, rewritten and 100 % equivalent:

SELECT count(*)
FROM   product_categories   pc 
JOIN   customers            c  USING (organization_id) 
JOIN   total_sales          ts ON ts.customer_id = c.id
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
WHERE  pc.organization_id = 3
AND    c.active  -- boolean can be used directly
AND    c.visible
AND    ts.product_category_id = pc.id
AND    ts.period_id = 193
AND    pa.size > 0;

Another answer advises moving all conditions into join clauses and ordering the tables in the FROM list. This may apply to a certain other RDBMS with a comparatively primitive query planner. But while it doesn't hurt for Postgres either, it also has no effect on performance for your query - assuming default server configuration. The manual:

Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN) is semantically the same as listing the input relations in FROM, so it does not constrain the join order.

Bold emphasis mine. There is more; read the manual.

The key setting is join_collapse_limit (with default 8). The Postgres query planner will rearrange your 4 tables any way it expects it to be fastest, no matter how you arranged your tables and whether you write conditions as WHERE or JOIN clauses. No difference whatsoever. (The same is not true for some other types of joins that cannot be rearranged freely.)
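
If you ever did want to enforce the join order exactly as written (rarely a good idea), you could lower this setting for the session. A minimal sketch:

SHOW join_collapse_limit;     -- 8 by default
SET join_collapse_limit = 1;  -- the planner now keeps the join order as written
-- ... run and inspect the query ...
RESET join_collapse_limit;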

The important point is that these different join possibilities give semantically equivalent results but might have hugely different execution costs. Therefore, the planner will explore all of them to try to find the most efficient query plan.


Finally, WHERE id IN (<subquery>) is not generally equivalent to a join. It does not multiply rows on the left side for duplicate matching values on the right side. And columns of the subquery are not visible for the rest of the query. A join can multiply rows with duplicate values and columns are visible.
Your simple subqueries dig up a single unique column in both cases, so there is no effective difference in this case - except that IN (<subquery>) is generally (at least a bit) slower and more verbose. Use joins.
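
A tiny demonstration of that semantic difference, using hypothetical throwaway tables:

CREATE TEMP TABLE t_left  (val int);
CREATE TEMP TABLE t_right (val int);
INSERT INTO t_left  VALUES (1);
INSERT INTO t_right VALUES (1), (1);  -- duplicate matching value

SELECT count(*) FROM t_left WHERE val IN (SELECT val FROM t_right);  -- returns 1
SELECT count(*) FROM t_left JOIN t_right USING (val);                -- returns 2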

Your query

Indexes

product_categories has 34 rows. Unless you plan on adding many more, indexes do not help performance for this table. A sequential scan will always be faster. Drop index_product_categories_on_organization_id.

customers has 6,970 rows. Indexes start to make sense. But your query uses 4,988 of them according to the EXPLAIN output. Only an index-only scan on an index much less wide than the table could help a bit. Assuming WHERE active AND visible are constant predicates, I suggest a partial multicolumn index:

CREATE INDEX customers_organization_id_partial_idx ON customers (organization_id, id)
WHERE active AND visible;

I appended id to allow index-only scans. The column is otherwise useless in the index for this query.
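
Index-only scans also depend on an up-to-date visibility map, so vacuuming the table after bulk changes helps; for example:

VACUUM ANALYZE customers;  -- refreshes the visibility map so index-only scans can skip heap fetches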

total_sales has 7,104,441 rows. Indexes are very important. I suggest:

CREATE INDEX ts_pid_pcid_cid_id_idx
ON total_sales (period_id, product_category_id, customer_id, id);

Again, aiming for an index-only scan. This is the most important one.

You can delete the completely redundant index index_total_sales_on_product_category_id.
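
That is:

DROP INDEX index_total_sales_on_product_category_id;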

performance_analyses has 1,012,346 rows. Indexes are very important. I would suggest another partial index with the condition size > 0:

CREATE INDEX pa_total_sales_id_partial_idx
ON performance_analyses (total_sales_id)
WHERE size > 0;

However:

Rows Removed by Filter: 0

It seems like this condition serves no purpose. Are there any rows for which size > 0 is not true?

After creating these indexes you need to ANALYZE the tables.
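
For example:

ANALYZE customers;
ANALYZE total_sales;
ANALYZE performance_analyses;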

Table statistics

Generally, I see many bad estimates. Postgres underestimates the number of rows returned at almost every step. The nested loops we see would work much better for fewer rows. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to visit your settings for autovacuum and probably also per-table settings for your two big tables performance_analyses and total_sales.
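
A sketch of per-table settings (the scale factors here are illustrative; tune them for your workload):

ALTER TABLE total_sales SET (
  autovacuum_analyze_scale_factor = 0.02,  -- analyze after ~2% of rows change
  autovacuum_vacuum_scale_factor  = 0.05);
ALTER TABLE performance_analyses SET (
  autovacuum_analyze_scale_factor = 0.02,
  autovacuum_vacuum_scale_factor  = 0.05);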

You already ran VACUUM and ANALYZE, which made the query slower, according to your comments. That doesn't make a lot of sense. I would run VACUUM FULL on these two tables once (if you can afford an exclusive lock). Else try pg_repack.
With all the fishy statistics and bad plans I would consider running a complete vacuumdb -fz yourdb on your DB. That rewrites all tables and indexes in pristine condition, but it's no good to use on a regular basis. It's also expensive and will lock your DBs for an extended period of time!
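
Roughly, and mind the exclusive locks:

VACUUM FULL ANALYZE performance_analyses;
VACUUM FULL ANALYZE total_sales;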

While you're at it, have a look at the cost settings of your DB as well.

Erwin Brandstetter

Although theoretically the optimizer should be able to do all of this itself, I often find that these changes can massively improve performance:

  • use proper joins (instead of where id in (select ...))
  • order the reference to tables in the from clause such that the fewest rows are returned at each join, especially the first table's condition (in the where clause) should be the most restrictive (and should use indexes)
  • move all conditions on joined tables into the on condition of joins

Try this (aliases added for readability):

select count(*)
from total_sales ts
join product_categories pc on ts.product_category_id = pc.id and pc.organization_id = 3
join customers c on ts.customer_id = c.id and c.organization_id = 3
join performance_analyses pa on ts.id = pa.total_sales_id and pa.size > 0
where ts.period_id = 193;

You will need to create this index for optimal performance (to allow an index-only scan on total_sales):

create index ts_pid_pcid_cid on total_sales (period_id, product_category_id, customer_id);

This approach first narrows the data to a period, so it will scale (remain roughly constant) into the future, because the number of sales per period will be roughly constant.

Bohemian
  • I just added more information about the tables. Unfortunately, it didn't change the running time. Does the new information about table sizes change anything? Thanks! (Although it didn't improve the time, the three points you listed really improved my understanding!) – João Daniel Jul 07 '16 at 20:38
  • @JoãoDaniel try running `vacuum; analyze total_sales; analyze product_categories; analyze customers; analyze performance_analyses;` then retrying the query. – Bohemian Jul 07 '16 at 20:47
  • Oh gosh, it increased the execution time from 500ms to 1000ms. Does that make sense? – João Daniel Jul 07 '16 at 22:57
  • @JoãoDaniel not much sense, no. Just a guess: It could be that previously data was cached in memory. Try running a lot of similar queries (for different periods/organizations) then retry timings. – Bohemian Jul 07 '16 at 23:49

The estimates there are not accurate. Postgres's planner is wrongly choosing a nested loop - try penalizing nested loops with the statement SET enable_nestloop TO off.
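
For instance, to test this in a single session (don't leave it off globally; it is a diagnostic tool, not a fix):

SET enable_nestloop TO off;  -- discourage nested-loop joins for this session
-- rerun the query and compare the EXPLAIN ANALYZE output
RESET enable_nestloop;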

Pavel Stehule