Slow query with distinct/group by on varchar column with Postgres

Question

I have a company table and an industry table, with a many-to-many relation table linking the two, named company_industry. The company table currently has approximately 750.000 rows.

Database schema

Now I need a query that finds all of the unique city names for a given industry, in which there are at least one company. So basically I have to find all companies that are associated with a given industry and select the unique city names for these companies.

I can write queries that do this just fine, but not with the performance that I am looking for. In advance I was a bit skeptical about the performance because the city_name column is of the type VARCHAR. Unfortunately I do currently not have the liberty of being able to change the database schema to something more normalized.

The first thing I did was to add an index on the city_name column, and then I tried the below queries.

SELECT c.city_name AS city
FROM industry AS i 
INNER JOIN company_industry AS ci ON (ci.industry_id = i.id)
INNER JOIN company AS c ON (c.id = ci.company_id)
WHERE i.id = 288
GROUP BY city;

The above query takes around two seconds to execute on average. The same is the case when replacing GROUP BY with DISTINCT. Below is the execution plan of the above query.

HashAggregate  (cost=56934.21..56961.61 rows=2740 width=9) (actual time=2421.364..2421.921 rows=1962 loops=1)
  ->  Hash Join  (cost=38972.69..56902.50 rows=12687 width=9) (actual time=954.377..2411.194 rows=12401 loops=1)
        Hash Cond: (ci.company_id = c.id)
        ->  Nested Loop  (cost=0.28..13989.91 rows=12687 width=4) (actual time=0.041..203.442 rows=12401 loops=1)
              ->  Index Only Scan using industry_pkey on industry i  (cost=0.28..8.29 rows=1 width=4) (actual time=0.015..0.018 rows=1 loops=1)
                    Index Cond: (id = 288)
                    Heap Fetches: 0
              ->  Seq Scan on company_industry ci  (cost=0.00..13854.75 rows=12687 width=8) (actual time=0.020..199.087 rows=12401 loops=1)
                    Filter: (industry_id = 288)
                    Rows Removed by Filter: 806309
        ->  Hash  (cost=26036.52..26036.52 rows=744152 width=13) (actual time=954.113..954.113 rows=744152 loops=1)
              Buckets: 4096  Batches: 64  Memory Usage: 551kB
              ->  Seq Scan on company c  (cost=0.00..26036.52 rows=744152 width=13) (actual time=0.008..554.662 rows=744152 loops=1)
Total runtime: 2422.185 ms

I tried to change the query to use a subquery as below, which made the query roughly twice as fast.

SELECT c.city_name
FROM company AS c
WHERE EXISTS(
  SELECT 1
  FROM company_industry
  WHERE industry_id = 288 AND company_id = c.id
)
GROUP BY c.city_name;

And the execution plan for this query:

HashAggregate  (cost=47108.71..47136.11 rows=2740 width=9) (actual time=1270.171..1270.798 rows=1962 loops=1)
  ->  Hash Semi Join  (cost=14015.50..47076.98 rows=12690 width=9) (actual time=194.548..1251.785 rows=12401 loops=1)
        Hash Cond: (c.id = company_industry.company_id)
        ->  Seq Scan on company c  (cost=0.00..26036.52 rows=744152 width=13) (actual time=0.008..537.856 rows=744152 loops=1)
        ->  Hash  (cost=13856.88..13856.88 rows=12690 width=4) (actual time=194.399..194.399 rows=12401 loops=1)
              Buckets: 2048  Batches: 1  Memory Usage: 436kB
              ->  Seq Scan on company_industry  (cost=0.00..13856.88 rows=12690 width=4) (actual time=0.012..187.449 rows=12401 loops=1)
                    Filter: (industry_id = 288)
                    Rows Removed by Filter: 806309
Total runtime: 1271.030 ms

That's better, but hopefully you guys can help me do better.

Basically, the expensive part of the query seems to be finding the unique city names (as expected), and even with an index on the column, the performance is not quite good enough. I am quite rusty in regards to analyzing execution plans, but I included them so you guys can see exactly what is happening.

What can I do to retrieve this data faster?

I am using Postgres 9.3.5, DDL below:

CREATE TABLE company (
  id SERIAL PRIMARY KEY NOT NULL,
  name VARCHAR(150) NOT NULL,
  city_name VARCHAR(50),
);

CREATE TABLE company_industry (
  company_id INT NOT NULL REFERENCES company (id) ON UPDATE CASCADE,
  industry_id INT NOT NULL REFERENCES industry (id) ON UPDATE CASCADE,
  PRIMARY KEY (company_id, industry_id)
);

CREATE TABLE industry (
  id SERIAL PRIMARY KEY NOT NULL,
  name VARCHAR(100) NOT NULL
);

CREATE INDEX company_city_name_index ON company (city_name);

Please, always provide your version of Postgres, especially for performance optimization. Should go without saying. — Erwin Brandstetter, May 16 '15 at 16:49

Erwin Brandstetter · Accepted Answer · 2022-02-07T22:14:43.363

There is a Seq Scan on company_industry in both query plans that should really be a (bitmap) index scan. The same goes for Seq Scan on company.

Seems to be a issue of missing indexes - or something is not right in your db. If something seems wrong, draw a backup before you proceed. Check if cost settings and statistics are valid:

Keep PostgreSQL from sometimes choosing a bad query plan

If settings are good, I would then check the relevant indices (as detailed below). Maybe REINDEX can fix it:

REINDEX TABLE company;
REINDEX TABLE company_industry;

Maybe you need to do more:

Optimize Postgres query on timestamp range

Also, you can simplify the query:

SELECT c.city_name AS city
FROM   company_industry ci
JOIN   company          c ON c.id = ci.company_id
WHERE  ci.industry_id = 288
GROUP  BY 1;

Notes

If your PK constraint is on (company_id, industry_id) add another (unique) index on (industry_id, company_id) (reversed order!). Why?

Is a composite index also good for queries on the first field?

Seq Scan on company is equally bothersome. It seems like there is no index on company(id), but your ER diagram indicates a PK, so that cannot be?
The fastest option would be to have a multicolumn index on (id, city_name) - if (and only if) you get index-only scans out of it.

Since you already have the id of the given industry, you don't need to include the table industry table at all.

No need for parentheses around the expression(s) in the ON clause.

This is unfortunate:

Unfortunately I do currently not have the liberty of being able to change the database schema to something more normalized.

Your simple schema makes sense for small tables with little redundancy and hardly any strain on available cache memory. But city names are probably highly redundant in big tables. Normalizing would shrink table and index sizes considerably, which is the most important factor for performance.
A de-normalized form with redundant storage can sometimes speed up targeted queries, sometimes not, it depends. But it always affects everything else adversely. Redundant storage eats more of your available cache, so other data has to be evicted sooner. Even if you gain something locally, you may lose overall.
In this particular case it would also be considerably cheaper to get distinct values for a city_id int column, because integer values are smaller and faster to compare than (potentially long) strings. A multicolumn index on (id, city_id) in company would be smaller than the same for (id, city_name) and faster to process. One more join after folding many duplicates is comparatively cheap.

If you need top performance, you can always add a MATERIALIZED VIEW for the special purpose with pre-computed results (readily aggregated and with an index on industry_id), but avoid massively redundant data in your prime tables.

Thank you for an excellent answer! I will get back to you after checking up on some things as per your suggestions. Also, I totally agree with you in regards to the normalization of the table; if I could get away with altering the database structure right now, I would, but unfortunately it is not an option as of right now. Thanks again! — ba0708, May 16 '15 at 21:25
I followed your advice, and the following are my results. I reindexed the tables, analyzed them, and vacuumed them, without success. Adding an index to `company_industry` for `(industry_id, company_id)` (reverse order) does result in a bitmap index scan. When adding an index for `(id,city_name)` on the `company` table, it is an "Index Only Scan" on the new index. The query now executes in ~300ms because of these indexes, which is a great improvement. I can definitely live with this until I get the chance to normalize my table. Thank you very much! — ba0708, May 18 '15 at 11:44
If the indexes were missing, then there was no reason to suspect breakage. `REINDEX` etc. was only meant for that case. Either way, `REINDEX`, `ANALYZE, `VACUUM` don't hurt. — Erwin Brandstetter, May 18 '15 at 13:00

Mahmoud.Mahfouz · Answer 2 · 2015-05-16T19:00:33.787

-1

If you want this query in milli-seconds range then you should de-normalize your data by adding another column city_name to the junction table company_industry and index it.

this way you will only query on (un-tested) SELECT DISTINCT(c.city_name) FROM company_industry ci GROUP BY ci.industry_id HAVING COUNT(ci.company_id) >= 1

edited May 16 '15 at 19:00

answered May 16 '15 at 12:51

Mahmoud.Mahfouz

112
1
3

De-normalized storage may even be *slower*, not faster, because you are bloating the table with redundant data this way. More importantly, it would have adverse affects on multiple other aspects in the database. Also, the query does doe make much sense. `COUNT(ci.company_id >= 1)` is *always* true, `WHERE ci.industry_id = 288` is missing. – Erwin Brandstetter May 16 '15 at 16:23
so can you please tell me what are the "it would have adverse affects on multiple other aspects in the database." and I think you should read [facebook-timeline-brought-to-you-by-the-power-of-denormaliza](http://highscalability.com/blog/2012/1/23/facebook-timeline-brought-to-you-by-the-power-of-denormaliza.html) – Mahmoud.Mahfouz May 16 '15 at 19:03

Slow query with distinct/group by on varchar column with Postgres

2 Answers2

Notes