I have a company
table and an industry
table, with a many-to-many relation table linking the two, named company_industry
. The company
table currently has approximately 750.000 rows.
Now I need a query that finds all of the unique city names for a given industry, in which there are at least one company. So basically I have to find all companies that are associated with a given industry and select the unique city names for these companies.
I can write queries that do this just fine, but not with the performance that I am looking for. In advance I was a bit skeptical about the performance because the city_name
column is of the type VARCHAR
. Unfortunately I do currently not have the liberty of being able to change the database schema to something more normalized.
The first thing I did was to add an index on the city_name
column, and then I tried the below queries.
SELECT c.city_name AS city
FROM industry AS i
INNER JOIN company_industry AS ci ON (ci.industry_id = i.id)
INNER JOIN company AS c ON (c.id = ci.company_id)
WHERE i.id = 288
GROUP BY city;
The above query takes around two seconds to execute on average. The same is the case when replacing GROUP BY
with DISTINCT
. Below is the execution plan of the above query.
HashAggregate (cost=56934.21..56961.61 rows=2740 width=9) (actual time=2421.364..2421.921 rows=1962 loops=1)
-> Hash Join (cost=38972.69..56902.50 rows=12687 width=9) (actual time=954.377..2411.194 rows=12401 loops=1)
Hash Cond: (ci.company_id = c.id)
-> Nested Loop (cost=0.28..13989.91 rows=12687 width=4) (actual time=0.041..203.442 rows=12401 loops=1)
-> Index Only Scan using industry_pkey on industry i (cost=0.28..8.29 rows=1 width=4) (actual time=0.015..0.018 rows=1 loops=1)
Index Cond: (id = 288)
Heap Fetches: 0
-> Seq Scan on company_industry ci (cost=0.00..13854.75 rows=12687 width=8) (actual time=0.020..199.087 rows=12401 loops=1)
Filter: (industry_id = 288)
Rows Removed by Filter: 806309
-> Hash (cost=26036.52..26036.52 rows=744152 width=13) (actual time=954.113..954.113 rows=744152 loops=1)
Buckets: 4096 Batches: 64 Memory Usage: 551kB
-> Seq Scan on company c (cost=0.00..26036.52 rows=744152 width=13) (actual time=0.008..554.662 rows=744152 loops=1)
Total runtime: 2422.185 ms
I tried to change the query to use a subquery as below, which made the query roughly twice as fast.
SELECT c.city_name
FROM company AS c
WHERE EXISTS(
SELECT 1
FROM company_industry
WHERE industry_id = 288 AND company_id = c.id
)
GROUP BY c.city_name;
And the execution plan for this query:
HashAggregate (cost=47108.71..47136.11 rows=2740 width=9) (actual time=1270.171..1270.798 rows=1962 loops=1)
-> Hash Semi Join (cost=14015.50..47076.98 rows=12690 width=9) (actual time=194.548..1251.785 rows=12401 loops=1)
Hash Cond: (c.id = company_industry.company_id)
-> Seq Scan on company c (cost=0.00..26036.52 rows=744152 width=13) (actual time=0.008..537.856 rows=744152 loops=1)
-> Hash (cost=13856.88..13856.88 rows=12690 width=4) (actual time=194.399..194.399 rows=12401 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 436kB
-> Seq Scan on company_industry (cost=0.00..13856.88 rows=12690 width=4) (actual time=0.012..187.449 rows=12401 loops=1)
Filter: (industry_id = 288)
Rows Removed by Filter: 806309
Total runtime: 1271.030 ms
That's better, but hopefully you guys can help me do better.
Basically, the expensive part of the query seems to be finding the unique city names (as expected), and even with an index on the column, the performance is not quite good enough. I am quite rusty in regards to analyzing execution plans, but I included them so you guys can see exactly what is happening.
What can I do to retrieve this data faster?
I am using Postgres 9.3.5, DDL below:
CREATE TABLE company (
id SERIAL PRIMARY KEY NOT NULL,
name VARCHAR(150) NOT NULL,
city_name VARCHAR(50),
);
CREATE TABLE company_industry (
company_id INT NOT NULL REFERENCES company (id) ON UPDATE CASCADE,
industry_id INT NOT NULL REFERENCES industry (id) ON UPDATE CASCADE,
PRIMARY KEY (company_id, industry_id)
);
CREATE TABLE industry (
id SERIAL PRIMARY KEY NOT NULL,
name VARCHAR(100) NOT NULL
);
CREATE INDEX company_city_name_index ON company (city_name);