I'm experiencing a 5x-10x slowdown when adding a single condition to a WHERE clause in a join. I have verified that the indexes are being used, and this is a pretty simple query with two joins:
This query takes about 0.5 seconds:
EXPLAIN ANALYZE SELECT COUNT(*) FROM "businesses"
INNER JOIN categorizations ON categorizations.business_id = businesses.id
INNER JOIN postal_codes ON businesses.postal_code_id = postal_codes.id
WHERE categorizations.category_id IN (958,968,936)
AND lower(city) IN ('new york');
                                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=60600.79..60600.80 rows=1 width=0) (actual time=741.224..741.225 rows=1 loops=1)
   ->  Hash Join  (cost=321.63..60600.78 rows=2 width=0) (actual time=23.360..740.475 rows=795 loops=1)
         Hash Cond: (businesses.postal_code_id = postal_codes.id)
         ->  Nested Loop  (cost=184.63..60400.82 rows=16784 width=4) (actual time=19.200..690.901 rows=58076 loops=1)
               ->  Bitmap Heap Scan on categorizations  (cost=184.20..17662.46 rows=16784 width=4) (actual time=19.164..131.991 rows=58076 loops=1)
                     Recheck Cond: (category_id = ANY ('{958,968,936}'::integer[]))
                     ->  Bitmap Index Scan on categorizations_category_id  (cost=0.00..180.00 rows=16784 width=0) (actual time=9.994..9.994 rows=58076 loops=1)
                           Index Cond: (category_id = ANY ('{958,968,936}'::integer[]))
               ->  Index Scan using businesses_pkey on businesses  (cost=0.43..2.54 rows=1 width=8) (actual time=0.005..0.006 rows=1 loops=58076)
                     Index Cond: (id = categorizations.business_id)
         ->  Hash  (cost=135.49..135.49 rows=121 width=4) (actual time=0.449..0.449 rows=150 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 6kB
               ->  Index Scan using idx_postal_codes_lower_city on postal_codes  (cost=0.43..135.49 rows=121 width=4) (actual time=0.037..0.312 rows=150 loops=1)
                     Index Cond: (lower((city)::text) = 'new york'::text)
 Total runtime: 741.321 ms
(15 rows)
But adding just one condition (on the region) pushes the average runtime to 4 seconds:
EXPLAIN ANALYZE SELECT COUNT(*) FROM "businesses"
INNER JOIN categorizations ON categorizations.business_id = businesses.id
INNER JOIN postal_codes ON businesses.postal_code_id = postal_codes.id
WHERE categorizations.category_id IN (958,968,936)
AND lower(city) IN ('new york') AND lower(region) = 'new york';
                                                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1312.76..1312.77 rows=1 width=0) (actual time=2879.764..2879.765 rows=1 loops=1)
   ->  Nested Loop  (cost=16.77..1312.76 rows=1 width=0) (actual time=4.740..2878.936 rows=795 loops=1)
         ->  Nested Loop  (cost=16.21..1281.22 rows=18 width=4) (actual time=2.259..780.067 rows=206972 loops=1)
               ->  Index Scan using idx_postal_codes_city_region_country on postal_codes  (cost=0.43..2.65 rows=1 width=4) (actual time=0.052..0.432 rows=150 loops=1)
                     Index Cond: ((lower((city)::text) = 'new york'::text) AND (lower((region)::text) = 'new york'::text))
               ->  Bitmap Heap Scan on businesses  (cost=15.78..1267.29 rows=1128 width=8) (actual time=0.377..3.179 rows=1380 loops=150)
                     Recheck Cond: (postal_code_id = postal_codes.id)
                     ->  Bitmap Index Scan on index_businesses_on_postal_code_id  (cost=0.00..15.49 rows=1128 width=0) (actual time=0.219..0.219 rows=1380 loops=150)
                           Index Cond: (postal_code_id = postal_codes.id)
         ->  Index Only Scan using index_categorizations_on_business_id_and_category_id_and_source on categorizations  (cost=0.56..1.74 rows=1 width=4) (actual time=0.008..0.008 rows=0 loops=206972)
               Index Cond: ((business_id = businesses.id) AND (category_id = ANY ('{958,968,936}'::integer[])))
               Heap Fetches: 2
 Total runtime: 2879.854 ms
(13 rows)
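What stands out to me in this plan is the estimate on postal_codes: the planner expects rows=1 from idx_postal_codes_city_region_country but gets 150 back, and expects rows=18 from the inner Nested Loop against 206,972 actual rows, which presumably is what makes the nested-loop plan look cheap. To check whether the misestimate shows up for the combined predicate on its own, one can probe just that scan (a diagnostic sketch; I'm not including its output here):

EXPLAIN ANALYZE SELECT id FROM postal_codes
WHERE lower(city) = 'new york' AND lower(region) = 'new york';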
Note - I'm quoting average runtimes rather than relying on the single timings shown above, to head off the objection that caching is misleading me. While the tables are quite large (15 million businesses, 20 million categorizations, 1 million postal codes), I wouldn't expect such a drastic change in performance from changing a postal_codes condition. In fact, I would expect the query to get faster, since it can join on fewer rows. I'm hesitant to play around with the tuning options given that it's such a basic query.
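That said, a transaction-local experiment doesn't touch the server tuning at all, so a sketch like the following (SET LOCAL only lasts until the transaction ends) would at least show whether the plan choice, rather than the settings, is to blame:

BEGIN;
SET LOCAL enable_nestloop = off;
EXPLAIN ANALYZE SELECT COUNT(*) FROM "businesses"
INNER JOIN categorizations ON categorizations.business_id = businesses.id
INNER JOIN postal_codes ON businesses.postal_code_id = postal_codes.id
WHERE categorizations.category_id IN (958,968,936)
AND lower(city) IN ('new york') AND lower(region) = 'new york';
ROLLBACK;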
Below are the indexes on the postal_codes table. Note - I know they are not all necessary; I'm experimenting right now, and I'll drop the redundant ones once the query performs properly (see the sketch after the listing).
\d postal_codes;
                                   Table "public.postal_codes"
     Column     |          Type          |                         Modifiers
----------------+------------------------+-----------------------------------------------------------
 id             | integer                | not null default nextval('postal_codes_id_seq'::regclass)
 code           | character varying(255) |
 city           | character varying(255) |
 region         | character varying(255) |
 country        | character varying(255) |
 num_businesses | integer                |
 region_abbr    | text                   |
Indexes:
    "postal_codes_pkey" PRIMARY KEY, btree (id)
    "idx_postal_codes_city_region_country" btree (lower(city::text), lower(region::text), country)
    "idx_postal_codes_lower_city" btree (lower(city::text))
    "idx_postal_codes_lower_region" btree (lower(region::text))
    "idx_region_city_postal_codes" btree (lower(region::text), lower(city::text))
    "index_postal_codes_on_code" btree (code)
Version and relevant tuning parameters (please let me know if I should consider others):
 server_version       | 9.3.4
 cpu_tuple_cost       | 0.01
 effective_cache_size | 16GB
 maintenance_work_mem | 1GB
 random_page_cost     | 1.1
 seq_page_cost        | 1
 shared_buffers       | 8GB
 work_mem             | 1GB
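If there are other settings worth checking, they can be pulled from pg_settings the same way; for example (illustrative - default_statistics_target is just my guess at what else might matter for the row estimates):

SELECT name, setting FROM pg_settings
WHERE name IN ('default_statistics_target', 'effective_io_concurrency');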
I also have autovacuum turned on, and I have re-analyzed businesses, categorizations, and postal_codes (even though I don't think that matters).
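For reference, the re-analyze was nothing fancier than:

ANALYZE businesses;
ANALYZE categorizations;
ANALYZE postal_codes;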