
I'm experiencing a 5x-10x slowdown when adding one condition to the WHERE clause of a join. I have verified that indexes are being used, and this is a pretty simple query with 2 joins:

This query takes an average of 0.5 seconds:

EXPLAIN ANALYZE SELECT COUNT(*) FROM "businesses" 
INNER JOIN categorizations ON categorizations.business_id = businesses.id
INNER JOIN postal_codes ON businesses.postal_code_id = postal_codes.id
WHERE categorizations.category_id IN (958,968,936)
AND lower(city) IN ('new york');


                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=60600.79..60600.80 rows=1 width=0) (actual time=741.224..741.225 rows=1 loops=1)
   ->  Hash Join  (cost=321.63..60600.78 rows=2 width=0) (actual time=23.360..740.475 rows=795 loops=1)
         Hash Cond: (businesses.postal_code_id = postal_codes.id)
         ->  Nested Loop  (cost=184.63..60400.82 rows=16784 width=4) (actual time=19.200..690.901 rows=58076 loops=1)
               ->  Bitmap Heap Scan on categorizations  (cost=184.20..17662.46 rows=16784 width=4) (actual time=19.164..131.991 rows=58076 loops=1)
                     Recheck Cond: (category_id = ANY ('{958,968,936}'::integer[]))
                     ->  Bitmap Index Scan on categorizations_category_id  (cost=0.00..180.00 rows=16784 width=0) (actual time=9.994..9.994 rows=58076 loops=1)
                           Index Cond: (category_id = ANY ('{958,968,936}'::integer[]))
               ->  Index Scan using businesses_pkey on businesses  (cost=0.43..2.54 rows=1 width=8) (actual time=0.005..0.006 rows=1 loops=58076)
                     Index Cond: (id = categorizations.business_id)
         ->  Hash  (cost=135.49..135.49 rows=121 width=4) (actual time=0.449..0.449 rows=150 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 6kB
               ->  Index Scan using idx_postal_codes_lower_city on postal_codes  (cost=0.43..135.49 rows=121 width=4) (actual time=0.037..0.312 rows=150 loops=1)
                     Index Cond: (lower((city)::text) = 'new york'::text)
 Total runtime: 741.321 ms
(15 rows)

But adding just one condition (the region) pushes the average to 4 seconds:

EXPLAIN ANALYZE SELECT COUNT(*) FROM "businesses" 
INNER JOIN categorizations ON categorizations.business_id = businesses.id
INNER JOIN postal_codes ON businesses.postal_code_id = postal_codes.id
WHERE categorizations.category_id IN (958,968,936)
AND lower(city) IN ('new york') AND lower(region) = 'new york';

                                                                                              QUERY PLAN                                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1312.76..1312.77 rows=1 width=0) (actual time=2879.764..2879.765 rows=1 loops=1)
   ->  Nested Loop  (cost=16.77..1312.76 rows=1 width=0) (actual time=4.740..2878.936 rows=795 loops=1)
         ->  Nested Loop  (cost=16.21..1281.22 rows=18 width=4) (actual time=2.259..780.067 rows=206972 loops=1)
               ->  Index Scan using idx_postal_codes_city_region_country on postal_codes  (cost=0.43..2.65 rows=1 width=4) (actual time=0.052..0.432 rows=150 loops=1)
                     Index Cond: ((lower((city)::text) = 'new york'::text) AND (lower((region)::text) = 'new york'::text))
               ->  Bitmap Heap Scan on businesses  (cost=15.78..1267.29 rows=1128 width=8) (actual time=0.377..3.179 rows=1380 loops=150)
                     Recheck Cond: (postal_code_id = postal_codes.id)
                     ->  Bitmap Index Scan on index_businesses_on_postal_code_id  (cost=0.00..15.49 rows=1128 width=0) (actual time=0.219..0.219 rows=1380 loops=150)
                           Index Cond: (postal_code_id = postal_codes.id)
         ->  Index Only Scan using index_categorizations_on_business_id_and_category_id_and_source on categorizations  (cost=0.56..1.74 rows=1 width=4) (actual time=0.008..0.008 rows=0 loops=206972)
               Index Cond: ((business_id = businesses.id) AND (category_id = ANY ('{958,968,936}'::integer[])))
               Heap Fetches: 2
 Total runtime: 2879.854 ms
(13 rows)

Note - I'm quoting average runtimes rather than asking you to rely on the single timings shown above, to head off the objection that caching is misleading me. While the tables are quite large (15 million businesses, 20 million categorizations, 1 million postal codes), I wouldn't expect a drastic change in performance when I'm only changing postal_codes conditions. In fact, I would expect it to be faster, since it could join on fewer rows. I'm hesitant to play around with the tuning options given that it's such a basic query.

Below are the indexes on the postal_codes table. Note - I know they are not all necessary; I'm experimenting right now and will delete the unnecessary ones once the query performs properly.

\d postal_codes;
                                     Table "public.postal_codes"
     Column     |          Type          |                         Modifiers                         
----------------+------------------------+-----------------------------------------------------------
 id             | integer                | not null default nextval('postal_codes_id_seq'::regclass)
 code           | character varying(255) | 
 city           | character varying(255) | 
 region         | character varying(255) | 
 country        | character varying(255) | 
 num_businesses | integer                | 
 region_abbr    | text                   | 
Indexes:
    "postal_codes_pkey" PRIMARY KEY, btree (id)
    "idx_postal_codes_city_region_country" btree (lower(city::text), lower(region::text), country)
    "idx_postal_codes_lower_city" btree (lower(city::text))
    "idx_postal_codes_lower_region" btree (lower(region::text))
    "idx_region_city_postal_codes" btree (lower(region::text), lower(city::text))
    "index_postal_codes_on_code" btree (code)

Version and relevant tuning parameters (please let me know if I should consider others):

server_version                      | 9.3.4
cpu_tuple_cost                      | 0.01
effective_cache_size                | 16GB
maintenance_work_mem                | 1GB
random_page_cost                    | 1.1
seq_page_cost                       | 1
shared_buffers                      | 8GB
work_mem                            | 1GB

I also have autovacuum turned on and have re-analyzed businesses, categorizations, and postal_codes (even though I don't think that matters).
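
For reference, the vacuum/analyze history can be double-checked from the statistics views; a quick sketch using pg_stat_user_tables:

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('businesses', 'categorizations', 'postal_codes');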

  • Looking at the expected/observed ratios, I'd say that your statistics are a bit off (in both cases, but it shows up more prominently in the second one). Plus: please add your tuning parameters and version to the question. – wildplasser May 08 '14 at 21:15
  • [This related question (and its answers) might be of help.](http://stackoverflow.com/questions/8228326/how-can-i-avoid-postgresql-sometimes-choosing-a-bad-query-plan-for-one-of-two-ne/8229000) – Erwin Brandstetter May 08 '14 at 21:19
  • @wildplasser Are the ones I added the most relevant? Can't think of any others – Tony May 08 '14 at 21:26
  • Please edit the question to show the table and index specifications. Maybe you're missing an index, or you specified an index incorrectly. – Andy Lester May 08 '14 at 21:26
  • @AndyLester Wouldn't that show up in the query plan as a sequential scan? I can add that stuff but was trying to simplify the report – Tony May 08 '14 at 21:28
  • @ErwinBrandstetter Thank you, I did see that in my search, although I'm not sure exactly where the statistics need to be improved (postal_codes or businesses), but it seems worth playing around with. – Tony May 08 '14 at 21:31
  • Parameters don't look too bad. Is your table updated frequently? Is there a possibility that dead rows exist (or, in the other case, that statistics are outdated)? – wildplasser May 08 '14 at 21:36
  • BTW: `AND lower(city) IN ('new york') AND lower(region) = 'new york';` The lower() function will (IMHO) cause the optimiser to get lost (no statistics for functions on fields, even if statistics are available on the raw columns). Can you rewrite the query without the case-folding? – wildplasser May 08 '14 at 21:40
  • @wildplasser Which table? I'd say businesses is updated quite frequently, but I don't have an intuition for what update frequency becomes problematic given that autovacuum is on. Good to know about the lack of statistics on functions. I thought I'd be golden with the right index, but I'll reconfigure to handle the case folding outside of the database and see if it makes a difference. This is a bit surprising, though, since indexes on expressions seem like a very tough feature to use if they throw off the planner – Tony May 08 '14 at 21:59
  • in the second query: `-> Nested Loop (cost=16.21..1281.22 rows=18 width=4) (actual time=2.259..780.067 rows=206972 loops=1)` looks extremely painful. – wildplasser May 08 '14 at 22:47
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/52339/discussion-between-tony-and-wildplasser) – Tony May 08 '14 at 22:48
  • @wildplasser No luck with removing the case folding. Same query time – Tony May 08 '14 at 22:54
  • You should look at the categorizations table, too. – Andy Lester May 08 '14 at 23:33
  • BTW: how large is your `work_mem` ? (your maintenance_work_mem looks rather high, IMHO) – wildplasser May 09 '14 at 06:54
  • work_mem = 1GB may be a bit too high (how many concurrent queries do you intend to serve?) BTW I added a few normalisation-tags, since this question is surely caused by (lack of) data normalisation. – wildplasser May 11 '14 at 20:52
  • @wildplasser I played around with that to no avail, but I did notice that there is a massive difference in query performance when you use `EXPLAIN ANALYZE`. I had no idea! Now the 4s query is down to 1.5s, which isn't ideal but way more acceptable – Tony May 14 '14 at 17:45
  • @wildplasser I just found that the two queries run in about the same time if I do not measure them with EXPLAIN ANALYZE. This was a total surprise to me, but I think your comments about normalization are a good idea regardless. Is it a bad idea to benchmark with EXPLAIN ANALYZE? Do you think I should answer my own question, or is it better to remove it since my question is claiming a bad measurement (although I do explicitly show that I'm using EXPLAIN ANALYZE for the benchmark)? – Tony May 14 '14 at 17:57
  • Blaming it on `EXPLAIN ANALYZE` seems a bit far-fetched to me. (did you repeat the same query twice? then the second run will always be faster, because the buffer cache has been warmed) If you notice a real difference with/without EA, maybe you should ask another question? – wildplasser May 14 '14 at 18:10
  • @wildplasser The cache is definitely warm, but if I run the same query 10 times with and without `EXPLAIN ANALYZE` I consistently get 3x-4x better performance without `EXPLAIN ANALYZE`. It sounds far-fetched, but I think that experiment certainly points to `EXPLAIN ANALYZE` being problematic for benchmarking – Tony May 14 '14 at 18:33
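
A note on the EXPLAIN ANALYZE overhead discussed in the comments above: the per-row timing instrumentation that EXPLAIN ANALYZE adds can itself be expensive on plans that touch many rows. A sketch of a lighter-weight way to keep the per-node row counts while dropping most of that overhead (the TIMING option is available from PostgreSQL 9.2 onwards):

-- same plan and row counts, but without the per-row clock calls
EXPLAIN (ANALYZE, TIMING OFF)
SELECT COUNT(*) FROM businesses
INNER JOIN categorizations ON categorizations.business_id = businesses.id
INNER JOIN postal_codes ON businesses.postal_code_id = postal_codes.id
WHERE categorizations.category_id IN (958,968,936)
AND lower(city) IN ('new york') AND lower(region) = 'new york';

For wall-clock comparisons, running the plain query under psql's \timing is closer to what the application will actually see.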

3 Answers


The real answer: normalise. In the original postal code table, too much of the {country,region,city} information is duplicated. Squeeze this "domain" out into a separate "cities" table. Example:

        -- temp schema for testing purposes.
DROP SCHEMA IF EXISTS tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;

        -- A copy of the original table
-- CREATE TABLE tmp.postal_codes_orig AS (SELECT * FROM public.postal_codes);
        -- .. which I don't have ...
CREATE TABLE tmp.postal_codes_orig
        ( id SERIAL NOT NULL PRIMARY KEY
        , code character varying(255) UNIQUE
        , city character varying(255)
        , region character varying(255)
        , country character varying(255)
        , num_businesses integer
        , region_abbr text
        );

        -- some data to test it ...
INSERT INTO tmp.postal_codes_orig ( code , city , region , country , num_businesses , region_abbr ) VALUES
 ( '3500' , 'Utrecht' , 'Utrecht', 'Nederland', 1000, 'Ut' )
,( '3501' , 'Utrecht' , 'Utrecht', 'Nederland', 1001, 'UT' )
,( '3502' , 'Utrecht' , 'Utrecht', 'Nederland', 1002, 'Utr.' )
,( '3503' , 'Utrecht' , 'Utrecht', 'Nederland', 1003, 'Utr' )
,( '3504' , 'Utrecht' , 'Utrecht', 'Nederland', 1004, 'Ut.' )
        ;

        -- normalisation: squeeze out "city" domain
CREATE TABLE tmp.cities
        ( id SERIAL NOT NULL PRIMARY KEY
        , city character varying(255)
        , region character varying(255)  -- could be normalised out ...
        , country character varying(255) -- could be normalised out ...
        , region_abbr character varying(255)
        , UNIQUE (city,region,country)
        );

    -- table with all original postal codes, but referring to the cities table instead of duplicating it
CREATE TABLE tmp.postal_codes_cities
        ( id SERIAL NOT NULL PRIMARY KEY
        , code character varying(255) UNIQUE
        , city_id INTEGER NOT NULL REFERENCES tmp.cities(id)
        , num_businesses integer NOT NULL DEFAULT 0 -- this still depends on postal code, not on city
        );
        -- extract the unique cities domain
INSERT INTO tmp.cities(city,region,country,region_abbr)
SELECT DISTINCT city,region,country
        , MIN(region_abbr)
FROM tmp.postal_codes_orig
GROUP BY city,region,country
        ;

CREATE INDEX ON tmp.cities (lower(city::text), lower(region::text), country);
CREATE INDEX ON tmp.cities (lower(city::text));
CREATE INDEX ON tmp.cities (lower(region::text));
CREATE INDEX ON tmp.cities (lower(region::text), lower(city::text));

        -- populate the new postal codes table, retaining the 'stable' ids
INSERT INTO tmp.postal_codes_cities(id, code, city_id, num_businesses)
SELECT pc.id,pc.code, ci.id, pc.num_businesses
FROM tmp.postal_codes_orig pc
JOIN tmp.cities ci ON ci.city = pc.city
        AND pc.region = ci.region
        AND pc.country = ci.country
        ;

        -- and don't forget to sync the sequence with the ids we kept
SELECT setval('postal_codes_cities_id_seq', MAX(pc.id)) FROM postal_codes_cities pc ;


        -- convenience view mimicking the original table
CREATE VIEW tmp.postal_codes AS
SELECT pc.id AS id
        , pc.code AS code
        , ci.city AS city
        , ci.region AS region
        , ci.country AS country
        , pc.num_businesses AS num_businesses
        , ci.region_abbr AS region_abbr
FROM tmp.postal_codes_cities pc
JOIN tmp.cities ci ON ci.id = pc.city_id
        ;

SELECT * FROM tmp.postal_codes;

The foreign keys from the other tables will of course have to be adjusted to point to the new postal_codes_cities.id (or to "code", which looks like the natural key to me).

BTW: with a normalised schema, you would not even need the silly indexes on lower(region) and lower(city), because each name is only stored once, so you can force it into canonical form.
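
As an illustration of the payoff, the original count could then be written against the new tables roughly like this (a sketch only; it assumes businesses.postal_code_id keeps pointing at the retained postal_codes_cities.id values and that city/region are stored in canonical form):

SELECT COUNT(*)
FROM cities ci
JOIN postal_codes_cities pc ON pc.city_id = ci.id
JOIN businesses bu          ON bu.postal_code_id = pc.id
JOIN categorizations ca     ON ca.business_id = bu.id
WHERE ci.city = 'New York'          -- no lower() needed once the names are canonical
  AND ci.region = 'New York'
  AND ca.category_id IN (958, 968, 936);

The city/region filter now hits the small cities table once, and the postal codes are reached through a plain integer join.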

wildplasser
  • code isn't necessarily unique, because the same one can exist in multiple countries. What you're suggesting is certainly possible, though, and it makes sense that referencing the city would increase performance, because it takes postal_codes out of consideration. However, I'm still a little uneasy about why the Postgres query planner does so much worse on the second query – Tony May 10 '14 at 21:52
  • `code isn't necessarily unique because there can be the same one in multiple countries` Then what is it? As a foreigner, I cannot see the semantics of code; it may be implicit. Please make it explicit. – wildplasser May 11 '14 at 00:00
  • Sorry about the confusion; I can understand why that would be confusing. That is the actual postal code. The table is called postal_codes, so I thought it would be weird to call a column "postal_code", but now that I think about it, the table is pretty poorly named and probably worth a refactor/renaming. I hope it's clear now – Tony May 11 '14 at 19:24
  • The reason for the underestimation of the NY/NY rows is that the domains are highly correlated (almost dependent), and PG uses one-dimensional statistics for both the state and the city fields. WRT the postal_code field: either make it non-UNIQUE, or add a country code to the table and force {country,postal_code} to be unique. – wildplasser May 11 '14 at 19:45
  • I've heard there can be the same postal codes in different areas of a country. I guess I figured it's best to not assume much. code actually is not unique so maybe you just assumed that? I don't think my schema implies that anywhere – Tony May 11 '14 at 20:27
  • I don't know your requirements. If you really have to generalise (to all the postal codes existing on this globe) you'll probably have a data modelling problem anyway. Adding a "codetype" field to the key will be the most probable shortcut solution (where codetype := {US, Canada, France, Netherlands, ...}, but preferably an int field pointing to a reference table ("domain")). – wildplasser May 11 '14 at 20:39
  • After normalizing, the queries appear to perform slower but not so much that the normalization effort is problematic. It's good design – Tony May 15 '14 at 03:48

The planner does not know that all of New York City is in the New York region; it assumes those selectivities are independent and multiplies them, which leads it to badly underestimate the row count.
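
You can see the compounded estimate in isolation by explaining just the postal_codes lookup (a sketch, using the expression indexes already shown in the question):

-- city alone: the estimate is close (121 predicted vs 150 actual in the first plan above)
EXPLAIN ANALYZE SELECT id FROM postal_codes WHERE lower(city) = 'new york';

-- city AND region: the planner multiplies the two selectivities as if they were
-- independent, so the estimate collapses to rows=1 even though the same ~150 rows match
EXPLAIN ANALYZE SELECT id FROM postal_codes
WHERE lower(city) = 'new york' AND lower(region) = 'new york';

That rows=1 estimate is what makes the nested-loop plan in the second query look cheap to the planner.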

jjanes
  • It's right. There could be a New York in some other region, theoretically. That being said, querying for all of those postal codes in a separate query runs in 10ms. – Tony May 09 '14 at 03:10

The easiest way to avoid reshuffling the NY/NY legs of the query is to put them into a CTE (on 9.3 a CTE is an optimisation fence: it is materialised once and not broken up by the planner):

-- EXPLAIN ANALYZE
WITH ny AS (
        SELECT pc.id 
        FROM postal_codes pc 
        WHERE lower(pc.city) = 'new york' AND lower(pc.region) = 'new york'
        )
SELECT COUNT(*) FROM businesses bu
JOIN categorizations ca ON ca.business_id = bu.id
JOIN ny ON bu.postal_code_id = ny.id
WHERE ca.category_id IN (958,968,936)
        ;
joop