0

I have three tables in my app, call them tableA, tableB, and tableC. tableA has fields for tableB_id and tableC_id, with indexes on both. tableB has a field foo with an index, and tableC has a field bar with an index.

When I do the following query:

select * 
from tableA 
  left outer join tableB on tableB.id = tableA.tableB_id 
where lower(tableB.foo) = lower(my_input)

it is really slow (~1 second).

When I do the following query:

select * 
from tableA 
   left outer join tableC on tableC.id = tabelA.tableC_id 
where lower(tableC.bar) = lower(my_input)

it is really fast (~20 ms).

From what I can tell, the tables are about the same size.

Any ideas as to the huge performance difference between the two queries?


UPDATES

Table sizes:

  • tableA: 2061392 rows

  • tableB: 175339 rows

  • tableC: 1888912 rows


tag info

Postgres version - 9.3.5

Full text of the queries are above.

Explain plans - tableB tableC

Relevant info from tables:

  • tableA
    • tableB_id, integer, no modifiers, storage plain
      • "index_tableA_on_tableB_id" btree (tableB_id)
    • tableC_id, integer, no modifiers, storage plain,
      • "index_tableA_on_tableB_id" btree (tableC_id)
  • tableB
    • id, integer, not null default nextval('tableB_id_seq'::regclass), storage plain
      • "tableB_pkey" PRIMARY_KEY, btree (id)
    • foo, character varying(255), no modifiers, storage extended
      • "index_tableB_on_lower_foo_tableD" UNIQUE, btree (lower(foo::text), tableD_id)
        • tableD is a separate table that is otherwise irrelevant
  • tableC
    • id, integer, not null default nextval('tableC_id_seq'::regclass), storage plain
      • "tableC_pkey" PRIMARY_KEY, btree (id)
    • bar, character varying(255), no modifiers, storage extended
      • "index_tableC_on_tableB_id_and_bar" UNIQUE, btree (tableB_id, bar)
      • "index_tableC_on_lower_bar" btree (lower(bar::text))

Hardware:

  • OS X 10.10.2

  • CPU: 1.4 GHz Intel Core i5

  • Memory: 8 GB 1600 MHz DDR3

  • Graphics: Intel HD Graphics 5000 1536 MB


Solution

Looks like running vacuum and then analyze on all three tables fixed the issue. After running the commands, the slow query started using "index_patients_on_foo_tableD".

Community
  • 1
  • 1
Dillon Welch
  • 481
  • 4
  • 15
  • 1
    Could you verify the DDL of the three tables and the actual data to make sure TableB and TableC are equivalent. Also, did you repeat the test good number times (maybe 10) and got the same high difference in the performance? Also, is my_input also the same? Then, explain plan for both queries. – Edmon Feb 17 '15 at 00:45
  • Unless you did something you haven't told us about, expressions like `lower(tableB.foo) = lower(my_input)` won't use an index. Paste CREATE TABLE and some INSERT statements into your question. – Mike Sherrill 'Cat Recall' Feb 17 '15 at 00:58
  • 1
    The question needs proper information to work with. Consider: http://stackoverflow.com/tags/postgresql-performance/info – Erwin Brandstetter Feb 17 '15 at 05:44
  • @Edmon my mistake on the two tables, they are not actually equivalent. tableC (the fast one) has about 10x more rows than tableB. Tests were repeated many times on multiple dev machines and in production. my_input is the same for both queries (in the explain plans it is '123456'). Explain plans have been provided. – Dillon Welch Feb 17 '15 at 18:45
  • @MikeSherrill'CatRecall' There is an index on `lower(tableB.foo)` that is not being used, but there is an index on `lower(tableC.bar)` that IS being used. I've provided table info in my edits. – Dillon Welch Feb 17 '15 at 18:47
  • Much better. But what you get with **`\d+ tbl` in [psql](http://www.postgresql.org/docs/current/interactive/app-psql.html)** or complete `CREATE TABLE` scripts are *much* more useful for a table description than a manual description. Nothing lost in translation, much easier to read for regulars. Hardware is less important to this question, graphics not relevant. **Actual times** are missing from your `EXPLAIN` output. Use `EXPLAIN (BUFFERS, ANALYZE)`, not just `EXPLAIN`. All of this is [in the cited instructions](http://stackoverflow.com/tags/postgresql-performance/info). – Erwin Brandstetter Feb 17 '15 at 20:37
  • @Erwin, the text I provided was straight from `\d+ tbl` (except for changing the table/field names for privacy). Sorry about the explain output, I am using pgAdmin3 and used the explain shortcut with buffers selected instead of manually typing `EXPLAIN (BUFFERS, ANALYZE). Looks like it doesn't matter anyway since vacuuming the table solved the issue. – Dillon Welch Feb 17 '15 at 21:11
  • 1
    If a manual `VACUUM` solved the issue, you should inspect your [autovacuum settings](http://www.postgresql.org/docs/current/interactive/routine-vacuuming.html) Normally autovacuum should kick in and clean up automatically - unless you constantly have locks on the table. – Erwin Brandstetter Feb 17 '15 at 21:34

2 Answers2

1

For starters, your LEFT JOIN is counteracted by the predicate on the left table and is forced to act like an [INNER] JOIN. Replace with:

SELECT *
FROM   tableA a
JOIN   tableB b ON b.id = a.tableB_id
WHERE  lower(b.foo) = lower(my_input);

Or, if you actually want the LEFT JOIN to include all rows from tableA:

SELECT *
FROM   tableA a
LEFT   JOIN tableB b ON b.id = a.tableB_id
                    AND lower(b.foo) = lower(my_input);

I think you want the first one.

An index on (lower(foo::text)) like you posted is syntactically invalid. You better post the verbatim output from \d tbl in psql like I commented repeatedly. A shorthand syntax for a cast (foo::text) in an index definition needs more parentheses, or use the standard syntax: cast(foo AS text):

But that's also unnecessary. You can just use the data type (character varying(255)) of foo. Of course, the data type character varying(255) rarely makes sense in Postgres to begin with. The odd limitation to 255 characters is derived from limitations in other RDBMS which do not apply in Postgres. Details:

Be that as it may. The perfect index for this kind of query would be a multicolumn index on B - if (and only if) you get index-only scans out of this:

CREATE INDEX "tableB_lower_foo_id" ON tableB (lower(foo), id);

You can then drop the mostly superseded index "index_tableB_on_lower_foo". Same for tableC.
The rest is covered by the (more important!) indices in table A on tableB_id and tableC_id.

If there are multiple rows in tableA per tableB_id / tableC_id, then either one of these competing commands can swing the performance to favor the respective query by physically clustering related rows together:

CLUSTER tableA USING "index_tableA_on_tableB_id";
CLUSTER tableA USING "index_tableA_on_tableC_id";

You can't have both. It's either B or C. CLUSTER also does everything a VACUUM FULL would do. But be sure to read the details first:

And don't use mixed case identifiers, sometimes quoted, sometimes not. This is very confusing and is bound to lead to errors. Use legal, lower-case identifiers exclusively - then it doesn't matter if you double-quote them or not.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • I believe I've provided all the info you requested. Let me know if I missed something. As for your query suggestions, the first one I stopped after it ran for 20 s without finishing. The second one took the same amount of time as my current query. – Dillon Welch Feb 17 '15 at 18:50
  • Correctness comes before performance. My two queries do different things. Not sure, which is right for you. The `LEFT JOIN` in your query is just wrong. I suspect you want my first query - that's what your query ends up doing now. Also added a couple more hints and pointers. – Erwin Brandstetter Feb 17 '15 at 22:27
  • Yes, I did want your first query. Running it again after the vacuum shows that it runs just as fast and returns the same results. Thank you! – Dillon Welch Feb 17 '15 at 22:35
  • I was able to remove the index that you suggested. Thanks again! The `character varying(255)` is coming from Ruby on Rails. I have to anonymize any schema info before posting it here, hence the `tableA` and `tableB` examples. I've given, what I can tell, is the exact output from `\d tbl` with the actual names replaced with `tableA`, `tableB`, etc. There was one index that I was missing a parenthesis on; I've fixed that now. Let me know if there's anything still weird. The indexes are also being created through Ruby on Rails. – Dillon Welch Feb 17 '15 at 23:22
1

The other thing is that you have your indexed columns queried as lower() , which can also be creating a partial index when the query is running.

If you will always query the column as lower() then your column should be indexed as lower(column_name) as in:

create index idx_1 on tableb(lower(foo));

Also, have you looked at the execution plan? This will answer all your questions if you can see how it is querying the tables.

Honestly, there are many factors to this. The best solution is to study up on INDEXES, specifically in Postgres so you can see how they work. It is a bit of holistic subject, you can't really answer all your problems with a minimal understanding of how they work.

For instance, Postgres has an initial "lets look at these tables and see how we should query them" before the query runs. It looks over all tables, how big each of the tables are, what indexes exist, etc. and then figures out how the query should run. THEN it executes it. Oftentimes, this is what is wrong. The engine incorrectly determines how to execute it.

A lot of the calculations of this are done off of the summarized table statistics. You can reset the summarized table statistics for any table by doing:

vacuum [table_name];

(this helps to prevent bloating from dead rows)

and then:

analyze [table_name];

I haven't always seen this work, but often times it helps.

ANyway, so best bet is to:

a) Study up on Postgres indexes (a SIMPLE write up, not something ridiculously complex) b) Study up the execution plan of the query c) Using your understanding of Postgres indexes and how the query plan is executing, you cannot help but solve the exact problem.

Walker Farrow
  • 3,579
  • 7
  • 29
  • 51
  • Huh, looks like this was the issue. After I ran vacuum/analyze on each table, the query started using the index I was expecting it to and it was very fast. Thanks! – Dillon Welch Feb 17 '15 at 21:07