3

Simplified table:

CREATE TABLE products (
product_no integer PRIMARY KEY,
sales integer,
status varchar(16),
category varchar(16));

CREATE INDEX index_products_sales ON products (sales);
CREATE INDEX index_products_status ON products (status);
CREATE INDEX index_products_category ON products (category);

PostgreSQL version is 8.4. Columns 'status' and 'category'

There are 20 million products/rows spread across 15 categories.

One of the most used queries is getting the three most sold products, excluding products in categories 'cat3' and 'cat7':

SELECT product_no, sales 
FROM products 
WHERE status = 'something' AND category NOT IN ('cat3', 'cat7') 
ORDER BY sales DESC 
LIMIT 3;

Limit  (cost=0.00..8833.39 rows=3 width=12) (actual time=9235.332..9356.284 rows=3 loops=1)
   ->  Index Scan using index_products_sales on products  (cost=0.00..68935806.85 rows=23412 width=12) (actual time=9235.327..9356.278 rows=3 loops=1)
     Filter: (((category)::text <> ALL ('{cat3,cat7}'::text[])) AND ((status)::text = 'something'::text))

What would be the best index for making this specific query run faster?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    CREATE INDEX index_special ON products (status, sales DESC NULLS LAST) WHERE NOT (category = 'cat3' and category = 'cat7'); – j.p. Jun 21 '12 at 19:03
  • 1
    @jug: The `WHERE` condition of the index is logically wrong, all rows qualify this way because no row can match both categories at the same time. I added an answer. – Erwin Brandstetter Jul 22 '12 at 02:52
  • @ErwinBrandstetter: You're right, of course. – j.p. Jul 23 '12 at 06:27

2 Answers2

13

Create a partial, multicolumn index with this particular sort order:

CREATE INDEX products_status_sales_partial_idx ON products (status, sales DESC)
WHERE  category NOT IN ('cat3','cat7');

Modify your query slightly:

SELECT product_no, sales 
FROM   products 
WHERE  status = 'something'
AND    category NOT IN ('cat3', 'cat7') 
ORDER  BY status, sales DESC 
LIMIT  3;

Adding status as first element of the ORDER BY clause seems redundant and pointless. But give it a try.

Why?

The query planner is not smart enough to understand, that with

WHERE  status = 'something' ...
ORDER  BY sales DESC

the sort order of the index (status, sales DESC) matches as a logical consequence. So it is going to read all qualifying rows, sort and pick the top 3.

By adding status to the ORDER BY you enable the query planner to read the top 3 entries from the index directly. Expect a speed-up by several orders of magnitude.

Tested with PostgreSQL 8.4 and 9.1.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
2

I think a b-tree index is still your best bet. I could be wrong, though. I think I would test two things.

First, a partial index on category that excludes 'cat3' and 'cat7'.

CREATE INDEX index_products_category ON products (category)
  WHERE category NOT IN ('cat3','cat7');

Second, a descending sort on sales.

CREATE INDEX index_products_sales ON products (sales DESC);

Either one of these might slow down other queries, though, so you might need one or both of these in addition to the existing indexes.

Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185
  • Thanks for the answer, but should it really say "(category = 'cat3' **and** category = 'cat7')" in the WHERE NOT-clause? – Backend Viking Jun 26 '12 at 12:46
  • No, it shouldn't, and I'm not sure why I wrote anything like that WHERE clause in the first place. I should have been awake by that time. `WHERE category NOT IN ('cat3','cat7')` is how I'd normally think. Here, on a table of a million rows, this runs to completion in about .06ms. – Mike Sherrill 'Cat Recall' Jun 26 '12 at 13:57