1

Let's assume I have a table named customer like this:

+----+------+----------+-----+
| id | name | lastname | age |
+----+------+----------+-----+
| .. | ...  |   ....   | ... |

and I need to perform the following query:

SELECT * FROM customer WHERE ((name = 'john' OR lastname = 'doe') AND age = 21)

I'm aware of how single and multi-column indexes work, so I created these ones:

(name, age)
(lastname, age)

Is that all the indexes I need?

The above condition can be rephrased as:

... WHERE ((name = 'john' AND age = 21) OR (lastname = 'doe' AND age = 21)

but I'm not sure how smart RDBMS are, and if those indexes are the correct ones

Oscar Mederos
  • 29,016
  • 22
  • 84
  • 124
  • 2
    Why not perform some diagnostics against the queries? I.e. if you're using SQL Server use the execution plan and see if the indexes save cost. – Darren Nov 28 '14 at 21:23
  • Short of putting the whole table into an index I dont think there much optimization that can be done for that query – Mihai Nov 28 '14 at 21:32
  • @Mihai my table have much more fields than that... that's just an example. – Oscar Mederos Nov 28 '14 at 21:35
  • @DarrenDavies unfortunately I don't have enough data to measure tests... and I'm not aware of any tool that can tell me if that is currently working in my db schema or not . – Oscar Mederos Nov 28 '14 at 21:36
  • Questions concerning indexing or performance may depend on your version of Postgres. *Always* provide it. – Erwin Brandstetter Nov 29 '14 at 02:15

2 Answers2

1

Your approach is reasonable. Two factors are essential here:

  1. Postgres can combine multiple indexes very efficiently with bitmap index scans.

  2. B-tree index usage is by far most effective when only leading columns of the index are involved.

Test case

If you don't have enough data to measure tests, you can always whip up a quick test case like this:

CREATE TABLE customer (id int, name text, lastname text, age int);

INSERT INTO customer
SELECT g
     , left(md5('foo'::text || g%500) , 3 + ((g%5)^2)::int)
     , left(md5('bar'::text || g%1000), 5 + ((g%5)^2)::int)
     , ((random()^2) * 100)::int
FROM   generate_series(1, 30000) g; -- 30k rows for quick test case

For your query (reformatted):

SELECT *
FROM   customer
WHERE (name = 'john' OR lastname = 'doe')
AND    age = 21;

I would go with

CREATE INDEX customer_age_name_idx ON customer (age, name);
CREATE INDEX customer_age_lastname_idx ON customer (age, lastname);

However, depending on many factors, a single index with all three columns and age as first may be able to deliver similar performance. The rule of thumb is to create as few indexes as possible and as many as necessary.

CREATE INDEX customer_age_lastname_name_idx ON customer (age, lastname, name);

The check on (age, name) is potentially slower in this case, but depending on selectivity of the first column it may not matter much.

Updated SQL Fiddle.

Why age first in the index?

This is not very important and needs deeper understanding to explain. But since you ask ...

The order of columns doesn't matter for the 2-column indexes customer_age_name_idx and customer_age_lastname_idx. Details and a test-case:

I still put age first to stay consistent with the 3rd index I suggested customer_age_lastname_name_idx, where the order of columns does matter in multiple ways:

Most importantly, both your predicates (age, name) and (age, lastname) share the column age. B-tree indexes are (by far) most effective on leading columns, so putting age first benefits both.

And, less importantly, but still relevant: the size of the index is smaller this way due to data type characteristics, alignment, padding and page layout of index pages.

age is a 4-byte integer and must be aligned at multiples of 4 bytes in the data page. text is of variable length and has no alignment restrictions. Putting the integer first or last is more efficient due to the rules of "column tetris". I added another index on (lastname, age, name) (age in the middle!) to the fiddle just to demonstrate it's ~ 10 % bigger. No space lost to additional padding, which results in a smaller index. And size matters.

For the same reasons it would be better to reorder columns in the demo table like this: (id, age, name, lastname). If you want to learn why, start here:

Everything I wrote is for the case at hand. If you have other queries / other requirements, the resulting strategy may change.

UNION query equivalent?

Note that a UNION query may or may not return the same result. It folds duplicate rows, which your original does not. Even if you don't have complete duplicates in your table, you may still see this effect with a subset of columns in the SELECT list. Do not blindly substitute with a UNION query. It's not going to be faster anyway.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thank you very much for your response ;) I have a quick question... why does it matter, in my query, `(age, name)` vs `(name, age)`? – Oscar Mederos Nov 30 '14 at 00:35
  • @OscarMederos: Well, it doesn't really matter for the 2-column index. But it matters for the 3-column index - and also for the underlying table. – Erwin Brandstetter Nov 30 '14 at 04:23
0

Turn the OR into two queries UNIONed:

SELECT * FROM Customer WHERE Age = 21 AND Name = 'John'
UNION
SELECT * FROM Customer WHERE Age = 21 AND LastName = 'Doe'

Then create an index over (Age, Name) and another over (Age, LastName).

Ricardo Peres
  • 13,724
  • 5
  • 57
  • 74
  • 1
    Thanks for your reply. I know I could do that, but that's not exactly my question... – Oscar Mederos Nov 28 '14 at 21:43
  • Yes, those are the indexes. But it is recommended to use UNION instead of OR. – Ricardo Peres Nov 28 '14 at 21:44
  • @RicardoPeres: That "recommendation" is news to me. Who's recommending that? – Erwin Brandstetter Nov 29 '14 at 01:59
  • http://www.databasechannel.com/AccessArticles/Article_ORAND_UseUnionInsteadofOR.html – Ricardo Peres Nov 29 '14 at 08:07
  • The `UNION` could be optimal for sqlserver, but it definitely is slower for Postgres. (and the `UNION` should at least be an `UNION ALL`) – wildplasser Nov 29 '14 at 13:45
  • You are right about UNION ALL, my fault. But like I said, and there are lots of links about it, the performance of UNION os gwnerally better, for normal cases. – Ricardo Peres Nov 29 '14 at 13:51
  • 1
    Your [1st link](http://www.databasechannel.com/AccessArticles/Article_ORAND_UseUnionInsteadofOR.html) is for Access 2007 (!), your [2nd](http://sqlserverplanet.com/optimization/using-union-instead-of-or) and [3rd](http://sqltouch.blogspot.pt/2013/03/or-vs-union-all-have-you-thought-about.html) are for SQL Server. All of them address implementation details of the respective MS products and are irrelevant for Postgres (or other SQL implementations). – Erwin Brandstetter Nov 30 '14 at 02:56
  • @wildplasser: `UNION ALL` may or may not be "better". It's even further away from the original as `UNION`. The former potentially adds additional duplicate rows in the result, the latter may remove duplicates. Neither *generally* guarantees an identical result to the original query. If the original query never returns duplicate rows, `UNION` can emulate it, but ***not*** `UNION ALL`. – Erwin Brandstetter Nov 30 '14 at 03:00
  • Oops, my mistake. Except for the performance part: `UNION` is terrible. – wildplasser Nov 30 '14 at 12:20