What indexes do I need to speed up AND/OR SQL queries

Question

Let's assume I have a table named customer like this:

+----+------+----------+-----+
| id | name | lastname | age |
+----+------+----------+-----+
| .. | ...  |   ....   | ... |

and I need to perform the following query:

SELECT * FROM customer WHERE ((name = 'john' OR lastname = 'doe') AND age = 21)

I'm aware of how single and multi-column indexes work, so I created these ones:

(name, age)
(lastname, age)

Is that all the indexes I need?

The above condition can be rephrased as:

... WHERE ((name = 'john' AND age = 21) OR (lastname = 'doe' AND age = 21)

but I'm not sure how smart RDBMS are, and if those indexes are the correct ones

Why not perform some diagnostics against the queries? I.e. if you're using SQL Server use the execution plan and see if the indexes save cost. — Darren, Nov 28 '14 at 21:23
Short of putting the whole table into an index I dont think there much optimization that can be done for that query — Mihai, Nov 28 '14 at 21:32
@Mihai my table have much more fields than that... that's just an example. — Oscar Mederos, Nov 28 '14 at 21:35
@DarrenDavies unfortunately I don't have enough data to measure tests... and I'm not aware of any tool that can tell me if that is currently working in my db schema or not . — Oscar Mederos, Nov 28 '14 at 21:36
Questions concerning indexing or performance may depend on your version of Postgres. *Always* provide it. — Erwin Brandstetter, Nov 29 '14 at 02:15

score 1 · Accepted Answer · edited May 23 '17 at 12:25

Your approach is reasonable. Two factors are essential here:

Postgres can combine multiple indexes very efficiently with bitmap index scans.
- PostgreSQL versus MySQL for EAV structures storage
B-tree index usage is by far most effective when only leading columns of the index are involved.
- Is a composite index also good for queries on the first field?
- Working of indexes in PostgreSQL

Test case

If you don't have enough data to measure tests, you can always whip up a quick test case like this:

CREATE TABLE customer (id int, name text, lastname text, age int);

INSERT INTO customer
SELECT g
     , left(md5('foo'::text || g%500) , 3 + ((g%5)^2)::int)
     , left(md5('bar'::text || g%1000), 5 + ((g%5)^2)::int)
     , ((random()^2) * 100)::int
FROM   generate_series(1, 30000) g; -- 30k rows for quick test case

For your query (reformatted):

SELECT *
FROM   customer
WHERE (name = 'john' OR lastname = 'doe')
AND    age = 21;

I would go with

CREATE INDEX customer_age_name_idx ON customer (age, name);
CREATE INDEX customer_age_lastname_idx ON customer (age, lastname);

However, depending on many factors, a single index with all three columns and age as first may be able to deliver similar performance. The rule of thumb is to create as few indexes as possible and as many as necessary.

CREATE INDEX customer_age_lastname_name_idx ON customer (age, lastname, name);

The check on (age, name) is potentially slower in this case, but depending on selectivity of the first column it may not matter much.

Updated SQL Fiddle.

Why `age` first in the index?

This is not very important and needs deeper understanding to explain. But since you ask ...

The order of columns doesn't matter for the 2-column indexes customer_age_name_idx and customer_age_lastname_idx. Details and a test-case:

Multicolumn index and performance

I still put age first to stay consistent with the 3rd index I suggested customer_age_lastname_name_idx, where the order of columns does matter in multiple ways:

Most importantly, both your predicates (age, name) and (age, lastname) share the column age. B-tree indexes are (by far) most effective on leading columns, so putting age first benefits both.

And, less importantly, but still relevant: the size of the index is smaller this way due to data type characteristics, alignment, padding and page layout of index pages.

age is a 4-byte integer and must be aligned at multiples of 4 bytes in the data page. text is of variable length and has no alignment restrictions. Putting the integer first or last is more efficient due to the rules of "column tetris". I added another index on (lastname, age, name) (age in the middle!) to the fiddle just to demonstrate it's ~ 10 % bigger. No space lost to additional padding, which results in a smaller index. And size matters.

For the same reasons it would be better to reorder columns in the demo table like this: (id, age, name, lastname). If you want to learn why, start here:

Everything I wrote is for the case at hand. If you have other queries / other requirements, the resulting strategy may change.

`UNION` query equivalent?

Note that a UNION query may or may not return the same result. It folds duplicate rows, which your original does not. Even if you don't have complete duplicates in your table, you may still see this effect with a subset of columns in the SELECT list. Do not blindly substitute with a UNION query. It's not going to be faster anyway.

Thank you very much for your response ;) I have a quick question... why does it matter, in my query, `(age, name)` vs `(name, age)`? — Oscar Mederos, Nov 30 '14 at 00:35
@OscarMederos: Well, it doesn't really matter for the 2-column index. But it matters for the 3-column index - and also for the underlying table. — Erwin Brandstetter, Nov 30 '14 at 04:23

score 0 · Answer 2 · answered Nov 28 '14 at 21:39

0

Turn the OR into two queries UNIONed:

SELECT * FROM Customer WHERE Age = 21 AND Name = 'John'
UNION
SELECT * FROM Customer WHERE Age = 21 AND LastName = 'Doe'

Then create an index over (Age, Name) and another over (Age, LastName).

answered Nov 28 '14 at 21:39

Ricardo Peres

13,724
5
57
74

1

Thanks for your reply. I know I could do that, but that's not exactly my question... – Oscar Mederos Nov 28 '14 at 21:43
Yes, those are the indexes. But it is recommended to use UNION instead of OR. – Ricardo Peres Nov 28 '14 at 21:44
@RicardoPeres: That "recommendation" is news to me. Who's recommending that? – Erwin Brandstetter Nov 29 '14 at 01:59
http://www.databasechannel.com/AccessArticles/Article_ORAND_UseUnionInsteadofOR.html – Ricardo Peres Nov 29 '14 at 08:07
The `UNION` could be optimal for sqlserver, but it definitely is slower for Postgres. (and the `UNION` should at least be an `UNION ALL`) – wildplasser Nov 29 '14 at 13:45
You are right about UNION ALL, my fault. But like I said, and there are lots of links about it, the performance of UNION os gwnerally better, for normal cases. – Ricardo Peres Nov 29 '14 at 13:51
1

Your [1st link](http://www.databasechannel.com/AccessArticles/Article_ORAND_UseUnionInsteadofOR.html) is for Access 2007 (!), your [2nd](http://sqlserverplanet.com/optimization/using-union-instead-of-or) and [3rd](http://sqltouch.blogspot.pt/2013/03/or-vs-union-all-have-you-thought-about.html) are for SQL Server. All of them address implementation details of the respective MS products and are irrelevant for Postgres (or other SQL implementations). – Erwin Brandstetter Nov 30 '14 at 02:56
@wildplasser: `UNION ALL` may or may not be "better". It's even further away from the original as `UNION`. The former potentially adds additional duplicate rows in the result, the latter may remove duplicates. Neither *generally* guarantees an identical result to the original query. If the original query never returns duplicate rows, `UNION` can emulate it, but ***not*** `UNION ALL`. – Erwin Brandstetter Nov 30 '14 at 03:00
Oops, my mistake. Except for the performance part: `UNION` is terrible. – wildplasser Nov 30 '14 at 12:20

What indexes do I need to speed up AND/OR SQL queries

2 Answers2

Test case

Why age first in the index?

UNION query equivalent?

Why `age` first in the index?

`UNION` query equivalent?