1

In a Postgres database, I've noticed I can add a compound index key on the same set of columns, but in different orders. E.g:

CREATE INDEX index_1 ON public.A USING btree (x, y);
CREATE INDEX index_2 ON public.A USING btree (y, x);

Note the indexes only differ by the order of the columns indexed.

I would have expected Postgres to treat the column list as an unordered set.

Would this ability to add two similar indexes imply that the order of the conditions in the WHERE and ON clauses matters for retrieval speed?

  • 2
    No, the list of columns is not an unordered set. https://use-the-index-luke.com/sql/anatomy –  Dec 06 '22 at 14:09
  • 1
    For "seek" operators (predicates with "=" sign) the order does not matter between those two indexes (OK, there's a difference but is marginal). For "range scan" operators (predicates with ">" and/or "<" signs) those two indexes offer vastly different solutions. – The Impaler Dec 06 '22 at 14:16
  • 2
    @TheImpaler: `where x = 10` will make use of the first index and `where y = 10` will make use of the second index –  Dec 06 '22 at 14:36
  • @a_horse_with_no_name Yes, that's also true, when using parts of an index for seeking or scanning. – The Impaler Dec 06 '22 at 14:43
  • @a_horse_with_no_name thanks for the link! Regarding your where clauses, I should mention that I also have indexes on the individual respective columns themselves, so I'd assume those two where clauses you bring up would be covered by those actually. – Seán Healy Dec 06 '22 at 14:49
  • If you have an index on `(x,y)` you can drop your index on `(x)` –  Dec 06 '22 at 14:50
  • @TheImpaler do you mean specifically range scans that use the columns together? e.g. WHERE (a, b) < (1,10). Assuming we also have indexes on a and b respectively, why would the order of a, b or b,a matter in an additional compound index while using comparators in any other way than the previous example? I.e. would it make any difference in the case of WHERE a < 1 and b < 10? – Seán Healy Dec 06 '22 at 14:52
  • @SeánHealy Yes, for inequality searches it does make a difference. If you search by `(a, b) < (1, 10)` then a row with `(0, 25)` will be selected. If you search with `a < 1 and b < 10` that row won't be selected. – The Impaler Dec 06 '22 at 14:56
  • @TheImpaler to clarify, I wasn't suggesting that my two examples were programmatically equivalent. I was specifically asking if the speed boost would only apply only in the case of clauses like `WHERE (a, b) < (1, 10)`. Supposing I have indexes on each of the columns respectively, would the addition of two compound indexes (over x,y and y,x) cause any difference in the performance of `WHERE a < 1 and b < 10`? – Seán Healy Dec 06 '22 at 15:08
  • @SeánHealy Different indexes speed up different queries. Each one of the indexex (x), (y), (x, y), (y, x) can be very performant for one query and really slow for another query. It's important to decide which queries you want to optimize and take that as an input to decide which indexes to create. – The Impaler Dec 06 '22 at 15:30
  • I'm doing research with my data, so I basically want to optimise for a tonne of "potential" queries. If I can create an index on a column or group of columns, more often than not, I create the index. I just want to figure out if indexing by x,y rather than by y,x would impact any queries other than `WHERE (x, y) < (1, 10)`-style ones? I rarely make queries like that, so I may be okay with just indexing by y,x. Whereas, if lacking an index on the columns in the opposite order (x,y) has an impact anywhere else, then I would try adding that index too. – Seán Healy Dec 06 '22 at 16:43
  • You might like my answer here: https://stackoverflow.com/questions/24315151/does-order-of-fields-of-multi-column-index-in-mysql-matter/24315527#24315527 – Bill Karwin Dec 06 '22 at 17:28

1 Answers1

1

I would have expected Postgres to treat the column list as an unordered set.

There is a very simple reason why this is not the case: computer memory is not multi-dimensional.

Imagine you have 10 items, each with a unique combination of x and y. On paper, you could plot them on a grid, with x on one axis, and y on the other. To find all values for a particular x, you could look along the appropriate row; for all values for a particular y, look down the appropriate column. If you need three variables, it becomes harder to draw on paper, but you can visualise a cube; mathematicians will visualise higher dimensional "hypercubes".

Computer memory (and storage) doesn't work that way; it's just a linear list of records, one after another. So your two-element index isn't stored as a grid, it's stored as a sorted list. If the index is on (x, y), the basic arrangement looks like this:

  • Sort all the records by x
  • Within each group that has the same x, sort those records by y

(The actual structure is a bit more complex than just a list, so that it's easy to jump to the right group, and you don't have to rewrite the whole index every time a low value of x needs to be added, but I think the general principle is good enough for the current discussion.)

Now, if you want to find all elements with a particular x, you can jump to that group of records, and return them all. But if you want to find all elements with a particular y, you have to look through every group to see if any are there.

Would this ability to add two similar indexes imply that the order of the conditions in the WHERE and ON clauses matters for retrieval speed?

The order they appear in the query generally doesn't matter, because a query defines a logical result, not an order of execution. What matters is the order the DBMS wants to check them to return that data efficiently.

For instance, imagine we know that x and y are both evenly distributed from 0 to 99. Given the clause WHERE x = 0 AND y = 42, it doesn't matter which index we use:

  • If we use an index on (x, y), we jump to the bucket for x=0 and find items in that bucket with y=42
  • If we use an index on (y, x), we jump to the bucket for y=42 and find items in that bucket with x=0

But if want to process the clause WHERE x <> 0 AND y = 42, it makes a huge difference:

  • If we use an index on (x, y), we have to look through the buckets for x=1, x=2, ... x=99; then in each bucket, look for records with y=42
  • If we use an index on (y, x), we can jump straight to the bucket for y=42, skip past the items for x=0, and return the rest of the bucket

This is the kind of decision that the query planner in a DBMS will evaluate: which order of operations is most likely, based on available statistics about the data, to result in the fastest results.

While these query planners are incredibly sophisticated, you have knowledge about your data, and the ways you want to use it, that they don't. Even if you trust the query planner never to pick the wrong one, adding more indexes has a cost, in storage, and in extra processing on every INSERT, UPDATE, and DELETE.

IMSoP
  • 89,526
  • 13
  • 117
  • 169