I'm having a hard time figuring this out, so let me ask. Given the following query:
select name from users where company_id = ? and creation_date > ?
Let's say we have only 2 companies, each with millions of users created at different moments, so the cardinality of creation_date
is much higher than that of company_id. Which of the following indexes is fastest, and why?
- index_a(company_id, creation_date)
- index_b(creation_date, company_id)
- index_c(creation_date)
- index_d(company_id)
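For anyone who wants to reproduce the setup, here is a minimal sketch that creates the four indexes and asks the planner how it would use each one. It uses SQLite through Python's sqlite3 module purely as a stand-in (my real database is a different engine, and its planner may behave differently); the column types and the `plan_for` helper are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in engine (assumption, not the real DB).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (name TEXT, company_id INTEGER, creation_date INTEGER);
    CREATE INDEX index_a ON users (company_id, creation_date);
    CREATE INDEX index_b ON users (creation_date, company_id);
    CREATE INDEX index_c ON users (creation_date);
    CREATE INDEX index_d ON users (company_id);
""")

def plan_for(index_name):
    """Return SQLite's plan text when forced onto one index via INDEXED BY."""
    sql = (
        "EXPLAIN QUERY PLAN SELECT name FROM users "
        f"INDEXED BY {index_name} "
        "WHERE company_id = ? AND creation_date > ?"
    )
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in conn.execute(sql, (1, 0)))

for name in ("index_a", "index_b", "index_c", "index_d"):
    print(name, "->", plan_for(name))
```

The plan text shows which columns of each index the engine can actually use to narrow the search, which is the part of the question the ranking depends on.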
Which index is faster (or are they theoretically equal)? Ignore disk space usage, unless it somehow impacts read performance. What I think:
(index_b ~= index_c) > index_a > index_d
Because in the B-tree the timestamps are grouped in a single region, so the scan would stop earlier. The company_id doesn't actually matter, because the DB would still need to use the ROWID from the index to touch the table row and fetch the name for the SELECT, so there is almost no difference between index_b and index_c. In second place comes index_a, which groups a low-cardinality value together in the B-tree, so it takes a while before the binary search benefits from narrowing the range with creation_date (which is at the tail of the index). Finally, index_d is the worst for obvious reasons (a cardinality of 2 in a million-row example).
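One rough way to test the reasoning above is to model each index as a sorted list of keys and count how many index entries a plain range scan would visit for the query. This is only a sketch under strong assumptions: the table size, the integer dates, and the `scanned` helper are all made up, and the model ignores tree height, caching, table lookups, and any engine-specific optimizations.

```python
import bisect
import random

random.seed(0)
N = 100_000  # hypothetical table size, scaled down from "millions"
# Each row is (company_id, creation_date); dates are plain integers here.
rows = [(random.choice((1, 2)), random.randrange(1_000_000)) for _ in range(N)]

def scanned(index_name, company_id, min_date):
    """Count index entries a simple range scan visits for
    company_id = ? AND creation_date > ? (simplified model)."""
    top = float("inf")
    if index_name == "index_a":      # (company_id, creation_date)
        keys = sorted(rows)
        # Seek past (company_id, min_date), stop at the end of that company.
        lo = bisect.bisect_right(keys, (company_id, min_date))
        hi = bisect.bisect_right(keys, (company_id, top))
    elif index_name == "index_b":    # (creation_date, company_id)
        keys = sorted((d, c) for c, d in rows)
        # Seek past min_date, then read to the end; company_id only filters.
        lo = bisect.bisect_right(keys, (min_date, top))
        hi = len(keys)
    elif index_name == "index_c":    # (creation_date)
        keys = sorted(d for _, d in rows)
        lo = bisect.bisect_right(keys, min_date)
        hi = len(keys)
    else:                            # index_d: (company_id)
        keys = sorted(c for c, _ in rows)
        lo = bisect.bisect_left(keys, company_id)
        hi = bisect.bisect_right(keys, company_id)
    return hi - lo

for name in ("index_a", "index_b", "index_c", "index_d"):
    print(name, scanned(name, company_id=1, min_date=900_000))
```

Comparing the counts against the number of rows that actually match the query gives a first hint of how much extra work each index implies, though a real engine may differ from this model.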
Bonus question: what if we had 10 million rows, 5 million each for company A and company B, with 7 million timestamps distributed evenly across both companies and the other 3 million having totally different timestamps? Would searches in that 7-million range be much worse than in the 3-million range?
Is that right? What am I missing?
(Great place to visualize algorithms: https://www.cs.usfca.edu/~galles/visualization/BTree.html)
P.S.: There are two conflicting answers here on Stack Overflow:
- performant ordering of keys in a MySQL compound index (WRT Rails Polymorphic associations and STI)
- For a composite index of columns of different cardinality, does order matter?