2

I'm having a hard time to figure out, so let me ask you. Given the following query:

select name from users where company_id = ? and creation_date > ?

Let's say that we have only 2 companies and we have millions of users of each company created in different moments. So the cardinality of creation_date is much higher. Which of the following indexes are faster, and why?

  1. index_a(company_id, creation_date)
  2. index_b(creation_date, company_id)
  3. index_c(creation_date)
  4. index_d(company_id)

Which index is faster (or theoratically equal)? Ignore disk space usage, unless that somehow impacts read performance. What I think:

(index_b ~= index_c) > index_a > index_d Because in the Btree the "timestamp" will be grouped in a single region, so the fetching would stop earlier. The company_id doesn't actually matter because the DB it would need to use the ROWID from the index to touch the table row to fetch the name for the SELECT. Almost no diference. In second place comes index_a which "groups" a low cardinality value together in the BTREE, so it takes some time to the "b-search" show its value by limiting the scope of search with the creation_date (which is in the "tail" of the index). And finally index_d is the worse by obvious reasons (cardinality of 2 in a million rows example).

Bônus Question: What if we had 10kk rows, 5kk for company A and Company B and 7kk timestamps distributed evenly for both companies and other 3kk totally different timestamps. Would searches in that 7kk range be much worse than the 3kk range?

Is that right? What am I missing?

(Great place to visualize algorithms: https://www.cs.usfca.edu/~galles/visualization/BTree.html)

P.S: There are two conflicting answer here in StackOverflow:

performant ordering of keys in a MySQL compound index (WRT Rails Polymorphic associations and STI)

For a composite index of columns of different cardinality, does order matter?

Israel Fonseca
  • 1,082
  • 10
  • 22
  • 1
    This will be helpful https://www.percona.com/blog/2009/06/05/a-rule-of-thumb-for-choosing-column-order-in-indexes/ – Amit Kumar Jul 06 '18 at 06:45
  • 1
    Cardinality of individual columns in a composite index does _not_ matter. See also my [_Index Cookbook_](http://mysql.rjweb.org/doc.php/index_cookbook_mysql) – Rick James Jul 06 '18 at 16:34
  • @RickJames, interesting (I also read your https://stackoverflow.com/questions/50239658/higher-cardinality-column-first-in-an-index-when-involving-a-range). So for simple equality it doesn't matter, but when we have range in the index it should come last? Is that a particularity for MySQL? I'm having a hard time to figure out why the search in the BTree would be affected by that. So for this example, `index_b` and `index_c` would be at the same speed? – Israel Fonseca Jul 09 '18 at 13:10
  • @IsraelFonseca - b and c are virtually identical. Think of `INDEX(lastname, firstname)` and `WHERE lastname LIKE 'F%' AND firstname = 'Israel'`. Better yet, think of a paper list of a million names sorted by lastname, then firstname and you are looking for "Israel F." You would end up looking through all the Fs. – Rick James Jul 09 '18 at 15:56
  • @IsraelFonseca - Not particular to MySQL, particular to BTrees. And there are few, if any, other index structures that would come close to the efficiency of a BTree for _this_ example (a mixture of `=` and 'range'). For all `=` (no range), "Hash" is about as good as BTree. But Hash is useless for ranges. I think the MySQL inventors said: "why bother with Hash; BTree is just as good, plus it does more". – Rick James Jul 09 '18 at 16:02

1 Answers1

2

For that particular query, index_a should be fastest, because the results correspond exactly to a range in the index.

Using index_b or index_c is slower. You have to get the range of valid dates and then filter out rows with the wrong company ID. Of the two, index_c is slower because you have to touch the rows you filter out.

Using index_d is slowest. You can find a range for the company ID, but then have to scan all the rows for matching dates.

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87
  • But `index_b` also correspond exactly (same two columns that are used in the where clause), is that right? – Israel Fonseca Jul 06 '18 at 13:29
  • Same two columns, but less efficient order. An index is a total ordering, and in index B the rows are ordered like (00:00:00,A), (00:00:00,B), (00:00:01,A), (00:00:01,B), etc. You can select a range by date, but it will have both companies in it. – Matt Timmermans Jul 06 '18 at 14:34
  • Even when my "Bônus Question" states that the cardinality of the timestamp is much higher than the companies? That doesn't matter? – Israel Fonseca Jul 09 '18 at 12:53
  • 1
    @IsraelFonseca - Cardinality of the _components_ of a composite key does not matter. When comparing two _separate_ index choices, cardinality does matter. `INDEX(timestamp)` with `WHERE timestamp=...` is much better than `INDEX(company)` with `WHERE company=...` -- due to cardinality. But so what? They get different resultsets. OK, it would matter -- _if those were the only two indexes_. – Rick James Jul 09 '18 at 16:08
  • @RickJames, I've done a real life test now: `index(user_id, date, type)` vs `index(date, user_id, type)`, for a query that used those 3 columns in the WHERE restriction (date >= 6 month ago AND user_id in (ids) and type in (types)). Interesting enough, the index with `user_id` at first was much faster, even with lower cardinality. I forced the index usage, for testing purpose. Is that what you expected? I tested it on a 100 million rows table. – Israel Fonseca Aug 29 '18 at 22:37
  • 1
    @IsraelFonseca -- Since all three columns are used in something more complex than `=`, only the first column will be used very much for filtering. It _may_ be that the Optimizer will 'leapfrog' (cf MRR) through the table (or index). If so, then starting with an `IN` column is best (as you did). – Rick James Aug 30 '18 at 00:19