Higher cardinality column first in an index when involving a range?

Question

CREATE TABLE `files` (
  `did` int(10) unsigned NOT NULL DEFAULT '0',
  `filename` varbinary(200) NOT NULL,
  `ext` varbinary(5) DEFAULT NULL,
  `fsize` double DEFAULT NULL,
  `filetime` datetime DEFAULT NULL,
  PRIMARY KEY (`did`,`filename`),
  KEY `fe` (`filetime`,`ext`),          -- This?
  KEY `ef` (`ext`,`filetime`)           -- or This?
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;

There are a million rows in the table. The filetimes are mostly distinct. There are a finite number of ext values. So, filetimehas a high cardinality and ext has a much lower cardinality.

The query involves both ext and filetime:

WHERE ext = '...'
  AND filetime BETWEEN ... AND ...

Which of those two indexes is better? And why?

Rick James · Accepted Answer · 2021-01-06T00:19:39.703

First, let's try FORCE INDEX to pick either ef or fe. The timings are too short to get a clear picture of which is faster, but `EXPLAIN shows a difference:

Forcing the range on filetime first. (Note: The order in WHERE has no impact.)

mysql> EXPLAIN SELECT COUNT(*), AVG(fsize)
    FROM files FORCE INDEX(fe)
    WHERE ext = 'gif' AND filetime >= '2015-01-01'
                      AND filetime <  '2015-01-01' + INTERVAL 1 MONTH;
+----+-------------+-------+-------+---------------+------+---------+------+-------+-----------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                 |
+----+-------------+-------+-------+---------------+------+---------+------+-------+-----------------------+
|  1 | SIMPLE      | files | range | fe            | fe   | 14      | NULL | 16684 | Using index condition |
+----+-------------+-------+-------+---------------+------+---------+------+-------+-----------------------+

Forcing the low-cardinality ext first:

mysql> EXPLAIN SELECT COUNT(*), AVG(fsize)
    FROM files FORCE INDEX(ef)
    WHERE ext = 'gif' AND filetime >= '2015-01-01'
                      AND filetime <  '2015-01-01' + INTERVAL 1 MONTH;
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                 |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
|  1 | SIMPLE      | files | range | ef            | ef   | 14      | NULL |  538 | Using index condition |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+

Clearly, the rows says ef is better. But let's check with the Optimizer trace. The output is rather bulky; I'll show only the interesting parts. No FORCE is needed; the trace will show both options then pick the better.

             ...
             "potential_range_indices": [
                ...
                {
                  "index": "fe",
                  "usable": true,
                  "key_parts": [
                    "filetime",
                    "ext",
                    "did",
                    "filename"
                  ]
                },
                {
                  "index": "ef",
                  "usable": true,
                  "key_parts": [
                    "ext",
                    "filetime",
                    "did",
                    "filename"
                  ]
                }
              ],

...

              "analyzing_range_alternatives": {
                "range_scan_alternatives": [
                  {
                    "index": "fe",
                    "ranges": [
                      "2015-01-01 00:00:00 <= filetime < 2015-02-01 00:00:00"
                    ],
                    "index_dives_for_eq_ranges": true,
                    "rowid_ordered": false,
                    "using_mrr": false,
                    "index_only": false,
                    "rows": 16684,
                    "cost": 20022,               <-- Here's the critical number
                    "chosen": true
                  },
                  {
                    "index": "ef",
                    "ranges": [
                      "gif <= ext <= gif AND 2015-01-01 00:00:00 <= filetime < 2015-02-01 00:00:00"
                    ],
                    "index_dives_for_eq_ranges": true,
                    "rowid_ordered": false,
                    "using_mrr": false,
                    "index_only": false,
                    "rows": 538,
                    "cost": 646.61,               <-- Here's the critical number
                    "chosen": true
                  }
                ],

...

          "attached_conditions_computation": [
            {
              "access_type_changed": {
                "table": "`files`",
                "index": "ef",
                "old_type": "ref",
                "new_type": "range",
                "cause": "uses_more_keyparts"   <-- Also interesting
              }
            }

With fe (range column first), the range could be used, but it estimated scanning through 16684 rows fishing for ext='gif'.

With ef (low cardinality ext first), it could use both columns of the index and drill down more efficiently in the BTree. Then it found an estimated 538 rows, all of which are useful for the query -- no further filtering needed.

Conclusions:

INDEX(filetime, ext) used only the first column.
INDEX(ext, filetime) used both columns.
Put columns involved in = tests first in the index regardless of cardinality.
The query plan won't go beyond the first 'range' column.
"Cardinality" is irrelevant for composite indexes and this type of query.

("Using index condition" means that the Storage Engine (InnoDB) will use columns of the index beyond the one used for filtering.)

The relevant part from the documentation: "The optimizer attempts to use additional key parts to determine the interval as long as the comparison operator is `=`, `<=>`, or `IS NULL`. If the operator is `>`, `<`, `>=`, `<=`, `!=`, `<>`, `BETWEEN`, or `LIKE`, the optimizer uses it but considers no more key parts." ([Range Optimization](https://dev.mysql.com/doc/refman/8.0/en/range-optimization.html#range-access-multi-part)) — Paul Spiegel, May 08 '18 at 18:43
Nice benchmark! You said: "it could use both columns of the index and drill down more efficiently in the BTree". Do you know why? Because algorithmic speaking, MySQL could (and should) use both columns anyway, right? Do you recommend any book about the subject (inner workings of queries/indexes)? — Israel Fonseca, Jul 09 '18 at 13:21
@IsraelFonseca - Think of it this way -- the two columns could be concatenated together to make a single column. Poof, the discussion of cardinality vanished, yet the structure and speed of the BTree is virtually identical. — Rick James, Jul 09 '18 at 15:49
@RickJames I'm kinda new to query optimization on MySQL, how does you get the "Optimizer trace" on MySQL? I wish to see the " cost" value on my queries. — ogabriel, May 05 '21 at 21:55
https://dev.mysql.com/doc/internals/en/optimizer-tracing.html — Rick James, May 06 '21 at 01:25
A more concise answer to turning on Optimizer trace: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#optimizer_trace — Rick James, Oct 12 '21 at 20:02

Higher cardinality column first in an index when involving a range?

1 Answers1

Linked