4

My db schema consists of the following two tables:

CREATE TABLE `categories` (
  `id` bigint(20) NOT NULL auto_increment,
  `title` varchar(128) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

and

CREATE TABLE `articles` (
  `id` bigint(20) NOT NULL auto_increment,
  `title` varchar(512) NOT NULL,
  `body` longtext,
  `state` varchar(7) NOT NULL,
  `type` varchar(6) NOT NULL,
  `category` bigint(20) default NULL,
  `publishedAt` datetime default NULL,
  PRIMARY KEY  (`id`),
  KEY `FK_category_to_article_category` (`category`),
  CONSTRAINT `FK_category_to_article_category` FOREIGN KEY (`category`) REFERENCES `categories` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

For articles table, state column has values like "PUBLISHED" or "UNPUBLISHED" and type column has values like "NEWS", "GOSSIP" and "OPINION".

My application performs a lot of queries like this:

select * from articles where state="PUBLISHED" and type in ("NEWS","GOSSIP") 
and category in (4) and publishedAt<=now() order by publishedAt desc;

I have ~10K articles and I am trying to determine whether the query above performs better with the default foreign key on category, or I should use a multi-column index instead.

Without an index (using "explain extended" ):

+----+-------------+-------+------+---------------------------------+---------------------------------+---------+-------+------+-----------------------------+
| id | select_type | table | type | possible_keys                   | key                             | key_len | ref   | rows | Extra                       |
+----+-------------+-------+------+---------------------------------+---------------------------------+---------+-------+------+-----------------------------+
|  1 | SIMPLE      | this_ | ref  | FK_category_to_article_category | FK_category_to_article_category | 9       | const |  630 | Using where; Using filesort |
+----+-------------+-------+------+---------------------------------+---------------------------------+---------+-------+------+-----------------------------+

If I create the multi-column index and explain again (forcing the specific index):

create index I_s_t_c_p on articles (state, type, category, publishedAt);


+----+-------------+-------+-------+---------------+-----------+---------+------+------+------------------------------------------+
| id | select_type | table | type  | possible_keys | key       | key_len | ref  | rows | Extra                                    |
+----+-------------+-------+-------+---------------+-----------+---------+------+------+------------------------------------------+
|  1 | SIMPLE      | this_ | range | I_s_t_c_p     | I_s_t_c_p | 61      | NULL | 1216 | Using where; Using index; Using filesort |
+----+-------------+-------+-------+---------------+-----------+---------+------+------+------------------------------------------+

The number of rows the query actually returns is 630. It seems to me that the multi-column index should perform better than the FK since all indexed columns are used, but the fact that ~1200 rows are examined when using the index confuses me. I know that these numbers are just estimations, but the difference between the two keys is pretty big; with the combined index, we have the double amount of rows examined.

So my questions are the following:

  1. Why are so many rows examined with the multi-column index?
  2. Since using an FK we have a join type "ref" and using the combined index we have a join type "range", does this mean that the query that uses the FK is better/faster than the other one?
  3. Should I use the estimation for the number of rows examined as a criteria to decide if an index is good/optimal?
  4. In this use-case, is the multi-column index better that the FK? On what basis should i make the decision?

Some additional information:

  • Without forcing an index on the query, optimizer chose the FK. When I performed an analyze table on articles, the multi-column index was chosen instead.
  • I am using MySql 5.0.15
  • index information

+----------+------------+---------------------------------+--------------+-------------+-------------+------------+
| Table    | Non_unique | Key_name                        | Seq_in_index | Column_name | Cardinality | Index_type |
+----------+------------+---------------------------------+--------------+-------------+-------------+------------+
| articles |          0 | PRIMARY                         |            1 | id          |       12561 | BTREE      |
| articles |          1 | FK_category_to_article_category |            1 | category    |          37 | BTREE      |
| articles |          1 | I_s_t_c_p                       |            1 | state       |           8 | BTREE      |
| articles |          1 | I_s_t_c_p                       |            2 | type        |          32 | BTREE      |
| articles |          1 | I_s_t_c_p                       |            3 | category    |         163 | BTREE      |
| articles |          1 | I_s_t_c_p                       |            4 | publishedAt |       12561 | BTREE      |
+----------+------------+---------------------------------+--------------+-------------+-------------+------------+

Thanks in advance.

Argyro Kazaki
  • 631
  • 2
  • 6
  • 15
  • Idea: if `state` and `type` can only take a bounded set of values, you could make those an ENUM or some other integral type, which would be faster to compare than a `VARCHAR`. – Kerrek SB Jul 14 '11 at 16:29
  • Order matters in multi key indexes. Try (publishedAt,category,type,state) for your index. – bot403 Jul 14 '11 at 16:32
  • @bot403: As I replied to Mchl (see below), `publishedAt` seems unnecessary, but the index order with the category column first, gives worse results than the one with the state first. Could this be related to the amount of articles? Maybe with ~50K articles, an index with the category first could display better results? – Argyro Kazaki Jul 15 '11 at 13:13
  • @Kerrek SB: I was not aware of the the mysql ENUM type, so thanks for the insight. However I am using hibernate and it seems that mysql ENUM is not well supported (http://stackoverflow.com/questions/2160700/enum-in-hibernate-persisting-as-an-enum). Plus from another discussion (http://stackoverflow.com/questions/766299/mysql-enum-performance-advantage) it seems that even if it will help the performance of my select query, I will slow down any inserts and updates. – Argyro Kazaki Jul 15 '11 at 13:30

1 Answers1

2

As you can see index on publishedAt has the same cardinality as PK. This doesn't really help. I would try to create a compound index with columns in that order (category,type,state). This way, the first part of the index is the most selective.

Mchl
  • 61,444
  • 9
  • 118
  • 120
  • Exactly what I would have done too. Base the index on whatever first element would have the smallest entries returned based on a commonly used criteria condition. – DRapp Jul 14 '11 at 17:06
  • You are right about the `publishedAt`. Removing it from the index, immediately reduced the number of rows examined to half - same as with the FK. Plus testing it in ~20K articles, the index performs better than the FK - which is the logical assumption. However, the index order you suggest -`(category,type,state)`- seems to perform same or worse then the `(state, type, category, publishedAt)` index. I understand that "the first part of the index should be the most selective", but I am confused because I see worse results. Could it be because `state` is a single equality constraint? – Argyro Kazaki Jul 15 '11 at 13:08
  • 1
    Unfortunately index selectivity is only a clue as to how index would perform. If your data is not evenly distributed among categories, it might not perform all that well. – Mchl Jul 15 '11 at 13:58