
I'm struggling to set up a proper, effective index for my Django application, which uses a MySQL database. The problem concerns the article table, which currently has a little more than 1 million rows, and querying isn't as fast as we want.

The article table structure looks more or less like this:

Field   Type
id  int
date_published  datetime(6) 
date_retrieved  datetime(6) 
title   varchar(500)    
author  varchar(200)    
content longtext    
source_id   int
online  tinyint(1)
main_article_of_duplicate_group tinyint(1)

After many tries I found that the index below gives the best performance:

CREATE INDEX search_index ON newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online);

And the problematic query is:

SELECT 
    `newsarticle`.`id`,
    `newsarticle`.`url`,
    `newsarticle`.`date_published`,
    `newsarticle`.`date_retrieved`,
    `newsarticle`.`title`,
    `newsarticle`.`summary_provided`,
    `newsarticle`.`summary_generated`,
    `newsarticle`.`source_id`,
    COUNT(CASE WHEN `newsarticlefeedback`.`is_relevant` THEN `newsarticlefeedback`.`id` ELSE NULL END) AS `count_relevent`,
    COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`,
    (
      SELECT U0.`is_relevant`
      FROM `newsarticlefeedback` U0
      WHERE (U0.`news_id_id` = `newsarticle`.`id` AND U0.`user_id_id` = 27)
      ORDER BY U0.`created_date` DESC
      LIMIT 1
    ) AS `is_relevant`,
    CASE
        WHEN `newsarticle`.`content` = '' THEN 0
        ELSE 1
    END AS `is_content`,
    `newsproviders_newsprovider`.`id`,
    `newsproviders_newsprovider`.`name_long`
FROM
    `newsarticle` USE INDEX (SEARCH_INDEX)
        INNER JOIN
    `newsarticle_topics` ON (`newsarticle`.`id` = `newsarticle_topics`.`newsarticle_id`)
        LEFT OUTER JOIN
    `newsarticlefeedback` ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`)
        LEFT OUTER JOIN
    `newsproviders_newsprovider` ON (`newsarticle`.`source_id` = `newsproviders_newsprovider`.`id`)
WHERE
    ((1)
        AND `newsarticle`.`main_article_of_duplicate_group`
        AND `newsarticle`.`online`
        AND `newsarticle_topics`.`newstopic_id` = 42
        AND `newsarticle`.`date_published` >= '2020-08-08 08:39:03.199488')
GROUP BY `newsarticle`.`id`
ORDER BY `newsarticle`.`date_published` DESC
LIMIT 30

NOTE: I have to use the index explicitly, otherwise the query is much slower. This query takes about 1.4s.

But when I remove just the GROUP BY statement, the query takes an acceptable 1-10ms. I tried adding the newsarticle id to the index at different positions, but without luck.

This is output from EXPLAIN (from Django):

ID  SELECT_TYPE TABLE   PARTITIONS  TYPE    POSSIBLE_KEYS   KEY KEY_LEN REF ROWS    FILTERED    EXTRA
1   PRIMARY newsarticle_topics  None    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk_summarize   newsartic_newstopic_id_ddd996b6_fk_summarize    4   const   312628  100.0   Using temporary; Using filesort
1   PRIMARY newsarticle None    eq_ref  PRIMARY,newsartic_source_id_6ea2b978_fk_summarize,newsartic_topic_id_b67ae2c9_fk_summarize,kek,last_updated,last_update,search_index,fulltext_idx_content   PRIMARY 4   newstech.newsarticle_topics.newsarticle_id  1   22.69   Using where
1   PRIMARY newsarticlefeedback None    ref newsartic_news_id_id_5af7594b_fk_summarize  newsartic_news_id_id_5af7594b_fk_summarize  5   newstech.newsarticle_topics.newsarticle_id  1   100.0   None
1   PRIMARY newsproviders_newsprovider  None    eq_ref  PRIMARY,    PRIMARY 4   newstech.newsarticle.source_id  1   100.0   None
2   DEPENDENT SUBQUERY  U0  None    ref newsartic_news_id_id_5af7594b_fk_summarize,newsartic_user_id_id_fc217cfe_fk_auth_user   newsartic_user_id_id_fc217cfe_fk_auth_user  5   const   1   10.0    Using where; Using filesort

Interestingly, the same query gives a different EXPLAIN in MySQL Workbench than in the Django debug toolbar (if you want, I can paste the EXPLAIN from Workbench as well), but the performance is more or less the same. Do you have an idea how to improve the index so it can search quickly?

Thanks

EDIT: Here is the EXPLAIN from MySQL Workbench, which is different but seems more realistic (I'm not sure why the Django debug toolbar explains it differently):

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle NULL    range   PRIMARY,newsartic_source_id_6ea2b978_fk_,newsartic_topic_id_b67ae2c9_fk,kek,last_updated,last_update,search_index,fulltext_idx_content  search_index    8   NULL    227426  81.00   Using index condition; Using MRR; Using temporary; Using filesort
1   PRIMARY newsarticle_topics  NULL    eq_ref  newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq    8   newstech.newsarticle.id,const   1   100.00  Using index
1   PRIMARY newsarticlefeedback NULL    ref newsartic_news_id_id_5af7594b_fk    newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   100.00  NULL
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,newsartic_user_id_id_fc217cfe_fk_auth_user newsartic_user_id_id_fc217cfe_fk_auth_user  5   const   1   10.00   Using where; Using filesort

EDIT2: Below is the EXPLAIN when I remove GROUP BY from the query (from MySQL Workbench):

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   SIMPLE  newsarticle NULL    range   search_index    search_index    8   NULL    227426  81.00   Using index condition
1   SIMPLE  newsarticle_topics  NULL    eq_ref  newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk    newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq    8   newstech.newsarticle.id,const   1   100.00  Using index
1   SIMPLE  newsarticlefeedback NULL    ref newsartic_news_id_id_5af7594b_fk    newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   100.00  Using index
1   SIMPLE  newsproviders_newsprovider  NULL    eq_ref  PRIMARY,    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL

EDIT3:

After applying changes suggested by Rick (Thanks!):

newsarticle(id, online, main_article_of_duplicate_group, date_published), plus two indexes for newsarticle_topics: (newstopic_id, newsarticle_id) and (newsarticle_id, newstopic_id).
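
In DDL terms this is roughly the following (a sketch only; index names other than `search_index` and `opposite` are placeholders for whatever the real schema uses):

    -- sketch only: names besides search_index and opposite are placeholders
    CREATE INDEX search_index     ON newsarticle (id, online, main_article_of_duplicate_group, date_published);
    CREATE INDEX opposite         ON newsarticle_topics (newstopic_id, newsarticle_id);
    -- the (newsarticle_id, newstopic_id) direction may already be covered by
    -- the existing unique index on that pair
    CREATE INDEX article_to_topic ON newsarticle_topics (newsarticle_id, newstopic_id);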

WITH USE INDEX (takes 1.2s)

EXPLAIN:

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle_topics  NULL    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   opposite    4   const   346286  100.00  Using index; Using temporary; Using filesort
1   PRIMARY newsarticle NULL    ref search_index    search_index    4   newstech.newsarticle_topics.newsarticle_id  1   27.00   Using index condition
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   10.00   Using where
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   90.00   Using where

WITHOUT USE INDEX clause (takes 2.6s)

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle_topics  NULL    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   opposite    4   const   346286  100.00  Using index; Using temporary; Using filesort
1   PRIMARY newsarticle NULL    eq_ref  PRIMARY,search_index    PRIMARY 4   newstech.newsarticle_topics.newsarticle_id  1   27.00   Using where
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   10.00   Using where
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   90.00   Using where

For comparison, the index newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online) with USE INDEX (takes only 1-3ms!):

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle NULL    range   search_index    search_index    8   NULL    238876  81.00   Using index condition
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
1   PRIMARY newsarticle_topics  NULL    eq_ref  newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq    8   newstech.newsarticle.id,const   1   100.00  Using index
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  6   newstech.newsarticle.id,const   1   100.00  Using index
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   90.00   Using where; Using index
Bob
  • Is the query without group by only fast because you’re only fetching the first 30 results? How fast is it to get all the results? – Andrew Sayer Oct 31 '20 at 22:18
  • @AndrewSayer If I remove the LIMIT clause as well, then "Duration" is still a few milliseconds but "Fetch" takes around 5 seconds (returns ~35k rows). – Bob Oct 31 '20 at 22:53
  • The first row in the plan indicates it's searching the newsarticle_topics table and looking at over 300k rows, using a temp file. This will be slow. How is that table defined and indexed? – schtever Oct 31 '20 at 23:18
  • A `group by` needs to sort the results to perform the aggregation. What is the explain without the `GROUP BY`? `date_published` should be last in your `search_index` index. Avoid forcing indexes. – danblack Oct 31 '20 at 23:22
  • @schtever yeah... the EXPLAIN from django-debug-toolbar looks weird... I've just added the EXPLAIN from MySQL Workbench for the same query to the main question. That explain seems more realistic (at least to me). Performance in Workbench is the same. In the question I described what the table looks like and which index is being used. There is also a fulltext index on the content field. – Bob Oct 31 '20 at 23:41
  • @danblack Thanks. I tried to put `date_published` at the end but it was slower. I feel that forcing index is not best practice, but I'm not sure how to set this index so the optimizer will choose it automatically. Under EDIT2 in main question I put explain without `GROUP BY`. – Bob Oct 31 '20 at 23:51
  • Your current query should be failing, please edit and fix it - you are not grouping by some columns in `newsarticle` that you are not aggregating. In terms of indexing, you need to follow the Golden Rule of Indexing: as you add columns from left to right, they can only be used to effectively access the index if the column to the left is filtered with an equality condition. This means a more sensible one would lead with `newstopic_id` then `date_published`. Your `not null` filtered columns could be included later if they are actually useful in reducing the amount of data read from the table. – Andrew Sayer Nov 01 '20 at 00:04
  • @AndrewSayer thanks for your reply. The query works (I just tested it). Django adds GROUP BY statement for getting `count_relevant`, `count_nonrelevant` and `is_relevant`. Relation `newsarticle`<=>`newstopic` is many-to-many hence there is no `topic_id` column in `newsarticle` table to add to index. – Bob Nov 01 '20 at 10:57
  • `is_relevant` isn't an aggregation so it would either need to group by that scalar subquery too or if it's clever enough just `newsarticle.id`. Ah, I misread where that `newstopic_id` filter was coming from. Is it providing decent selectivity, e.g would it be a good idea to start by getting all the rows in `newsarticle_topics` that matches this filter then joining back to `newsarticle` using `newsarticle_id`? Are you likely to find less rows in `newsarticle` like that than your date filter? Or maybe filters are going to find lots of rows on their own but combined they really cut it down? – Andrew Sayer Nov 01 '20 at 11:10
  • I've checked it: - without `date_published` filter I receive ~172k rows. - without filtering by topic id I receive ~105k rows. Yes, thanks to combined filters we can reduce no. of rows to ~35k. Also when I run the same query as in question but without `LIMIT`, it takes exactly the same time compared to query with `LIMIT`. – Bob Nov 01 '20 at 11:48

3 Answers


Is main_article_of_duplicate_group a true/false flag?

If the Optimizer chooses to start with newsarticle_topics:

 newsarticle_topics:  INDEX(newstopic_id, newsarticle_id)
 newsarticle:  INDEX(newsarticle_id, online,
                     main_article_of_duplicate_group, date_published)

If newsarticle_topics is a many-to-many mapping table, get rid of id and make the PRIMARY KEY be that pair, plus a secondary index in the opposite direction. More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
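
For example, something like this (a sketch only; column types are guesses, so adapt them to the real schema):

    -- mapping table without a surrogate id: the pair is the PRIMARY KEY,
    -- plus a secondary index covering the opposite direction
    CREATE TABLE newsarticle_topics (
        newsarticle_id INT NOT NULL,
        newstopic_id   INT NOT NULL,
        PRIMARY KEY (newstopic_id, newsarticle_id),
        INDEX (newsarticle_id, newstopic_id)
    ) ENGINE=InnoDB;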

If the Optimizer chooses to start with newsarticle (which seems more likely):

 newsarticle_topics:  INDEX(newsarticle_id, newstopic_id)
 newsarticle:  INDEX(online, main_article_of_duplicate_group, date_published)

Meanwhile, newsarticlefeedback needs this, in the order given:

INDEX(news_id_id, user_id_id, created_date, is_relevant)

Instead of

    COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`,
    LEFT OUTER JOIN  `newsarticlefeedback`
          ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`)

have

    ( SELECT COUNT(*) FROM newsarticlefeedback
          WHERE `newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`
    ) AS `count_nonrelevent`,
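
Presumably `count_relevent` could be reformulated the same way (a sketch, reusing the column names from the query above):

    ( SELECT COUNT(*) FROM newsarticlefeedback
          WHERE `newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`
            AND `newsarticlefeedback`.`is_relevant`
    ) AS `count_relevent`,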
Rick James
  • Yes, `main_article_of_duplicate_group` is a boolean flag. I applied the indexes for newsarticle and newsarticle_topics; `newsarticle_topics` comes first for the optimizer after I removed the USE INDEX clause -> and it's really good, except for topics that have a huge number of related articles (takes ~2s). `1 PRIMARY newsarticle_topics None ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite opposite 4 const 346286 100.0 Using index; Using temporary; Using filesort` `1 PRIMARY newsarticle None eq_ref PRIMARY,search_index PRIMARY 4 newstech.newsarticle_topics.newsarticle_id 1 27.0 Using where` – Bob Nov 02 '20 at 15:59
  • Unfortunately Django doesn't support composite key for primary key. And yes, the biggest performance boost gives change like you suggested in `newsarticlefeedback` table :) – Bob Nov 02 '20 at 16:01
  • @Bob - Thanks for the feedback. Which thing gave the boost for newsarticlefeedback? The composite index or the reformulation (or both)? – Rick James Nov 02 '20 at 16:18
  • The major impact came from the reformulation because it removes `GROUP BY`, but the index also helped :) Changing only these two things while keeping my old `use index` clause and old indexes works quite fast; do you think it is possible to somehow adjust your proposal (indexes for newsarticle & newstopic) to get rid of `use index`? Thank you – Bob Nov 02 '20 at 16:30
  • @Bob - Quite possibly. It is 'trivial' to test -- Run `EXPLAIN SELECT ...` with and without the `USE INDEX()`. If the outputs are the same, the `USE` should be removed. If they are different, please show both; I may (or may not) have further advice. – Rick James Nov 02 '20 at 16:34
  • Thanks. So without `USE INDEX` the only difference is that it uses the `PRIMARY` key instead of the index we specified (id, online, main_article_of_duplicate_group, date_published). EXPLAIN for the second line, which is the one that differs: `1 PRIMARY newsarticle NULL eq_ref PRIMARY,search_index PRIMARY 4 newstech.newsarticle_topics.newsarticle_id 1 27.00 Using where` – Bob Nov 02 '20 at 17:20
  • Any chance you could add both EXPLAINs to your Question. Indent each row 4 characters to get formatting. – Rick James Nov 02 '20 at 17:34
  • Yes, sorry. I've just updated the question under "EDIT3". There are two explains after applying the indexes you suggested (with and without the `use index` clause), and I also included my own index, which is really performant but needs the `USE INDEX` clause. – Bob Nov 02 '20 at 18:02
  • @Bob - You may have found a rare case where `USE INDEX` is "necessary". But, caution: changing the 42 to a number that is more (or less) selective may cause the `USE INDEX` to be in the way of optimal performance. – Rick James Nov 02 '20 at 20:04

I happen to have a technique that works well with "news articles" that are categorized, filtered, and ordered by date. It even handles "embargo", "expired", soft "deleted", etc.

The big goal is to touch only 30 rows when performing

ORDER BY `newsarticle`.`date_published` DESC
LIMIT 30

But currently the WHERE clause must look at two tables to do the filtering. And that leads to touching 35K, or probably more, rows.

It requires building a simple table on the side that has 3 columns:

  • topic (or other filtering category),
  • date (for fetching only the latest 30),
  • article_id (for doing only 30 JOINs to get the rest of the article info)

Suitable indexing on that table makes the search very efficient.

With suitable DELETEs in this table, simple flags like online or main_article can be efficiently handled. Do not include flags in this extra table; instead do not include any rows that should not be shown.
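
A minimal sketch of such a side table and the latest-30 lookup (table and column names here are made up for illustration; see the link below for the full technique):

    -- one row per (topic, article) that should be visible; the PK keeps the
    -- newest rows per topic adjacent in the index
    CREATE TABLE latest_articles (
        newstopic_id   INT NOT NULL,
        date_published DATETIME(6) NOT NULL,
        newsarticle_id INT NOT NULL,
        PRIMARY KEY (newstopic_id, date_published, newsarticle_id)
    ) ENGINE=InnoDB;

    -- grab the 30 newest ids for a topic straight from the index,
    -- then join back to newsarticle for the remaining columns
    SELECT  a.*
    FROM  ( SELECT newsarticle_id, date_published
              FROM latest_articles
              WHERE newstopic_id = 42
              ORDER BY date_published DESC
              LIMIT 30 ) AS la
    JOIN  newsarticle AS a  ON a.id = la.newsarticle_id
    ORDER BY la.date_published DESC;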

More details: http://mysql.rjweb.org/doc.php/lists

(I have watched other "news" sites melt down from not using this technique.)

Note that the difference between 30 and 35K is about 1000x.

Rick James
  • Hmm it's very interesting. I will try it. Thanks for spending your time helping me here :) – Bob Nov 03 '20 at 15:04
  • Hey again, I've just wanted to let you know that your technique is really great and works super fast. Many thanks! – Bob Nov 03 '20 at 17:58
  • @Bob - Thanks for the feedback. You implemented it rapidly. I guess that also says that my document was clear. – Rick James Nov 04 '20 at 00:05

Finally, I figured out what the problem with this query was. In Django, a GROUP BY statement is added automatically when you use Count in an annotation, so the easiest solution was to avoid it by using nested annotations.

This is well explained in the answer here https://stackoverflow.com/a/43771738/4464554
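
For reference, the reshaped SQL looks roughly like this (a sketch with the counts moved into correlated subqueries so no GROUP BY is needed; the exact query Django generates will differ):

    SELECT
        newsarticle.id,
        newsarticle.date_published,
        -- counts come from correlated subqueries, so the LEFT JOIN on
        -- newsarticlefeedback and the GROUP BY both disappear
        ( SELECT COUNT(*) FROM newsarticlefeedback nf
            WHERE nf.news_id_id = newsarticle.id AND nf.is_relevant ) AS count_relevent,
        ( SELECT COUNT(*) FROM newsarticlefeedback nf
            WHERE nf.news_id_id = newsarticle.id )                    AS count_nonrelevent
    FROM newsarticle
    INNER JOIN newsarticle_topics
        ON newsarticle.id = newsarticle_topics.newsarticle_id
    WHERE newsarticle.main_article_of_duplicate_group
      AND newsarticle.online
      AND newsarticle_topics.newstopic_id = 42
      AND newsarticle.date_published >= '2020-08-08 08:39:03.199488'
    ORDER BY newsarticle.date_published DESC
    LIMIT 30;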

Thanks everyone for your time and help :)

Bob