2

Background

  • users table has 2k rows
  • relationships table has 1.5 million rows
  • posts table has 2 million rows
  • using mysql version 5.7.34

Structure for users:

CREATE TABLE `users` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `email` varchar(255) NOT NULL DEFAULT '',
  `first_name` varchar(255) NOT NULL DEFAULT '',
  `last_name` varchar(255) NOT NULL DEFAULT '',
  `password` varchar(255) NOT NULL DEFAULT '',
  `active` tinyint(1) NOT NULL,
  `created_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `email` (`email`)
) ENGINE=InnoDB AUTO_INCREMENT=3263 DEFAULT CHARSET=utf8

Structure for relationships:

CREATE TABLE `relationships` (
  `user_id` int(11) unsigned NOT NULL,
  `is_following_user_id` int(11) unsigned NOT NULL,
  `created_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  UNIQUE KEY `user_id` (`user_id`,`is_following_user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Structure for posts:

CREATE TABLE `posts` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` int(11) unsigned NOT NULL,
  `parent_post_id` int(11) DEFAULT NULL,
  `content` varchar(255) DEFAULT '',
  `created_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `user_id` (`user_id`),
  CONSTRAINT `users` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2412061 DEFAULT CHARSET=utf8

NOTE: User 922 has no relationships or posts, therefore the queries are doing a full index and/or table scan as needed.


This query takes 0.5ms:

# 0.5ms
select * from posts where user_id in (
    select id from users inner join relationships
        on users.id = relationships.is_following_user_id
        where relationships.user_id = 922
);

Explain output for above fast query: MySQL explain output for fast query

This query takes 500ms:

# 500ms
select * from posts where user_id in (
    select id from users inner join relationships
        on users.id = relationships.is_following_user_id
        where relationships.user_id = 922
    )
or user_id = 922;

Explain output for above slow query: MySQL explain output for slow query

Clearly for the second query it has identified the same index as the first query (users.user_id), but in the second query, as per the explain output, it's specifically avoiding using it (key = NULL).

This query takes 2.3 seconds:

# 2.3 seconds
select * from posts where user_id in (
    select id from users inner join relationships
        on users.id = relationships.is_following_user_id
        where relationships.user_id = 922
    union all
    select 922
);

Explain output for above super slow query: enter image description here

Questions:

  1. Why is query #2 not using the users.user_id index like query #1?
  2. Why is query #3 so much slower, and also not using the users.user_id index?
Milos Ivanovic
  • 592
  • 3
  • 9
  • 21
  • Related: https://stackoverflow.com/questions/13894064/mysql-or-operator-not-using-index, https://stackoverflow.com/questions/38113747/in-clause-not-using-index – raina77ow Jun 20 '21 at 06:24
  • For the sake of completeness I'd suggest trying a 4th query, unioning results of two `select * from posts`: one with where on subquery check, another for `where user_id = 922` check. – raina77ow Jun 20 '21 at 06:26
  • 1
    Aside from your question, why dont you simplify your first query as `SELECT * FROM posts INNER JOIN relationships ON posts.user_id = relationships.is_following_user_id WHERE relationships.user_id = 922` – Indra Kumar S Jun 20 '21 at 06:40
  • As an aside, (user_id,is_following_user_id) is primary – Strawberry Jun 20 '21 at 07:57
  • The bottom line is that MySQL's Optimizer does a crappy job of handling `IN ( SELECT ... )`; avoid it! Almost always reformulation as a `JOIN` helps. (Sometimes `EXISTS( SELECT ... )` can work well.) – Rick James Jun 20 '21 at 22:20
  • Also, before 5.7.3, `UNION` always involved creating a temp table; now it avoids the temp table in _some_ cases. – Rick James Jun 20 '21 at 22:22

3 Answers3

1

In general, the answers to your questions are that the query optimizer will do its best but only looks for a limited number of special cases, and those typically don't include merging different sources of key values, and often do include trying to convert parts of a subquery to joins, sometimes to the detriment of efficiency.

You can likely force it to do what you want with:

select straight_join p.*
from (
    select id from users inner join relationships on users.id = relationships.is_following_user_id where relationships.user_id = 922
    union all
    select 922
) ids
join posts p on p.user_id=ids.id
ysth
  • 96,171
  • 6
  • 121
  • 214
  • 1
    `You can likely force it` ... but in general forcing the use of an index isn't best practice (in most situations). – Tim Biegeleisen Jun 20 '21 at 06:30
  • @TimBiegeleisen with that sentiment, you might be better off with postgres. I always have a query plan in mind when I write sql, balancing what I know to be possible best and worst cases, and do whatever I can to force that. and yes, sometimes need to repeat that on a version upgrade – ysth Jun 20 '21 at 06:32
  • @ysth Thank you, your query has the expected performance. I am shocked that what is a simple query with, in my opinion, an _extremely_ clear use for an index in the `where` clause would result in the query optimiser forcefully avoiding using that very index -- even `force index (user_id)` which plenty of people would immediately scrutinise isn't enough conviction for it to change its mind. – Milos Ivanovic Jun 20 '21 at 07:18
1

OR and IN ( SELECT ... ) are rarely optimized well.

Flattening the structure (less nesting of SELECTs) seems to help.

( SELECT p.*
    FROM relationships AS r
    JOIN users AS u  ON u.id = r.is_following_user_id
    JOIN posts AS p  ON p.user_id = u.id
    WHERE r.user_id = 922
) UNION ALL   -- see note
( SELECT *
    FROM posts
    WHERE user_id = 922
)

Note: UNION ALL is likely to be faster than UNION DISTINCT. Assuming no user is following theirself, it is 'correct' to use ALL.

(I have listed the tables in the JOIN in the order that the Optimizer will decide to use. As per ysth, I like to 'think like the optimizer'.)

Dissecting the queries, with an eye to MySQL's Optimizer...

  • relationships will be looked at first -- because it is the only thing found in WHERE. INDEX(user_id, ...) makes that work well.
  • An added boost is that INDEX(user_id, is_following_user_id) is "covering".
  • The next table will be users with PRIMARY KEY(id).
  • Finally (for the first SELECT), posts via INDEX(user_id).
  • The other select also uses INDEX(user_id). (However, in many UNIONs, a different index is likely to be used. This is why UNION is often a significant optimization over OR.)
Rick James
  • 135,179
  • 13
  • 127
  • 222
0

I would even take the pre-query (from query) a step further. COMPLETELY REMOVE the users join. No need for it. The relationships table has an ID for who is following, and the user doing the following. That can be simplified to

select straight_join 
      p.*
   from 
      posts p 
         JOIN ( select r.is_following_user_id
                   from relationships r
                   where r.user_id = 922
                union all
                select 922 ) ids
         on p.user_id = ids.is_following_user_id

Since the inner query you were filtering on the relationship USER_ID = 922, the "is_following_user_id" IS the other person you want. No need to join to users table just to get the "id" column. It's still and ID column, just longer name. The union pulls the 922 as intended. So now all your IDs are simplified, no JOIN. Outer results pulls for those IDs.

DRapp
  • 47,638
  • 12
  • 72
  • 142