68

I'm trying to optimize some of the database queries in my Rails app and I have several that have got me stumped. They are all using an IN in the WHERE clause and are all doing full table scans even though an appropriate index appears to be in place.

For example:

SELECT `user_metrics`.* FROM `user_metrics` WHERE (`user_metrics`.user_id IN (N,N,N,N,N,N,N,N,N,N,N,N))

performs a full table scan and EXPLAIN says:

select_type: simple
type: all
extra: using where
possible_keys: index_user_metrics_on_user_id  (which is an index on the user_id column)
key: (none)
key_length: (none)
ref: (none)
rows: 208

Are indexes not used when an IN statement is used or do I need to do something differently? The queries here are being generated by Rails so I could revisit how my relationships are defined, but I thought I'd start with potential fixes at the DB level first.

Vladimir Vagaytsev
  • 2,871
  • 9
  • 33
  • 36
jasonlong
  • 801
  • 1
  • 7
  • 6

5 Answers5

52

See How MySQL Uses Indexes.

Also validate whether MySQL still performs a full table scan after you add an additional 2000-or-so rows to your user_metrics table. In small tables, access-by-index is actually more expensive (I/O-wise) than a table scan, and MySQL's optimizer might take this into account.

Contrary to my previous post, it turns out that MySQL is also using a cost-based optimizer, which is very good news - that is, provided you run your ANALYZE at least once when you believe that the volume of data in your database is representative of future day-to-day usage.

When dealing with cost-based optimizers (Oracle, Postgres, etc.), you need to make sure to periodically run ANALYZE on your various tables as their size increases by more than 10-15%. (Postgres will do this automatically for you, by default, whereas other RDBMSs will leave this responsibility to a DBA, i.e. you.) Through statistical analysis, ANALYZE will help the optimizer get a better idea of how much I/O (and other associated resources, such as CPU, needed e.g. for sorting) will be involved when choosing between various candidate execution plans. Failure to run ANALYZE may result in very poor, sometimes disastrous planning decisions (e.g. millisecond-queries taking, sometimes, hours because of bad nested loops on JOINs.)

If performance is still unsatisfactory after running ANALYZE, then you will typically be able to work around the issue by using hints, e.g. FORCE INDEX, whereas in other cases you might have stumbled over a MySQL bug (e.g. this older one, which could have bitten you were you to use Rails' nested_set).

Now, since you are in a Rails app, it will be cumbersome (and defeat the purpose of ActiveRecord) to issue your custom queries with hints instead of continuing to use the ActiveRecord-generated ones.

I had mentioned that in our Rails application all SELECT queries dropped below 100ms after switching to Postgres, whereas some of the complex joins generated by ActiveRecord would occasionally take as much as 15s or more with MySQL 5.1 because of nested loops with inner table scans, even when indices were available. No optimizer is perfect, and you should be aware of the options. Other potential performance issues to be aware of, besides query plan optimization, are locking. This is outside the scope of your problem though.

Jen S
  • 4,465
  • 1
  • 32
  • 28
vladr
  • 65,483
  • 18
  • 129
  • 130
17

Try forcing this index:

SELECT `user_metrics`.*
FROM `user_metrics` FORCE INDEX (index_user_metrics_on_user_id)
WHERE (`user_metrics`.user_id IN (N,N,N,N,N,N,N,N,N,N,N,N))

I just checked, it does use an index on exactly same query:

EXPLAIN EXTENDED
SELECT * FROM tests WHERE (test IN ('test 1', 'test 2', 'test 3', 'test 4', 'test 5', 'test 6', 'test 7', 'test 8', 'test 9'))

1, 'SIMPLE', 'tests', 'range', 'ix_test', 'ix_test', '602', '', 9, 100.00, 'Using where'
Quassnoi
  • 413,100
  • 91
  • 616
  • 614
9

Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.)

What percentage of rows match your IN clause?

mluebke
  • 8,588
  • 7
  • 35
  • 31
  • My initial tests were on a table with only ~200 rows, so the percentage is relatively high. However, I've added 5000 more rows to test with and it still performs full table scans. The percentage now would be quite minuscule. – jasonlong Feb 26 '09 at 13:02
  • In my experience, the cutoff is _usually_ between 10% and 30%. – Rick James May 31 '17 at 04:49
8

I know I'm late for the party. But hope I can help someone else with similar problem.

Lately, I'm having the same problem. Then I decide to use self-join-thing to solve my problem. The problem is not MySQL. Problem is us. The return type from subquery is difference from our table. So we must cast the type of subquery to the type of select column. Below is example code:

select `user_metrics`.* 
from `user_metrics` um 
join (select `user_metrics`.`user_id` in (N, N, N, N) ) as temp 
on um.`user_id` = temp.`user_id`

Or my own code:

Old: (Not use index: ~4s)

SELECT 
    `jxm_character`.*
FROM
    jxm_character
WHERE
    information_date IN (SELECT DISTINCT
            (information_date)
        FROM
            jxm_character
        WHERE
            information_date >= DATE_SUB('2016-12-2', INTERVAL 7 DAY))
        AND `jxm_character`.`ranking_type` = 1
        AND `jxm_character`.`character_id` = 3146089;

New: (Use index: ~0.02s)

SELECT 
    *
FROM
    jxm_character jc
        JOIN
    (SELECT DISTINCT
        (information_date)
    FROM
        jxm_character
    WHERE
        information_date >= DATE_SUB('2016-12-2', INTERVAL 7 DAY)) AS temp 
        ON jc.information_date = STR_TO_DATE(temp.information_date, '%Y-%m-%d')
        AND jc.ranking_type = 1
        AND jc.character_id = 3146089;

jxm_character:

  • Records: ~3.5M
  • PK: jxm_character(information_date, ranking_type, character_id)

SHOW VARIABLES LIKE '%version%';

'protocol_version', '10'
'version', '5.1.69-log'
'version_comment', 'Source distribution'

Last note: Make sure you understand MySQL index left-most rule.

P/s: Sorry for my bad English. I post my code (production, of course) to clear my solution :D.

Liem Le
  • 581
  • 7
  • 17
  • `IN ( SELECT ... )` has been notorious for being poorly optimized. And you did the 'right thing' to turn it into a `JOIN`. – Rick James May 31 '17 at 04:47
  • You really saved my day :D . My table is getting around 1million rows and started to get very slow and this really saved my day! – Aung Myint Thein Jan 08 '21 at 04:49
-1

Does it get any better if you remove the redundant brackets around the where clause?

Although it could just be that because you've only got 200 or so rows, it decided a table scan would be faster. Try with a table with more records in it.

Paul Tomblin
  • 179,021
  • 58
  • 319
  • 408
  • Extra parens don't seem to matter. Also, the small dataset doesn't seem to matter - I added an additional 5000 records and still scans them all. – jasonlong Feb 25 '09 at 18:41