
I have created two tables with some dummy data. The table ARTICLES consists of id, author_id, title, and description, and the table AUTHOR consists of author_id, name, and article_list.
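For reference, a minimal sketch of that schema; the column types and keys are assumptions, only the table and column names come from the question:

CREATE TABLE author (
    author_id    INT PRIMARY KEY,
    name         VARCHAR(255),
    article_list TEXT              -- the list of article ids mentioned in the question
);

CREATE TABLE articles (
    id          INT PRIMARY KEY,
    author_id   INT,
    title       VARCHAR(255),
    description TEXT
);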

As per the application flow, I would first fetch the list of authors, which gives me the author name, article list, and id. When the user navigates into an author, I can get the list of all of that author's articles in two ways.

First

SELECT * FROM  articles WHERE  author_id = 100;

and secondly, if I keep the ids of all of the author's articles as a list inside my AUTHOR table, then I can use

SELECT * 
  FROM  articles 
 WHERE id IN (100, 1100, 2100, 3100, 4100, 5100, 6100, 
            7100, 8100, 9100, 10100, 11100, 12100, 13100, 
            14100, 15100, 16100, 17100, 18100, 19100, 20100,
            21100, 22100, 23100, 24100, 25100, 26100, 27100,
            28100, 29100, 30100, 31100, 32100, 33100, 34100);

The first query took 0.0329 sec while the second query took 0.0017 sec.

I am not able to understand how it is possible that the first query takes more time than the second query.

All I know is that the second query will execute something like

SELECT * 
  FROM articles 
 WHERE id = 100 
    OR id = 1100 
    OR id = 2100... and so on
– Saurabh Pandey
  • Do you have an index on `author_id`? – Nick Jan 15 '19 at 06:24
  • You can ask the database to `EXPLAIN` how the query is being run (see the sketch after these comments). It all depends on statistics and other things. – Sami Kuhmonen Jan 15 '19 at 06:25
  • I don't have any index on author_id. – Saurabh Pandey Jan 15 '19 at 06:28
  • DB query performance depends on a lot of factors, most notably on the presence and definition of indexes and the size of tables in relation to available memory for the DB engine. For example, if the ARTICLES table has an index on `id` but none on `author_id`, it is pretty normal that the second query performs faster. If the whole table fits in memory then the first query may incur the cost of reading it, while the second benefits from the data being in memory already. – Hans-Martin Mosner Jan 15 '19 at 06:30
  • Thanks a lot. After adding the index on author_id in the ARTICLES table, the first query takes 0.0008 seconds while the second query still takes 0.0019 seconds. Anything else I should know for optimizing this? – Saurabh Pandey Jan 15 '19 at 06:36
  • Do not store article ids within a field in the author table. The idea is really that bad, see https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad – Shadow Jan 15 '19 at 06:41
  • @Shadow, your statement is true, but not related to this post. I don't think the OP is storing a comma-separated list of id's in one column. – Bill Karwin Jan 15 '19 at 06:53
  • For future reference, questions about query performance always require SHOW CREATE TABLE statements for all relevant tables, as well as the EXPLAIN for the given queries. – Strawberry Jan 15 '19 at 08:21
  • @BillKarwin I disagree, see his statement of `"if I keep the ids of all of the author's articles as a list inside my AUTHOR table"` – Shadow Jan 15 '19 at 12:17
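
As the comments suggest, `EXPLAIN` is the quickest way to see how MySQL plans each query. A minimal sketch; the exact output depends on the server version and the data, so the values noted in the SQL comments are typical expectations, not guarantees:

EXPLAIN SELECT * FROM articles WHERE author_id = 100;
-- Without an index on author_id, the type column is typically ALL
-- (a full table scan) and key is NULL.

EXPLAIN SELECT * FROM articles WHERE id IN (100, 1100, 2100);
-- With id as the primary key, type is typically range and key is PRIMARY,
-- so only the listed rows are looked up.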

2 Answers


Caching.

If you start up the server, then run a query, nothing is yet in the buffer_pool (or table cache or ...). So several files need opening and several blocks need reading. 32.9ms could indicate that you needed to hit the disk (if HDD) 3 times at about 10ms each.

If you run the identical query a second time, everything will be cached, and it will take only a few milliseconds, typically under 10ms.

Since the first query primed the cache with some stuff, the second query found most, maybe all, the blocks it needed. So, it was probably CPU-only, no I/O. 1.7ms is reasonable.

A possible issue... Do you have the "Query cache" turned on? If so, then (in certain situations) a subsequent execution of any SELECT will find the resultset in the QC and return very fast, possibly < 1ms. One way to be sure to avoid the QC (for realistic timing) is to do SELECT SQL_NO_CACHE ....
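
A small sketch of that, using the question's query (the query cache itself was removed in MySQL 8.0, so SQL_NO_CACHE only matters on older servers):

-- Bypass the query cache so the timing reflects actual execution work.
SELECT SQL_NO_CACHE * FROM articles WHERE author_id = 100;

-- Run it once more and take the second timing, so the comparison is made
-- with a warm buffer pool rather than one cold run against one warm run.
SELECT SQL_NO_CACHE * FROM articles WHERE author_id = 100;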

The OR query you present is optimized into the IN that you present. That is, they end up being identical. (Using OR with different columns is a performance killer; that is not the situation here.)
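
One way to convince yourself of that equivalence (a sketch; the point is that both statements end up with the same plan):

EXPLAIN SELECT * FROM articles WHERE id IN (100, 1100, 2100);
EXPLAIN SELECT * FROM articles WHERE id = 100 OR id = 1100 OR id = 2100;
-- Both should report the same access type and key. On reasonably recent
-- MySQL versions, running SHOW WARNINGS right after an EXPLAIN also shows
-- the statement as the optimizer rewrote it.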

Timing tips:

  • Use SELECT SQL_NO_CACHE ... (to avoid the QC)
  • Run the query twice and use the second timing. (This avoids other caching issues.)

Now to analyze what happens if you do not have any index on author_id.

  • Both queries will scan the entire table.
  • Apparently it is a relatively small table since you could read it all in 32.9ms.
  • The diff between 32.9 and 1.7 is just caching. (I have a Rule of Thumb that says that the ratio between those is 10. For this metric, 32.9/1.7 is close enough to my RoT to support it.)

If you had INDEX(author_id), both queries would probably run faster, cached or not.
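
A minimal sketch of adding that index (the index name is arbitrary):

ALTER TABLE articles ADD INDEX idx_author_id (author_id);

-- With the index in place, the author_id lookup no longer needs to scan
-- the whole table; EXPLAIN should now show the new index under key.
EXPLAIN SELECT * FROM articles WHERE author_id = 100;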

– Rick James

This can be because there can be thousands of rows with different author_id values, and for:

 SELECT * FROM  articles WHERE  author_id = 100;

every row needs to be scanned, because the condition is applied to every row of that column.

And for:

 SELECT * FROM  articles WHERE id IN (100, 1100, 2100, 3100, 4100, 5100, 6100, 7100, 8100, 9100, 10100, 11100, 12100, 13100, 14100, 15100, 16100, 17100, 18100, 19100, 20100, 21100, 22100, 23100, 24100, 25100, 26100, 27100, 28100, 29100, 30100, 31100, 32100, 33100, 34100);

there is only a limited set of ids, and a single check on id can be resolved faster from memory.

– Wijdan