Optimizing a PHP page: MySQL bottleneck

Question

I have a page that is taking 37 seconds to load. While it is loading it pegs MySQL's CPU usage through the roof. I did not write the code for this page and it is rather convoluted so the reason for the bottleneck is not readily apparent to me.

I profiled it (using kcachegrind) and find that the bulk of the time on the page is spent doing MySQL queries (90% of the time is spent in 25 different mysql_query calls).

The queries take the form of the following with the tag_id changing on each of the 25 different calls:

SELECT * FROM tbl_news WHERE news_id
 IN (select news_id from
 tbl_tag_relations WHERE tag_id = 20)

Each query is taking around 0.8 seconds to complete with a few longer delays thrown in for good measure... thus the 37 seconds to completely load the page.

My question is, is it the way the query is formatted with that nested select that is causing the problem? Or could it be any one of a million other things? Any advice on how to approach tackling this slowness is appreciated.

Running EXPLAIN on the query gives me this (but I'm not clear on the impact of these results... the NULL on primary key looks like it would be bad, yes? The number of results returned seems high to me as well as only a handful of results are returned in the end):

1    PRIMARY     tbl_news   ALL NULL    NULL    NULL    NULL    1318    Using where
2   DEPENDENT SUBQUERY  tbl_tag_relations   ref FK_tbl_tag_tags_1   FK_tbl_tag_tags_1   4   const   179 Using where

can you post the indexes on the tbl_news and tbl_tag_relations tables? the explain says which indexes will be used, but not what their components are. I'd guess this is a 'missing index' issue (though running 25 separate queries isn't good either) — Jonathan Rupp, Apr 26 '09 at 00:00
You were correct, Jonathan. It was a missing index issue. Cletus pointed out which one and his suggestion resulted in a 10 fold speed up. — rg88, Apr 26 '09 at 04:32

score 5 · Accepted Answer · edited May 23 '17 at 12:01

5

I'e addressed this point in Database Development Mistakes Made by AppDevelopers. Basically, favour joins to aggregation. IN isn't aggregation as such but the same principle applies. A good optimize will make these two queries equivalent in performance:

SELECT * FROM tbl_news WHERE news_id
 IN (select news_id from
 tbl_tag_relations WHERE tag_id = 20)

and

SELECT tn.*
FROM tbl_news tn
JOIN tbl_tag_relations ttr ON ttr.news_id = tn.news_id
WHERE ttr.tag_id = 20

as I believe Oracle and SQL Server both do but MySQL doesn't. The second version is basically instantaneous. With hundreds of thousands of rows I did a test on my machine and got the first version to sub-second performance by adding appropriate indexes. The join version with indexes is basically instantaneous but even without indexes performs OK.

By the way, the above syntax I use is the one you should prefer for doing joins. It's clearer than putting them in the WHERE clause (as others have suggested) and the above can do certain things in an ANSI SQL way with left outer joins that WHERE conditions can't.

So I would add indexes on the following:

tbl_news (news_id)
tbl_tag_relations (news_id)
tbl_tag_relations (tag_id)

and the query will execute almost instantaneously.

Lastly, don't use * to select all the columns you want. Name them explicitly. You'll get into less trouble as you add columns later.

edited May 23 '17 at 12:01

Community

1
1

answered Apr 26 '09 at 00:46

cletus

616,129
168
910
942

+1 Your point is taken well, but I don't think cletus ever uses words like disabuse :) – cgp Apr 26 '09 at 00:59
I would say that the above syntax is more explicit, not necessarily more clear. – cgp Apr 26 '09 at 01:01
It's a good word. :) Alas I had to remove it since I went and checked and surprisingly MySQL does treat IN differently to joins (which I don't believe Oracle does). Edited accordingly. – cletus Apr 26 '09 at 01:03
Interesting. There were already keys set for all of those (by tbl_rows I assume you mean tbl_news) except for news_id on tbl_tag_relations. Adding only that index cut page load time in half... it's now at 16 seconds. I'm amazed that such a tiny change has such a massive impact on performance. Thank you very much... I'm off to read up on indexes a bit. – rg88 Apr 26 '09 at 01:24
Er yeah, getting my wires crossed with something else. Fixed. I'm surprised your query still takes time in the seconds range. How many rows in tbl_news and tbl_tag_relations? I tried with 100k and 600k records equivalent and get it in 0.7 seconds. Of course, the joins are instantaneous. – cletus Apr 26 '09 at 01:36
Well don't forget that I'm talking about total page load time taking so long, not just the sql queries. When I profiled the page I saw that those queries were what was taking up the bulk of the time. Currently the bulk is taken up by calls to our ad server (third party) which is something I can live with. Your index suggestion cut queries from taking 90% of the load time to taking only 9%... that’s pretty solid (and thanks very much for that). For the record tbl_news has about 1,400 records and tbl_tag_relations has about 9,000. – rg88 Apr 26 '09 at 04:28
Ah gotcha. That makes more sense (total load time vs query time). Glad it worked out for you. – cletus Apr 26 '09 at 05:02

Andy Moore · Answer 2 · 2009-04-26T00:13:10.153

The SQL Query itself is definitely your bottleneck. The query has a sub-query in it, which is the IN(...) portion of the code. This is essentially running two queries at once. You can likely halve (or more!) your SQL times with a JOIN (similar to what d03boy mentions above) or a more targeted SQL query. An example might be:

SELECT * 
FROM tbl_news, tbl_tag_relations 
WHERE tbl_tag_relations.tag_id = 20 AND
tbl_news.news_id = tbl_tag_relations.news_id

To help SQL run faster you also want to try to avoid using SELECT *, and only select the information you need; also put a limiting statement at the end. eg:

SELECT news_title, news_body 
... 
LIMIT 5;

You also will want to look into the database schema itself. Make sure you are indexing all of the commonly referred to columns so that the queries will run faster. In this case, you probably want to check your news_id and tag_id fields.

Finally, you will want to take a look at the PHP code and see if you can make one single all-encompassing SQL query instead of iterating through several seperate queries. If you post more code we can help with that, and it will probably be the single greatest time savings for your posted problem. :)

This isn't the problem. Your query will be optimised and executed the same as the original. You've just rewritten IN as a join, which is arguably wrong because the second table isn't selected at all. — cletus, Apr 26 '09 at 00:42
The index Cletus suggested really did the trick. You are right about the SELECT * and I'm trying to find a simple way to take care of that but the code is pretty wild so changing it will probably affect queries I've not yet imagined... I'll have to be careful. — rg88, Apr 26 '09 at 04:30

Joe Phillips · Answer 3 · 2009-04-26T00:13:19.110

2

If I understand correctly, this is just listing the news stories for a specific set of tags.

First of all, you really shouldn't ever SELECT *
Second, this can probably be
accomplished within a single query, thus reducing the overhead cost of
multiple queries. It seems like it is getting fairly trivial data so it could be retrieved within a single call instead of 20.
A better approach to using IN might be to use a JOIN with a WHERE condition instead. When using an IN it will basically be a lot of OR statements.
Your tbl_tag_relations should definitely have an index on tag_id

edited Apr 26 '09 at 00:13

answered Apr 25 '09 at 23:54

Joe Phillips

49,743
32
103
159

If you write a statement like that as an answer, you should explain why (at least in a few words) or provide better examples. Not everyone knows these things, and the OP clearly said that s/he doesn't really know much about the database. – Jason Coco Apr 25 '09 at 23:59
I'm down with giving examples, and frankly, I can do without the listing of the fields. – cgp Apr 26 '09 at 00:01
Is there a specific reason why SELECT * is undesirable? How is SELECT * different from explicitly listing every single field in the table? – Calvin Apr 26 '09 at 01:04
1

It's not different. It makes it difficult to know what those fields are without looking at the table though. It's poor documentation. It may cause more data to be sent, hindering speed, etc. – Joe Phillips Apr 26 '09 at 01:55

score 1 · Answer 4 · answered Apr 26 '09 at 00:04

1

select * 
 from tbl_news, tbl_tag_relations 
 where 
      tbl_tag_relations.tag_id = 20 and 
      tbl_news.news_id = tbl_tag_relations.news_id 
 limit 20

I think this gives the same results, but I'm not 100% sure. Sometimes simply limiting the results helps.

answered Apr 26 '09 at 00:04

cgp

41,026
12
101
131

It will give all fields in tbl_tag_relations which technically wouldn't be the same results but there's probably only two fields in there anyway – Joe Phillips Apr 26 '09 at 00:06
And limiting the results would provide less results. – Joe Phillips Apr 26 '09 at 00:14
Oh wait, I forgot, there might not be 20 results to begin with, which is certainly possible, so it might be the same. :) – cgp Apr 26 '09 at 00:16
That does give the same results and much faster to boot! – rg88 Apr 26 '09 at 04:31

score 1 · Answer 5 · answered Apr 26 '09 at 00:46

Unfortunately MySQL doesn't do very well with uncorrelated subqueries like your case shows. The plan is basically saying that for every row on the outer query, the inner query will be performed. This will get out of hand quickly. Rewriting as a plain old join as others have mentioned will work around the problem but may then cause the undesired affect of duplicate rows.

For instance the original query would return 1 row for each qualifying row in the tbl_news table but this query:

SELECT news_id, name, blah
FROM tbl_news n
JOIN tbl_tag_relations r ON r.news_id = n.news_id
WHERE r.tag_id IN (20,21,22)

would return 1 row for each matching tag. You could stick DISTINCT on there which should only have a minimal performance impact depending on the size of the dataset.

Not to troll too badly, but most other databases (PostgreSQL, Firebird, Microsoft, Oracle, DB2, etc) would handle the original query as an efficient semi-join. Personally I find the subquery syntax to be much more readable and easier to write, especially for larger queries.

I have used PostgreSQL and it is powerful but changing databases simply isn't an option. — rg88, Apr 26 '09 at 01:58

Optimizing a PHP page: MySQL bottleneck

5 Answers5