How to optimize a count SQL query on a big table

Question

I have a big table on mysql (innodb) which contains products assets (13 millions of rows). Here a little schema of my database :

product <-many2one-- file_item --one2many--> family --many2one--> download_type

The *file_item* table is the big table with millions of rows. I try to count products by download types with the following sql query :

select t.name as type, 
count(p.product_id) as n 
from file_item p 
inner join family f on f.id = p.family_id 
inner join type t on f.id_type = t.id 
group by t.id order by t.name;

There are 3 indexes on *file_item* table:

product_family_idx (product_id, family_id)
family_idx (family_id)
product_idx (product_id) Explain output :

+----+-------------+-------+--------+-----------------------------------+---------+---------+-------------------+----------+---------------------------------+
| id | select_type | table | type   | possible_keys                     | key     | key_len | ref               | rows     | Extra                           |
+----+-------------+-------+--------+-----------------------------------+---------+---------+-------------------+----------+---------------------------------+
|  1 | SIMPLE      | p     | ALL    | FAMILY_IDX,PRODUCT_FAMILY_IDX     | NULL    | NULL    | NULL              | 13862870 | Using temporary; Using filesort | 
|  1 | SIMPLE      | f     | eq_ref | PRIMARY,TYPE_ID                   | PRIMARY | 4       | MEDIA.p.FAMILY_IDX|        1 |                                 | 
|  1 | SIMPLE      | t     | eq_ref | PRIMARY                           | PRIMARY | 4       | MEDIA.f.TYPE_ID   |        1 |                                 | 
+----+-------------+-------+--------+-----------------------------------+---------+---------+-------------------+----------+---------------------------------+

The query takes more than 1 hour to return the results. Please how I can optimize the query ?!

Are the tables indexed well? Please provide schema with indexes and EXPLAIN of that query. Thanks! — Wiseguy, Aug 24 '11 at 16:21
Here's an easy way to find the bottleneck: run the query piece by piece. Just do "select count() from file_item" and make sure that returns quickly. Then add the family join. Then the type join. Then the group by. Then the order by. Figure out what addition causes the slowdown and go from there. — Derek, Aug 24 '11 at 16:48
I think the `ORDER BY` is killing it if there's no index on `t.name`. If you simply remove the `ORDER BY`, is it much faster? — Wiseguy, Aug 24 '11 at 16:52
Make sure to not just look at you file_item table. "familiy" might need some index including "id_type" to help the join and "type" might need an index on "name" to help the sort. (Or not ... as @Derek says, look at the pieces and their exceution plans) — alun, Aug 24 '11 at 16:55
Given the query, and without the schema, I understand that you have a file_item->many2one->family->many2one->type (that is, only 1 family per file_item and 1 type per family), is that correct ? — Arnaud Le Blanc, Aug 24 '11 at 17:10
@Wiseguy I tried to remove the `ORDER BY`, and the query takes always a while to returns the results. — juliusdev, Aug 24 '11 at 17:12
@arnaud576875 A product have files and each file have 1 family. Family are typed. I have 10 download types, 5000 families, 13M files, and 250K products .... I hope the count will not return 1 :-) — juliusdev, Aug 24 '11 at 17:23

score 5 · Answer 1 · answered Aug 24 '11 at 17:18

Here is your original query:

select t.name as type,  
count(p.product_id) as n  
from file_item p  
inner join family f on f.id = p.family_id  
inner join type t on f.id_type = t.id  
group by t.id order by t.name;

You will need to make two major changes:

MAJOR CHANGE # 1 : Refactor the Query

SELECT A.ProductCount,B.name type
FROM
(
    SELECT id_type id,COUNT(1) ProductCount
    FROM
    (
        SELECT p.id_type
        FROM (SELECT family_id,id_type FROM file_item) p
        INNER JOIN (SELECT id FROM family) f on f.id = p.family_id
    ) AA
    GROUP BY id_type
) A
INNER JOIN type B USING (id)
ORDER BY B.name;

MAJOR CHANGE # 2 : Create Indexes That Will Support the Refactored Query

ALTER TABLE file_item ADD INDEX family_type_idx (family_id,id_type);

Give it a Try !!!

In your query, GROUP BY and ORDER BY clauses are evaluated after all joins. The trick is to do two things: 1) force the query to have smaller temp tables using only the needed keys, 2) perform JOINs dead last. I learned it from this video : http://youtu.be/ZVisY-fEoMw. I used this technique to answer another very complex question involving thousansd of rows in StackOverflow : http://stackoverflow.com/questions/5983156/fetching-a-single-row-from-join-table/6023217#6023217 — RolandoMySQLDBA, Aug 24 '11 at 17:39

Arnaud Le Blanc · Answer 2 · 2011-08-24T18:13:44.103

Lets decompose the query into parts:

First, fetch each row of file_item => 13M rows
For each returned row, fetch a row of family matching f.id = p.family_id. => 13M fetches, 13M rows
For each returned row, fetch a row of type matching f.id_type = t.id. => 13M fetches, 13M rows
Group by type.id => 10 rows
Sort by type.name => 10 rows to sort

As you can see, your query needs fetch 13M rows from family and 13M rows from type.

You should start be reducing the number of row fetches needed to execute the query:

Assuming that f.id_type is a non-NULL foreign key, you can change the inner join type t to a left join type t. Then, change group by t.id to group by f.id_type.

Grouping on the f table instead of the t table and changing the inner join to a left join allows MySQL to execute the group by before fetching rows from t.

group by drastically reduces the number of rows, so this drastically reduce the number of fetches from t too:

First, fetch each row of file_item => 13M rows
For each returned row, fetch a row of family matching f.id = p.family_id. => 13M fetches, 13M rows
Group by type.id => 10 rows
For each returned row, fetch a row of type matching f.id_type = t.id. => 10 fetches, 10 rows
Sort by type.name => 10 rows to sort

The result is that the query already fetches 13M less rows.

You can reduce that even more by denormalizing the schema a little:

If you add a family_type_id column in file_item, you could rewrite your query like this:

SELECT count(1)
FROM file_item p
JOIN type t ON t.id = p.family_type_id
GROUP BY p.family_type_id
ORDER BY t.name

With an index on file_item.family_type_id, this query should execute in milliseconds.

How to optimize a count SQL query on a big table

2 Answers2