11

so i have two tables that i need to be able to get counts for. One of them holds the content and the other on the relationship between it and the categories table. Here are the DDl :

CREATE TABLE content_en (
    id int(11) NOT NULL AUTO_INCREMENT,
    title varchar(100) DEFAULT NULL,
    uid int(11) DEFAULT NULL,
    date_added int(11) DEFAULT NULL,
    date_modified int(11) DEFAULT NULL,
    active tinyint(1) DEFAULT NULL,
    comment_count int(6) DEFAULT NULL,
    orderby tinyint(4) DEFAULT NULL,
    settings text,
    permalink varchar(255) DEFAULT NULL,
    code varchar(3) DEFAULT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY id (id),
    UNIQUE KEY id_2 (id) USING BTREE,
    UNIQUE KEY combo (id,active) USING HASH,
    KEY code (code) USING BTREE
) ENGINE=MyISAM AUTO_INCREMENT=127126 DEFAULT CHARSET=utf8;

and for the other table

CREATE TABLE content_page_categories (
    catid int(11) unsigned NOT NULL,
    itemid int(10) unsigned NOT NULL,
    main tinyint(1) DEFAULT NULL,
    KEY itemid (itemid),
    KEY catid (catid),
    KEY combo (catid,itemid) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

The query i'm running is :

SELECT count(*) 
FROM content_page_categories USE INDEX (combo) 
INNER JOIN content_en USE INDEX (combo) ON (id = itemid) 
WHERE catid = 1 AND active = 1 ;

Both tables have 125k rows and i can't get the count query to run fast enough. Best timing i get is 0.175 which is horrible for this ammount of rows. Selecting 100 rows is as fast as 0.01. I have tried like 3 or 4 variations of this query but in the end the timings are just about the same. Also, if i don't do USE INDEX timing goes 3x slower.

Also tried the following : SELECT COUNT( *) FROM content_page_categories INNER JOIN content_en ON id=itemid AND catid = 1 AND active = 1 WHERE 1

and :

SELECT SQL_CALC_FOUND_ROWS catid,content_en.* FROM content_page_categories INNER JOIN content_en ON (id=itemid) WHERE catid =1 AND active = 1 LIMIT 1; SELECT FOUND_ROWS();

Index definitions : content_en 0 PRIMARY 1 id A 125288 BTREE
content_en 0 id 1 id A 125288 BTREE
content_en 0 id_2 1 id A 125288 BTREE
content_en 0 combo 1 id A BTREE
content_en 0 combo 2 active A YES BTREE
content_en 1 code 1 code A 42 YES BTREE

content_page_categories 1 itemid 1 itemid A 96842 BTREE
content_page_categories 1 catid 1 catid A 10 BTREE
content_page_categories 1 combo 1 catid A 10 BTREE
content_page_categories 1 combo 2 itemid A 96842 BTREE

Any ideas?

[EDIT]

i have uploaded sample data for these tables here

result of explain :

mysql> explain SELECT count(*) FROM  content_page_categories USE INDEX (combo) I<br>
NNER JOIN content_en USE INDEX (combo) ON (id = itemid) WHERE  catid = 1 AND act<br>
ive = 1 ;

+----+-------------+-------------------------+-------+---------------+-------+---------+--------------------------+--------+--------------------------+
| id | select_type | table                   | type  | possible_keys | key   | key_len | ref                      | rows   | Extra                    |
+----+-------------+-------------------------+-------+---------------+-------+---------+--------------------------+--------+--------------------------+
|  1 | SIMPLE      | content_en              | index | combo         | combo | 6 | NULL                     | 125288 | Using where; Using index |
|  1 | SIMPLE      | content_page_categories | ref   | combo         | combo | 8 | const,mcms.content_en.id |      1 | Using where; Using index |
+----+-------------+-------------------------+-------+---------------+-------+---------+--------------------------+--------+--------------------------+
2 rows in set (0.00 sec)
RiaD
  • 46,822
  • 11
  • 79
  • 123
mbouclas
  • 634
  • 1
  • 12
  • 30

5 Answers5

12

I downloaded your data and tried a few experiments. I'm running MySQL 5.6.12 on a CentOS virtual machine on a Macbook Pro. The times I observed can be used for comparison, but your system may have different performance.

Base case

First I tried without the USE INDEX clauses, because I avoid optimizer overrides where possible. In most cases, a simple query like this should use the correct index if it's available. Hard-coding the index choice in a query makes it harder to use a better index later.

I also use correlation names (table aliases) to make the query more clear.

mysql> EXPLAIN SELECT COUNT(*) FROM content_en AS e  
INNER JOIN content_page_categories AS c ON c.itemid = e.id 
WHERE c.catid = 1 AND e.active = 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: c
         type: ref
possible_keys: combo,combo2
          key: combo
      key_len: 4
          ref: const
         rows: 71198
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: e
         type: eq_ref
possible_keys: PRIMARY,combo2,combo
          key: PRIMARY
      key_len: 4
          ref: test.c.itemid
         rows: 1
        Extra: Using where
  • This executed in 0.36 seconds.

Covering index

I'd like to get "Using index" on the second table as well, so I need an index on (active, id) in that order. I had to USE INDEX in this case to persuade the optimizer not to use the primary key.

mysql> ALTER TABLE content_en ADD KEY combo2 (active, id);

mysql> explain SELECT COUNT(*) FROM content_en AS e USE INDEX (combo2) 
INNER JOIN content_page_categories AS c ON c.itemid = e.id 
WHERE c.catid = 1 AND e.active = 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: c
         type: ref
possible_keys: combo,combo2
          key: combo
      key_len: 4
          ref: const
         rows: 71198
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: e
         type: ref
possible_keys: combo2
          key: combo2
      key_len: 6
          ref: const,test.c.itemid
         rows: 1
        Extra: Using where; Using index

The rows reported by EXPLAIN is an important indicator of how much work it's going to take to execute the query. Notice the rows in the above EXPLAIN is only 71k, much smaller than the 125k rows you got when you scanned the content_en table first.

  • This executed in 0.44 seconds. This is unexpected, because usually a query using a covering index is an improvement.

Convert tables to InnoDB

I tried the same covering index solution as above, but with InnoDB as the storage engine.

mysql> ALTER TABLE content_en ENGINE=InnoDB;
mysql> ALTER TABLE content_page_categories ENGINE=InnoDB;

This had the same EXPLAIN report. It took 1 or 2 iterations to warm the buffer pool, but then the performance of the query tripled.

  • This executed in 0.16 seconds.

  • I also tried removing the USE INDEX, and the time increased slightly, to 0.17 seconds.

@Matthew's solution with STRAIGHT_JOIN

mysql> SELECT straight_join count(*) 
 FROM content_en 
 INNER JOIN content_page_categories use index (combo) 
  ON (id = itemid) 
 WHERE catid = 1 AND active = 1;
  • This executed in 0.20 - 0.22 seconds.

@bobwienholt's solution, denormalization

I tried the solution proposed by @bobwienholt, using denormalization to copy the active attribute to the content_page_categories table.

mysql> ALTER TABLE content_page_categories ADD COLUMN active TINYINT(1);
mysql> UPDATE content_en JOIN content_page_categories ON id = itemid 
    SET content_page_categories.active = content_en.active;
mysql> ALTER TABLE content_page_categories ADD KEY combo3 (catid,active);
mysql> SELECT COUNT(*) FROM content_page_categories WHERE catid = 1 and active = 1;

This executed in 0.037 - 0.044 seconds. So this is better, if you can maintain the redundant active column in sync with the value in the content_en table.

@Quassnoi's solution, summary table

I tried the solution proposed by @Quassnoi, to maintain a table with precomputed counts per catid and active. The table should have very few rows, and looking up the counts you need are primary key lookups and require no JOINs.

mysql> CREATE TABLE page_active_category (
 active INT NOT NULL, 
 catid INT NOT NULL, 
 cnt BIGINT NOT NULL,
 PRIMARY KEY (active, catid) 
) ENGINE=InnoDB;

mysql> INSERT INTO page_active_category
 SELECT  e.active, c.catid, COUNT(*)
 FROM    content_en AS e
 JOIN    content_page_categories AS c ON c.itemid = e.id
 GROUP BY e.active, c.catid

mysql> SELECT cnt FROM page_active_category WHERE active = 1 AND catid = 1

This executed in 0.0007 - 0.0017 seconds. So this is the best solution by an order of magnitude, if you can maintain the table with aggregate counts.

You can see from this that different types of denormalization (including a summary table) is an extremely powerful tool for the sake of performance, though it has drawbacks because maintaining the redundant data can be inconvenient and makes your application more complex.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
5

There are too many records to count.

If you want a faster solution, you'll have to store aggregate data.

MySQL does not support materialized views (or indexed views in SQL Server's terms) so you would need to create and maintain them yourself.

Create a table:

CREATE TABLE
        page_active_category
        (
        active INT NOT NULL,
        catid INT NOT NULL,
        cnt BIGINT NOT NULL,
        PRIMARY KEY
                (active, catid)
        ) ENGINE=InnoDB;

then populate it:

INSERT
INTO    page_active_category
SELECT  active, catid, COUNT(*)
FROM    content_en
JOIN    content_page_categories
ON      itemid = id
GROUP BY
        active, catid

Now, each time you insert, delete or update a record in either content_en or content_page_categories, you should update the appropriate record in page_active_category.

This is doable with two simple triggers on both content_en and content_page_categories.

This way, your original query may be rewritten as mere:

SELECT  cnt
FROM    page_active_category
WHERE   active = 1
        AND catid = 1

which is a single primary key lookup and hence instant.

Quassnoi
  • 413,100
  • 91
  • 616
  • 614
  • I think that the way to go is the messy one (code-wise), that is the summary tables. I always hated summary tables cause they have a tendency to increase the number of bugs in the application. Thing is ,like i stated in my question, i have tried out every other solution - proposal stated in here and failed. 200ms for a single count is unacceptable, therefore the summary tables seem like a one way road. Thank you. – mbouclas Jul 10 '13 at 07:28
  • @mbouclas: in SQL Server and Oracle, you could have them auto-populated and auto-updated, just like indexes. MySQL needs such a feature badly too. – Quassnoi Jul 10 '13 at 08:53
1

The problem is the "active" column in content_en. Obviously, if you just needed to know how many content records were related to a particular category (active or not) all you would have to do is:

SELECT count(1)
FROM content_page_categories
WHERE catid = 1;

Having to join back to every content_en record just to read the "active" flag is really what is slowing this query down.

I recommend adding "active" to content_page_categories and making it a copy of the related value in content_en... you can keep this column up to date using triggers or in your code. Then you can change the combo index to be:

KEY combo (catid,active,itemid)

and rewrite your query to:

SELECT count(1)
FROM content_page_categories USE INDEX (combo)
WHERE catid = 1 AND active = 1;

Also, you may have much better luck using InnoDB tables instead of MyISAM. Just be sure to tune your InnoDB settings: http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/

bobwienholt
  • 17,420
  • 3
  • 40
  • 48
0

For me with your data as setup, I was getting the join query taking ~ 50x times longer than just selecting from the content_page_categories.

I was able to achieve performance about 10x slower than just selecting from the categories table by doing the following with your data:

I used straight_join

    SELECT straight_join count(*) 
    FROM content_en 
    INNER JOIN content_page_categories use index (combo) 
     ON (id = itemid) 
    WHERE catid = 1 AND active = 1 ;

and the following table structure (slightly modified):

 CREATE TABLE `content_en` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `title` varchar(100) DEFAULT NULL,
 `uid` int(11) DEFAULT NULL,
 `date_added` int(11) DEFAULT NULL,
 `date_modified` int(11) DEFAULT NULL,
 `active` tinyint(1) DEFAULT NULL,
 `comment_count` int(6) DEFAULT NULL,
 `orderby` tinyint(4) DEFAULT NULL,
 `settings` text,
 `permalink` varchar(255) DEFAULT NULL,
 `code` varchar(3) DEFAULT NULL,
 PRIMARY KEY (`id`),
 UNIQUE KEY `id` (`id`),
 KEY `test_con_1` (`active`) USING HASH,
 KEY `combo` (`id`,`active`) USING HASH
 ENGINE=MyISAM AUTO_INCREMENT=127126 DEFAULT CHARSET=utf8

And:

CREATE TABLE `content_page_categories` (
`catid` int(11) unsigned NOT NULL,
`itemid` int(10) unsigned NOT NULL,
`main` tinyint(1) DEFAULT NULL,
KEY `itemid` (`itemid`),
KEY `catid` (`catid`),
KEY `test_cat_1` (`catid`) USING HASH,
KEY `test_cat_2` (`itemid`) USING HASH,
KEY `combo` (`itemid`,`catid`) USING HASH
ENGINE=MyISAM DEFAULT CHARSET=utf8

To achieve better than this I think you will need a view, a flattened structure, or another type of look up field (as in the trigger to populate a row in the other table as discussed by another poster).

EDIT:

I should also point to this decent post on why/when to be careful with Straight_Join: When to use STRAIGHT_JOIN with MySQL

If you use it, use it responsibly!

Community
  • 1
  • 1
Matthew
  • 9,851
  • 4
  • 46
  • 77
0

to speed up counting on mysql joins use subqueries.

For example getting cities with placeCount

city table

id title ......

place table

id city_id title .....

SELECT city.title,subq.count as placeCount
FROM city
       left join (
         select city_id,count(*) as count from place
         group by city_id
  ) subq
on city.id=subq.city_id
Fuad All
  • 871
  • 11
  • 13