This question is a more specific version of a previous question I asked
Table
CREATE TABLE Test4_ClusterMatches
(
`match_index` INT UNSIGNED,
`cluster_index` INT UNSIGNED,
`id` INT NOT NULL AUTO_INCREMENT,
`tfidf` FLOAT,
PRIMARY KEY (`cluster_index`,`match_index`,`id`)
);
The query I want to run
mysql> explain SELECT `match_index`, SUM(`tfidf`) AS total
FROM Test4_ClusterMatches WHERE `cluster_index` IN (1,2,3 ... 3000)
GROUP BY `match_index`;
The Problem with the query
It uses temporary and filesort so its to slow
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| 1 | SIMPLE | Test4_ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 51540 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
With the current indexing the query would need to sort by cluster_index first to eliminate the use of temporary and filesort, but doing so gives the wrong results for sum(tfidf).
Changing the primary key to
PRIMARY KEY (`match_index`,`cluster_index`,`id`)
Doesn't use file sort or temp tables but it uses 14,932,441 rows so it is also to slow
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
| 1 | SIMPLE | Test5_ClusterMatches | index | NULL | PRIMARY | 16 | NULL | 14932441 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
Tight Index Scan
Using tight index scan by running the search for just one index
mysql> explain SELECT match_index
, SUM(tfidf
) AS total
FROM Test4_ClusterMatches WHERE cluster_index
=3000
GROUP BY match_index
;
Eliminates the temporary tables and filesort.
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
| 1 | SIMPLE | Test4_ClusterMatches | ref | PRIMARY | PRIMARY | 4 | const | 27 | Using where; Using index |
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
I'm not sure if this can be exploited with some magic sql-fu that I haven't come across yet?
Question
How can I change my query so that it use 3,000 cluster_indexes, avoids using temporary and filesort without it needing to use 14,932,441 rows?
Update
Using the table
CREATE TABLE Test6_ClusterMatches
(
match_index INT UNSIGNED,
cluster_index INT UNSIGNED,
id INT NOT NULL AUTO_INCREMENT,
tfidf FLOAT,
PRIMARY KEY (id),
UNIQUE KEY(cluster_index,match_index)
);
The query below then gives 10 rows in set (0.41 sec) :)
SELECT `match_index`, SUM(`tfidf`) AS total FROM Test6_ClusterMatches WHERE
`cluster_index` IN (.....)
GROUP BY `match_index` ORDER BY total DESC LIMIT 0,10;
but its using temporary and filesort
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
| 1 | SIMPLE | Test6_ClusterMatches | range | cluster_index | cluster_index | 5 | NULL | 78663 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
I'm wondering if theres anyway to get it faster by eliminating the using temporary and using filesort?