You should try a covering index like
CREATE INDEX commits_proj_comm_cndx ON commits(committer_id, project_id);
Why
Your query:
- selects
project_id
out of the table
- chooses using
committer_id
- groups using
project_id
By creating an index where your constant WHERE
clauses (here, committer_id
) are in the first positions, you allow MySQL to quickly zero on those records which will be involved, before accessing the table. Once it has these records, MySQL can start work. But if in the index there is also the information used for grouping, i.e., project_id
, MySQL can start grouping data even before accessing the table. Finally, if all the other information is also present on the index (here this is already done, for it is the same information used by the WHERE
), MySQL has no need to access the table at all. An index supplying this information in this order is a covering index for this query.
Of course this is the more advantageous, the more the data you use is small in respect to the full table row; clearly, if you have a 100 GB table and a 75 GB index, the performance gain is small. If you have a 100 GB table and a 1 GB index, you're in for a big win. Especially if the query on the 1 GB index has a low cardinality (e.g. the committer is only responsible for 1% of the data). Then you're reading 10 MB of data instead of 100 GB, and you won't believe the performance increase.
Once you have the index, you do:
SELECT project_id , COUNT(1) AS commit_count
FROM commits
WHERE committer_id = 1891264
GROUP BY project_id
This ought to run on the index only.
I said, try, because maintaining indexes has a cost too. It might well be the case that you speed up this SELECT query, at the expense of slowing down INSERTs and UPDATEs, because they need to manage the index as well as the table.
By the way, the COUNT(1) is a personal taste - many regard asterisks in queries as an antipattern, and by using COUNT(1), the query will not show in a grep
search, allowing to focus on queries with significant asterisks.
Also, when you do tests like this, remember to do1:
- test on slow_id
- test on different_id
- test again on slow_id with a slightly different query (e.g. adding an AS.. alias)
The reason for the third test is that when running the first query MySQL will also load the data in the pool (if using InnoDB, which you don't) and in RAM as well, so that the second query might be faster because pool, memory and any I/O cache have been primed. If so, the third query will also run faster and - provided it is then comparable to the second query of course - the "this query is slow" problem will be revealed as an artifact.
But since queries are cached, you want the same data to be fetched again, not the result being pulled from the cache. So the third query will need to be slightly different to throw the cache off. Otherwise you'd see what seems to be a very fast query, when it actually is anything but.
(1) this is if you run a quick test. Otherwise performance testing is way more complicated than this; the Percona blog has several articles on this topic.