2

Table Schema

database engine is MyISAM

commits (about 500,000,000 rows)
id (int) - PK (indexed)
project_id (int) - FK [project->id] (indexed)
committer_id - FK [user->id] (indexed)
...

projects (about 32,000,000 rows)
id (int) - PK (indexed)
...

user (about 12,000,000 rows)
id - PK (indexed)
...


Explain

SELECT COUNT(*) 
FROM commits 
WHERE committer_id = 30351173

above query is complete in 10 sec.

SELECT project_id , COUNT(*) as commit_count
FROM commits
WHERE committer_id = 30351173
GROUP BY project_id

but, above query is not complete in 2 hours(7,200 sec)

project_id in commits is complete indexing column, but the speed is too slow.

Why is this happening?

Q) How do I speed up the second query?


Additional

SELECT project_id , COUNT(*) AS commit_count
FROM commits
WHERE committer_id = 1891264
GROUP BY project_id

I query another commiter_id is complete 15 sec.

Additional - 2

EXPLAIN SELECT COUNT(*) FROM commits WHERE committer_id = 30351173

output

enter image description here

EXPLAIN SELECT project_id , COUNT(*) AS commit_count
FROM commits
WHERE committer_id = 30351173
GROUP BY project_id

output

enter image description here

Ethan Choi
  • 2,339
  • 18
  • 28
  • 1
    The really important column is `committer_id`. Is it indexed? You put a condition on in! – juergen d Feb 03 '17 at 08:38
  • Add an index to `committer_id` and it should speed things up. – Tim Biegeleisen Feb 03 '17 at 08:39
  • Additionally, as against `COUNT(*)`, you could as well COUNT the unique identifier of each row, if there exists one - think you have `id`? – Dhruv Saxena Feb 03 '17 at 08:40
  • yes. `commiter_id` is indexed(BTREE) column. foreign key user->id – Ethan Choi Feb 03 '17 at 08:40
  • Put ```EXPLAIN``` before ```SELECT``` and paste here the output. Thank you. – SyncroIT Feb 03 '17 at 08:40
  • Whats is the storage engine InnoDB or MyISAM? COUNT(*) is not optimized for InnoDB – Raymond Nijland Feb 03 '17 at 08:46
  • If you ask a question about speed, then always attach the results of explain of the relevant queries and all indexes of the relevant tables. Until then it is a guesswork. My guess is that you need a composite index on committer_id - project_id fields to speed this up. – Shadow Feb 03 '17 at 08:51
  • Show us the explain for both queries and post the results here? So `EXPLAIN SELECT project_id , COUNT(*) as commit_count FROM commits WHERE committer_id = 30351173 GROUP BY project_id` and `EXPLAIN `SELECT project_id , COUNT(*) AS commit_count FROM commits WHERE committer_id = 1891264 GROUP BY project_id` – Raymond Nijland Feb 03 '17 at 08:53
  • @Syncro `EXPLAIN SELECT COUNT(*) FROM commits WHERE committer_id = 30351173` output 1 SIMPLE commits ref fk_committer_id fk_committer_id 5 const 8454686 100.00 Using index – Ethan Choi Feb 03 '17 at 08:55
  • @Ethan Choi and the EXPLAINS of the GROUP BY queries?. – Raymond Nijland Feb 03 '17 at 09:02
  • Wait, those numbers... are you by any chance modeling GitLab? :-) – LSerni Feb 03 '17 at 09:04
  • @LSerni That's right. I am analyzing Git data :). – Ethan Choi Feb 03 '17 at 09:08

2 Answers2

7

You should try a covering index like

CREATE INDEX commits_proj_comm_cndx ON commits(committer_id, project_id);

Why

Your query:

  • selects project_id out of the table
  • chooses using committer_id
  • groups using project_id

By creating an index where your constant WHERE clauses (here, committer_id) are in the first positions, you allow MySQL to quickly zero on those records which will be involved, before accessing the table. Once it has these records, MySQL can start work. But if in the index there is also the information used for grouping, i.e., project_id, MySQL can start grouping data even before accessing the table. Finally, if all the other information is also present on the index (here this is already done, for it is the same information used by the WHERE), MySQL has no need to access the table at all. An index supplying this information in this order is a covering index for this query.

Of course this is the more advantageous, the more the data you use is small in respect to the full table row; clearly, if you have a 100 GB table and a 75 GB index, the performance gain is small. If you have a 100 GB table and a 1 GB index, you're in for a big win. Especially if the query on the 1 GB index has a low cardinality (e.g. the committer is only responsible for 1% of the data). Then you're reading 10 MB of data instead of 100 GB, and you won't believe the performance increase.

Once you have the index, you do:

SELECT project_id , COUNT(1) AS commit_count
    FROM commits
    WHERE committer_id = 1891264
GROUP BY project_id

This ought to run on the index only.

I said, try, because maintaining indexes has a cost too. It might well be the case that you speed up this SELECT query, at the expense of slowing down INSERTs and UPDATEs, because they need to manage the index as well as the table.

By the way, the COUNT(1) is a personal taste - many regard asterisks in queries as an antipattern, and by using COUNT(1), the query will not show in a grep search, allowing to focus on queries with significant asterisks.


Also, when you do tests like this, remember to do1:

  • test on slow_id
  • test on different_id
  • test again on slow_id with a slightly different query (e.g. adding an AS.. alias)

The reason for the third test is that when running the first query MySQL will also load the data in the pool (if using InnoDB, which you don't) and in RAM as well, so that the second query might be faster because pool, memory and any I/O cache have been primed. If so, the third query will also run faster and - provided it is then comparable to the second query of course - the "this query is slow" problem will be revealed as an artifact.

But since queries are cached, you want the same data to be fetched again, not the result being pulled from the cache. So the third query will need to be slightly different to throw the cache off. Otherwise you'd see what seems to be a very fast query, when it actually is anything but.

(1) this is if you run a quick test. Otherwise performance testing is way more complicated than this; the Percona blog has several articles on this topic.

LSerni
  • 55,617
  • 10
  • 65
  • 107
-1

This is because with

group by

statement

SQL-Server have to go through every sigle row in your Table.

Possibly a index on project_id will solve the problem.

alter table commits add index (committer_id);

  • 2
    The OP is using MySQL, not SQL Server. Regardless, the `WHERE` clause is the critical piece here most likely, implying an index on the `committer_id` column is what matters. – Tim Biegeleisen Feb 03 '17 at 08:46