
About tableA

  • On MySQL 5.7.38 (AWS RDS)
  • Contains over 6,000,000 rows
  • Column `last_activity_at` is indexed and has the DATETIME type
  • Column `project_uid` is indexed and is part of a foreign key

I'm performing the following SQL query:

SELECT  project_uid, COUNT(uid) AS "count"
    FROM  tableA
    WHERE  last_activity_at >= "2023-05-01 00:00:00"
      AND  last_activity_at <= "2023-05-31 23:59:59"
    GROUP BY  project_uid

Here is the result (timing only):

117 rows in set (5 min 49,69 sec)

Here is the EXPLAIN results of the query above

+----+-------------+--------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+---------+----------+-------------+
| id | select_type | table  | partitions | type  | possible_keys                                                                                                               | key                            | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+--------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | tableA | NULL       | index | IDX_2dd9918084b7ec339b29123b7e,IDX_acd46110be997a5678aa809f1c,FK_c0bc0d43ba64ca68b4fc7b7a095,IDX_8d2721775c4645ba9c620ec9fa | FK_c0bc0d43ba64ca68b4fc7b7a095 | 147     | NULL | 6235932 |    50.00 | Using where |
+----+-------------+--------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+---------+----------+-------------+

Here is the result without the GROUP BY (which is still slow for a simple WHERE query):

1376493 rows in set (31,74 sec)

Here is the EXPLAIN of the query without GROUP BY

+----+-------------+--------+------------+------+--------------------------------+------+---------+------+---------+----------+-------------+
| id | select_type | table  | partitions | type | possible_keys                  | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+--------+------------+------+--------------------------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | tableA | NULL       | ALL  | IDX_8d2721775c4645ba9c620ec9fa | NULL | NULL    | NULL | 6237888 |    50.00 | Using where |
+----+-------------+--------+------------+------+--------------------------------+------+---------+------+---------+----------+-------------+

I'm trying to understand two things here:

  • Why does adding a GROUP BY impact performance that much, even with a WHERE clause that is supposed to reduce the number of rows to group?
  • Why does the query without the GROUP BY still take 30 seconds on a 6M-row table? Maybe it's just me, but 6M rows is not that much for a database.

Thanks for your help. If you have any optimization recommendations, I'll be glad to read them!

asked by Dash (edited by Bill Karwin)
  • Is the EXPLAIN the same with and without the GROUP BY? – Simon Goater May 30 '23 at 11:23
  • @SimonGoater just edited with the explain, seems different. – Dash May 30 '23 at 11:34
  • 1
    Create composite index by `(last_activity_at, project_uid)`. Use DISTINCT instead of GROUP BY. – Akina May 30 '23 at 11:40
  • 1
    The thing is, i need to do a count ;-) edited the original query with the count – Dash May 30 '23 at 12:50
  • 1
    The proposed index should help with (i) where clause (ii) the data inside leaves (project_uid) could be used for grouping; MySQL won't have to touch the actual table for the query. – Salman A May 30 '23 at 13:01
  • I'm not in general a fan of creating indexes just to persuade the db engine to do a query a particular way. Maybe you can just try different variations of the query like for example SELECT a.project_uid, COUNT(a.project_uid) AS "count" FROM (SELECT project_uid FROM tableA WHERE last_activity_at BETWEEN "2023-05-01 00:00:00" AND "2023-05-31 23:59:59") a GROUP BY a.project_uid; – Simon Goater May 30 '23 at 13:22
  • Yeah, I tried that and still had the same or even poorer results :-( It seems grouping 1M rows doesn't perform well. I also tried adding the composite index, without any improvement :-( – Dash May 30 '23 at 13:45
  • 5 minutes to do a GROUP BY on 1.37M rows does sound suspiciously slow. Can you give MySQL more memory? – Simon Goater May 30 '23 at 14:19
  • 1
    It currently has 2GiB RAM, shouldn't it be enough ? – Dash May 30 '23 at 14:30
  • Hmm, I would have thought so. Sorry, out of ideas. – Simon Goater May 30 '23 at 15:09
  • 2
    Please provide `SHOW CREATE TABLE` so that we can get a clue of what `FK_c0bc0d43ba64ca68b4fc7b7a095` is. – Rick James May 30 '23 at 15:26
  • Did you mean `COUNT(uid)` or `COUNT(project_uid)` or `COUNT(*)` -- If both columns exist, it makes a big difference. If the count'd column is never NULL, you are better off with `COUNT(*)`. – Rick James May 30 '23 at 15:29
  • Which Engine are you using? What is the value of `innodb_buffer_pool_size`? How much disk space does the table take? – Rick James May 30 '23 at 15:30
  • "part of a Foreign Key" -- Which part? If it is the "wrong" part, it is useless. Again, please provide `SHOW CREATE TABLE`. – Rick James May 30 '23 at 15:32

1 Answer


Assuming your query contains correctly named columns, this index should help your query.

CREATE INDEX lastact_puid_uid 
      ON tableA (last_activity_at, project_uid, uid);

Why does this help? The first column of the index satisfies the date-range WHERE clause, and the other two columns make the index covering. That means the query can be answered entirely by a range scan on the index, without touching the table rows.
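As a sketch (assuming the index name above), you can verify this by re-running EXPLAIN after creating the index; the plan should change from an index/full scan to a range scan:

```sql
-- Hypothetical check: with the covering index in place, we'd expect
-- type = range, key = lastact_puid_uid, and "Using index" in Extra
-- (meaning no table rows are read, only index entries).
EXPLAIN
SELECT  project_uid, COUNT(uid) AS "count"
    FROM  tableA
    WHERE  last_activity_at >= '2023-05-01 00:00:00'
      AND  last_activity_at <= '2023-05-31 23:59:59'
    GROUP BY  project_uid;
```

If Extra still shows the query reading the table ("Using where" without "Using index"), the index isn't covering the query as intended.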

Why doesn't it help much? Because COUNT() needs to look at every row in range. And there are many of them. Counterintuitively, COUNT() is a slow operation. That's because it promises a completely accurate result even on a busy table with other clients doing various transactions.
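Related to this, and as Rick James points out in the comments: if `uid` is never NULL, `COUNT(*)` returns the same result as `COUNT(uid)` and spares the server from checking a specific column for NULLs on every row. A sketch, assuming `uid` is declared NOT NULL:

```sql
-- Assumes uid is NOT NULL, so COUNT(*) is equivalent to COUNT(uid);
-- COUNT(*) lets MySQL count index entries directly.
SELECT  project_uid, COUNT(*) AS "count"
    FROM  tableA
    WHERE  last_activity_at >= '2023-05-01 00:00:00'
      AND  last_activity_at <= '2023-05-31 23:59:59'
    GROUP BY  project_uid;
```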

answered by O. Jones
  • If we put `last_activity_at` first, then I'm afraid a filesort has to be done to group `project_uid` together. Depending on the actual number of rows matching the WHERE condition, if the row count is huge but the number of distinct `project_uid` values isn't high, could it be better to put `project_uid` before `last_activity_at`? – blabla_bingo May 31 '23 at 07:11
  • 1
    Please be aware that "filesort" is a misleading name. It doesn't actually use file-system files to sort unless the sort step involves so much data that it must spill from RAM. With a small `project_uid` cardinality the filesort won't be expensive. At any rate the only way to know which index works better is to try one, then the other. – O. Jones May 31 '23 at 11:00
  • Yes, I totally agree. To be more precise, it should be "sort". And regarding indexing, low selectivity could take a toll on an obvious index choice, which might end up making a seemingly suboptimal one we originally ruled out the better pick. Since the OP is on 5.7, it's harder to find the cost at each stage without `EXPLAIN ANALYZE`. But the slow log is still handy for finding the response time and rows examined. The way to reach the optimum is to try. Thanks anyway. – blabla_bingo Jun 01 '23 at 02:09