Very slow GROUP BY and ORDER BY

Question

I am trying to create correct indexes, but there is a problem

Here is the table structure:

CREATE TABLE mail (
    id            int                                                           NOT NULL,
    account_id    int                                                               NULL DEFAULT NULL,
    folder_path   varchar(150) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
    mail_id       int                                                               NULL DEFAULT NULL,
    user_id       int                                                               NULL DEFAULT NULL COMMENT 'текущий менеджер',
    ticket_number varchar(20)  CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
    from_user_id  int                                                               NULL DEFAULT NULL,
    to_address    longtext     CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    reply_to      longtext     CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    bcc           longtext     CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    cc            longtext     CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    subject       varchar(350) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci     NULL DEFAULT NULL,
    read_status   tinyint(1)                                                    NOT NULL DEFAULT '0',
    created_at    timestamp                                                     NOT NULL,
    updated_at    timestamp                                                         NULL DEFAULT NULL

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;
    
ALTER TABLE mail

    ADD KEY from_address      ( from_address(255) ),
    ADD KEY to_address        ( to_address(255) ),
    ADD KEY subject           ( subject ),
    ADD KEY id                ( id ) USING BTREE,
    ADD KEY user_id           ( user_id ),
    ADD KEY created_at        ( created_at ),
    ADD KEY folder_path       ( folder_path, ticket_number, id, created_at ) USING BTREE,
    ADD KEY folder_mail_count ( folder_path, read_status ),
    ADD KEY ticket_number     ( ticket_number, folder_path, id ) USING BTREE;
    
ALTER TABLE mail
    MODIFY id int NOT NULL AUTO_INCREMENT;

With next query, I get a list of emails grouped by ticket and sorted by date created_at:

SELECT
    m1.*,
    m1.mail_at AS receivedAt
FROM
    mail AS m1
    LEFT JOIN mail AS m2 ON
        m1.ticket_number = m2.ticket_number
        AND
        m1.folder_path = m2.folder_path
        AND
        m1.id < m2.id
WHERE
    ( m1.folder_path = 'INBOX' )
    AND
    m2.id IS NULL
GROUP BY
    m1.ticket_number
ORDER BY
    m1.created_at DESC
    LIMIT 20 OFFSET 0;

There are about 200k records in the table. The execution time is 1.3 - 1.5 seconds

+----+-------------+-------+------------+------+------------------------------------------------+-------------+---------+-----------------------------------------------------------+-------+----------+--------------------------------------+
| id | select_type | table | partitions | type | possible_keys                                  | key         | key_len | ref                                                       | rows  | filtered | Extra                                |
+----+-------------+-------+------------+------+------------------------------------------------+-------------+---------+-----------------------------------------------------------+-------+----------+--------------------------------------+
|  1 | SIMPLE      | m1    | NULL       | ref  | folder_path,folder_mail_count,ticket_number    | folder_path | 602     | const                                                     | 47494 |   100.00 | Using temporary; Using filesort      |
|  1 | SIMPLE      | m2    | NULL       | ref  | id,folder_path,folder_mail_count,ticket_number | folder_path | 684     | sandbox_busf.m1.folder_path,sandbox_busf.m1.ticket_number |     1 |    10.00 | Using where; Not exists; Using index |
+----+-------------+-------+------------+------+------------------------------------------------+-------------+---------+-----------------------------------------------------------+-------+----------+--------------------------------------+

If I remove ORDER BY, the query is executed in 0.004 sec.

+----+-------------+-------+------------+------+------------------------------------------------+-------------+---------+-----------------------------------------------------------+-------+----------+--------------------------------------+
| id | select_type | table | partitions | type | possible_keys                                  | key         | key_len | ref                                                       | rows  | filtered | Extra                                |
+----+-------------+-------+------------+------+------------------------------------------------+-------------+---------+-----------------------------------------------------------+-------+----------+--------------------------------------+
|  1 | SIMPLE      | m1    | NULL       | ref  | folder_path,folder_mail_count,ticket_number    | folder_path | 602     | const                                                     | 47494 |   100.00 | NULL                                 |
|  1 | SIMPLE      | m2    | NULL       | ref  | id,folder_path,folder_mail_count,ticket_number | folder_path | 684     | sandbox_busf.m1.folder_path,sandbox_busf.m1.ticket_number |     1 |    10.00 | Using where; Not exists; Using index |
+----+-------------+-------+------------+------+------------------------------------------------+-------------+---------+-----------------------------------------------------------+-------+----------+--------------------------------------+

Please help me create correct indexes.

BTW, you have `DEFAULT CHARSET=utf8mb3;` but you're explicitly setting `CHARACTER SET utf8mb4` on every textual column - so you can simplify your `CREATE TABLE` statement by changing the `DEFAULT CHARSET` and removing the per-column annotations. — Dai, Dec 10 '22 at 03:33
The query is being _slightly_ slowed by the use of `m1.*` - do you **really** need every column in the output? — Dai, Dec 10 '22 at 03:34
I'm unfamiliar with MySQL's `KEY` syntax - but from here it looks like your table doesn't have any `PRIMARY KEY` defined... — Dai, Dec 10 '22 at 03:35
...also, your indexes' names should be distinct from your column names, otherwise it's hard to read `possible_keys`. — Dai, Dec 10 '22 at 03:35
@Dai thanks for your comment. I will definitely fix it, but it won't speed up the query much. About m1.*. You are right, of all the fields I don't need only account_id, reply_to, bcc, cc, updated_at — Sergey Vorobev, Dec 10 '22 at 03:41
@Dai, FYI MySQL has a weird behavior that if you defined columns with an override character set, then you change the table's default character set to match, it _doesn't_ remove the override from each column. The metadata still has the specific character set on each column, even though it's the same as the table's default. It's weird, but harmless. — Bill Karwin, Dec 10 '22 at 03:45
Only use keys for what you're sorting or joining on. Why do you need to sort on folder path as well as ticket number? Can you only sort on ticket number? Can you make it unique regardeless of folder path Grouping by a long string like folder path will slow things down. You shouldn't have combined keys named after only one field. In fact, get rid of keys altogether and create indexes instead. Not sure that makes a difference, but I think it's more standard. The key should probably only be id. Make an index for ticket, another one for path, and a combined index of ticket and path maybe. — Jack BeNimble, Dec 10 '22 at 03:45
@JackBeNimble In MySQL, a `KEY` **is** an `INDEX`. [It's a (confusing and _dumb_) keyword synonym](https://stackoverflow.com/questions/1401572/whats-the-difference-between-using-index-vs-key-in-mysql). — Dai, Dec 10 '22 at 03:47
Right, that's why I said not sure it makes a difference. But good to know. — Jack BeNimble, Dec 10 '22 at 03:48
The query is malformed but MySQL unfortunately runs it. Columns not present in the `GROUP BY` clause should be aggregated in the select list. As it is it will produce unpredictable results. Voting to close. — The Impaler, Dec 10 '22 at 05:25

score 2 · Answer 1 · answered Dec 10 '22 at 07:14

      AND  m1.folder_path = m2.folder_path
      AND  m1.id < m2.id

smells like "group-wise" max as mentioned in the reference manual. It is terribly inefficient. Better choices are in Groupwise-Max

id feels like a PRIMARY KEY, but it is not declared as such. Every table needs a PK.

It's not just "removing the ORDER BY", it is also leaving behind the "LIMIT".

With just a LIMIT, any 20 rows can be delivered; adding in the ORDER BY, requires sorting all the data found by the GROUP BY, and only then peel off 20 rows.

Very slow GROUP BY and ORDER BY

1 Answers1