How to redefine slow SQL query with COUNT() of large dataset in MySQL

Question

I have a SQL query like this and I would need to redefine it or using indexes would help I believe, but I do not know which columns include into indexes.

b_answers has approx. tens of thousand of rows
b_projects has approx. thousands of rows
b_users has tens of rows

These AS count_* columns are needed for sorting.

SELECT
    p.id,
    p.datetime,
    u.name AS u_name,
    p.name,
    p.note,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND  changed != '0000-00-00 00:00:00') AS count_filled,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND started = '1') AS count_started,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id) AS count_sent,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '1' AND  changed != '0000-00-00 00:00:00') AS count_filled_quiz1_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '2' AND  changed != '0000-00-00 00:00:00') AS count_filled_quiz1_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '1' AND started = '1') AS count_started_quiz1_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '2' AND started = '1') AS count_started_quiz1_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '1') AS count_sent_quiz1_a
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '2') AS count_sent_quiz1_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '3' AND  changed != '0000-00-00 00:00:00') AS count_filled_quiz3_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '4' AND  changed != '0000-00-00 00:00:00') AS count_filled_quiz3_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '3' AND started = '1') AS count_started_quiz3_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '4' AND started = '1') AS count_started_quiz3_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '3') AS count_sent_quiz3_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '4') AS count_sent_quiz3_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '5' AND  changed != '0000-00-00 00:00:00') AS count_filled_quiz5_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '6' AND  changed != '0000-00-00 00:00:00') AS count_filled_quiz5_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '5' AND started = '1') AS count_started_quiz5_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '6' AND started = '1') AS count_started_quiz5_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '5') AS count_sent_quiz5_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '6') AS count_sent_quiz5_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '7' AND  changed != '0000-00-00 00:00:00') AS count_filled_quiz7_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '8' AND  changed != '0000-00-00 00:00:00') AS count_filled_quiz7_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '7' AND started = '1') AS count_started_quiz7_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '8' AND started = '1') AS count_started_quiz7_b,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '7') AS count_sent_quiz7_a,
    (SELECT COUNT(id) FROM b_answers WHERE project = t.id AND quiz = '8') AS count_sent_quiz7_b
FROM 
    b_projects p
LEFT JOIN 
    b_answers a ON a.project = p.id
LEFT JOIN 
    b_users u ON u.id = p.admin
GROUP BY
    p.nazev

Which RDBMS is this for? Please add a tag to specify whether you're using `mysql`, `postgresql`, `sql-server`, `oracle` or `db2` - or something else entirely. — marc_s, May 22 '20 at 17:07

score 2 · Answer 1 · answered May 22 '20 at 17:14

2

Use conditional aggregation! The idea is:

SELECT p.id, p.datetime, u.name AS u_name, p.name, p.note,
       SUM(a.changed <> '0000-00-00 00:00:00') AS count_filled,
       SUM(a.started = '1') AS count_started,
       . . .   -- and so one for the rest of the columns
FROM b_projects p LEFT JOIN 
     b_answers a 
     ON a.project = p.id LEFT JOIN 
     b_users u
     ON u.id = p.admin
GROUP BY p.id, p.datetime, u.name, p.name, p.note

answered May 22 '20 at 17:14

Gordon Linoff

1,242,037
58
646
786

Caution -- The JOINs may inflate the SUMs. – Rick James May 23 '20 at 04:48
@RickJames . . . Although possible that seems unlikely. The `join` from `b_projects` to `b_users` would appear to be to (at most) one user. And it is a fair guess that `p_projects.id` is the primary key in that table. – Gordon Linoff May 23 '20 at 20:39
Well, this works, but it is still very slow. Query takes tens of seconds, with the half of data set, for the first time. Next time it is fast, but it is caused by cache I believe. – sylar32 May 25 '20 at 10:31
Now, I've added indexes to columns `a.changed`, `a.started`, `a.project`, `a.quiz`, `p.admin` and now it takes approx 2 sec, but still I would like to increase the speed. I've read the article linked below, about the indexes, but I now I have no idea how to create indexes differently. – sylar32 May 25 '20 at 11:15
@sylar32 . . . I am guessing you have a lot of data -- and the `group by` is going to eat up the time. I don't see how those indexes would help. Indexes on `a(pid) and `u(admin)` might help. You can add additional columns based on what is referenced in the `SELECT`. – Gordon Linoff May 25 '20 at 21:33

score 0 · Answer 2 · answered May 22 '20 at 17:19

Just not to reinvent wheel !!! please see this. What columns generally make good indexes?

Indexes are generally required in columns used to compare (conditions). so in your case, I think indexes which can be used to improve COST would include columns in this part of the query.

LEFT JOIN b_answers a ON a.project = p.id /* consider to use index */

LEFT JOIN b_users u ON u.id = p.admin /* consider to use index */

GROUP BY p.nazev /* consider to use index - STILL UNSURE*/

you can check the effects of the query with trial and error.

Hope this helps.

Cheers

score 0 · Answer 3 · answered May 23 '20 at 04:58

To get the query right, you need to take into consideration whether the tables are in a 1:many mapping. If so, do you want the count to be "many" or just "1"

I will assume you don't want inflated values, so I will get the counts in a derived table first, then join to the other table(s):

SELECT  p.id, p.datetime, u.name AS u_name, p.name, p.note,
        count_filled, count_started, ...
    FROM  
        ( SELECT
                SUM(a.changed <> '0000-00-00 00:00:00') AS count_filled,
                SUM(a.started = '1') AS count_started,
                ...
            FROM  b_answers AS a 
        ) AS aa
    JOIN  b_projects p  ON aa.project = p.id
    LEFT JOIN  b_users u  ON u.id = p.admin

Note that this avoids the GROUP BY, thereby providing one speedup. The other is avoiding the "inflate-deflate".

Assuming id is the PRIMARY KEY of each table, no extra indexes are needed.

How to redefine slow SQL query with COUNT() of large dataset in MySQL

3 Answers3