0

Given the following query, how can I optimize it so that the subqueries are not dependent?

SELECT DISTINCT
    inst.id, inst.name, inst.state, inst.farm_status,
    (SELECT COUNT(inst_note.id) 
        FROM project_institution_note AS inst_note
        WHERE inst_note.institution_id = inst.id) AS inst_note_count,
    (SELECT COUNT(c.id) FROM project_catalog AS c
        WHERE c.institution_id = inst.id 
        AND c.status = 0 
        AND c.catalog_type BETWEEN 0 AND 1) AS ug_count,
    (SELECT COUNT(c.id) FROM project_catalog AS c
        WHERE c.institution_id = inst.id 
        AND c.status = 0 
        AND c.catalog_type BETWEEN 1 AND 2) AS grad_count,
    (SELECT COUNT(c.id) FROM project_catalog AS c
        WHERE c.institution_id = inst.id 
        AND c.status = 0 AND c.catalog_type >= 3) AS alt_count,
    (SELECT COUNT(c.id) FROM project_catalog_note AS cn
        INNER JOIN farmtool_catalog AS c
        ON c.id = cn.catalog_id
        WHERE c.institution_id = inst.id) AS catalog_note_count,
    (SELECT inst_note.text FROM project_institution_note AS inst_note
        LEFT JOIN project_institution AS inst
        ON inst_note.institution_id = inst.id
        WHERE inst_note.institution_id = inst.id
        ORDER BY inst_note.date DESC
        LIMIT 1) AS latest_note
FROM project_institution AS inst
LEFT JOIN project_institution_note AS inst_note
ON inst.id = inst_note.institution_id
LEFT JOIN project_catalog AS c
ON inst.id = c.institution_id
WHERE LOWER(inst.state) = "me";

I've tried refactoring the first subquery into an INNER JOIN like so:

INNER JOIN (SELECT COUNT(inst_note.id) 
        FROM project_institution_note AS inst_note
        GROUP BY inst_note.institution_id) inst_note_count 
        ON inst_note.institution_id = inst.id

and included it after the last LEFT JOIN operation, but returned an empty result.

Of particular interest for me is optimizing the second and third subqueries that calcuate ug_count and grad_count. The only difference between the two is the first is dependent on a field value between (0, 1), and the second between (1, 2).

Right now, this query runs fine, and is in a low use scenario. Still, its obviously pretty inefficient, so I'd like to optimize if possible.

Jason
  • 11,263
  • 21
  • 87
  • 181
  • i would use temporary tables. it keeps things clear and avoid all of these subqueries – Alee Dec 02 '16 at 13:14
  • Do you mean views? – Jason Dec 02 '16 at 13:14
  • no views are different from temporary table. http://stackoverflow.com/questions/16897323/what-to-use-view-or-temporary-table. i'll try to post an answer with some examples. – Alee Dec 02 '16 at 13:20

2 Answers2

0

This should hopefully get you part of the way.

SELECT 
    inst.id, inst.name, inst.state, inst.farm_status,
    COUNT(DISTINCT inst_note.id)  AS inst_note_count,
    SUM(CASE WHEN c.status = 0 and c.catalog_type BETWEEN 0 AND 1 THEN 1 ELSE 0 END) AS ug_count,
    SUM(CASE WHEN c.status = 0 and c.catalog_type BETWEEN 1 AND 2 THEN 1 ELSE 0 END) AS grad_count,
    SUM(CASE WHEN c.status = 0 and c.catalog_type >= 3 THEN 1 ELSE 0 END) AS alt_count,
    COUNT(DISTINCT cn.id) AS catalog_note_count,
    (SELECT inst_note.text FROM project_institution_note AS inst_note
        LEFT JOIN project_institution AS inst
        ON inst_note.institution_id = inst.id
        WHERE inst_note.institution_id = inst.id
        ORDER BY inst_note.date DESC
        LIMIT 1) AS latest_note
FROM project_institution AS inst
LEFT JOIN project_institution_note AS inst_note ON inst.id = inst_note.institution_id
LEFT JOIN project_catalog AS c ON inst.id = c.institution_id
LEFT JOIN farmtool_catalog AS fc ON fc.institution_id = inst.id
LEFT JOIN project_catalog_note AS cn ON fc.id = cn.catalog_id
WHERE LOWER(inst.state) = "me"
GROUP BY inst.id, inst.name, inst.state, inst.farm_status;

Not sure if you can do anything about getting the latest note. In Sql-Server I would use a cte and a window function, but these things are not available in MySql. I hope this is useful anyway.

James Casey
  • 2,447
  • 1
  • 11
  • 19
0

Perhaps the biggest performance killer is

WHERE LOWER(inst.state) = "me";

Make the COLLATION for state be one of the ..._ci collations (which it probably already is), then change to simply

WHERE inst.state = "me";

And be sure to have

INDEX(state)

Meanwhile, don't do this:

JOIN ... ON inst_note.institution_id = inst.id
      WHERE inst_note.institution_id = inst.id

The ON and WHERE are redundant and do exactly the same thing. Since that is (I assume) how the tables are linked; keep the ON. But...

That was really LEFT JOIN..., which means "keep the 'right' table even if there is no matching row. But, then the WHERE will fail. So... Make it JOIN and get rid of the WHERE clause.

In general, don't use LEFT unless you need it.

For the original attempt, this (and some other 'composite' indexes) would be helpful:

project_catalog: INDEX(institution_id, status, catalog_type)

A potential performance problem with JOIN plus GROUP BY is an explosion, then implosion of rows being handled. To avoid it, an improvement on Casey's suggestion is to have a derived table to calculate the various SUMs, then JOIN back to inst, now without the GROUP BY. This avoids hauling around the 4 columns in the GROUP BY (and perhaps some other benefits).

Rick James
  • 135,179
  • 13
  • 127
  • 222