Queries for group-wise maximums and minimums provide a good starting point; the tricky part can be figuring out how to efficiently integrate the restriction on IDs.
Correlated Subquery
A correlated subquery is easy to adapt, but can be poor in performance. Without the restriction, the first example in the MySQL manual can be converted by a simple change of names:
SELECT patientId, detailId, `value`
FROM patient_details AS pd
WHERE detailId=(
SELECT MAX(pd2.detailId)
FROM patient_details AS pd2
WHERE pd.patientId = pd2.patientId
)
;
Since the sub-query is the simple maximum from the results, adding a condition to the WHERE
clause will restrict the detailId
s appropriately:
SELECT patientId, detailId, `value`
FROM patient_details AS pd
WHERE detailId=(
SELECT MAX(pd2.detailId)
FROM patient_details AS pd2
WHERE pd.patientId = pd2.patientId
AND pd2.detailId IN (75, 151, 203) -- the added restriction
)
;
Performance-wise, this involves a full table scan for the outer query; applying EXPLAIN
to the query for the sample schema & data gives the following plan:
id |
select_type |
table |
partitions |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
pd |
NULL |
ALL |
NULL |
NULL |
NULL |
NULL |
17 |
100 |
Using where |
2 |
DEPENDENT SUBQUERY |
pd2 |
NULL |
ref |
PRIMARY,details_of_patient |
PRIMARY |
4 |
pd.patientId |
4 |
30 |
Using where; Using index |
Note the primary query joins 100% of the 17 rows to the dependent subquery.
Optimization
The uncorrelated subquery below will suggest an optimization: restricting the range of detailId
. Applying the optimization to the correlated subquery gives:
SELECT patientId, detailId, `value`
FROM patient_details AS pd
WHERE detailId=(
SELECT MAX(pd2.detailId)
FROM patient_details AS pd2
WHERE pd.patientId = pd2.patientId
AND pd2.detailId IN (75, 151, 203)
AND pd2.detailId BETWEEN 75 AND 203 -- optimization
) AND pd.detailId BETWEEN 75 AND 203 -- optimization
;
This cuts the number of rows from the primary query that get joined by 9:
id |
select_type |
table |
partitions |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
pd |
NULL |
ALL |
NULL |
NULL |
NULL |
NULL |
17 |
11.11 |
Using where |
2 |
DEPENDENT SUBQUERY |
pd2 |
NULL |
ref |
PRIMARY,details_of_patient |
PRIMARY |
4 |
so.pd.patientId |
4 |
5.88 |
Using where; Using index |
The optimization similarly effects row filtering in the dependent subquery, but that ultimately won't affect performance as these rows aren't joined to anything following.
Uncorrelated Subquery
The uncorrelated subquery is also easily adapted in the same way: change the names and add the condition to the subquery:
SELECT pd.patientId, pd.detailId, pd.`value`
FROM patient_details AS pd
JOIN (
SELECT patientId, MAX(detailId) AS detailId
FROM patient_details
WHERE detailId IN (75, 151, 203) -- the added restriction
GROUP BY patientId
) AS pd2
ON pd.patientId = pd2.patientId AND pd.detailId = pd2.detailId
;
However, performance is a little worse than the basic correlated subquery:
id |
select_type |
table |
partitions |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
|
NULL |
ALL |
NULL |
NULL |
NULL |
NULL |
5 |
100 |
Using where |
1 |
PRIMARY |
pd |
NULL |
eq_ref |
PRIMARY,details_of_patient |
PRIMARY |
8 |
pd2.patientId,pd2.detailId |
1 |
100 |
NULL |
2 |
DERIVED |
patient_details |
NULL |
index |
PRIMARY,details_of_patient |
details_of_patient |
8 |
NULL |
17 |
30 |
Using where; Using index |
Optimization
The poor performance is largely because of the IN
condition. The plan for the unrestricted query is much more performant:
id |
select_type |
table |
partitions |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
|
NULL |
ALL |
NULL |
NULL |
NULL |
NULL |
5 |
100 |
Using where |
1 |
PRIMARY |
pd |
NULL |
eq_ref |
PRIMARY,details_of_patient |
PRIMARY |
8 |
pd2.patientId,pd2.detailId |
1 |
100 |
NULL |
2 |
DERIVED |
patient_details |
NULL |
range |
PRIMARY,details_of_patient |
PRIMARY |
4 |
NULL |
5 |
100 |
Using index for group-by |
Note that according to this plan, the query uses a range join on pd2
, as opposed to the index scan shown in the plan for the restricted query. This points to a possible optimization: add a range condition on detailId
:
SELECT pd.patientId, pd.detailId, pd.`value`
FROM patient_details AS pd
JOIN (
SELECT patientId, MAX(detailId) AS detailId
FROM patient_details
WHERE detailId IN (75, 151, 203)
AND detailId BETWEEN 75 AND 203 -- the added optimization
GROUP BY patientId
) AS pd2
ON pd.patientId = pd2.patientId AND pd.detailId = pd2.detailId
;
This results in a better plan than the unoptimized query (though still trailing the optimized queries), as it reduces the number of rows in the primary select:
id |
select_type |
table |
partitions |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
|
NULL |
ALL |
NULL |
NULL |
NULL |
NULL |
2 |
100 |
Using where |
1 |
PRIMARY |
pd |
NULL |
eq_ref |
PRIMARY,details_of_patient |
PRIMARY |
8 |
pd2.patientId,pd2.detailId |
1 |
100 |
NULL |
2 |
DERIVED |
patient_details |
NULL |
index |
PRIMARY,details_of_patient |
details_of_patient |
8 |
NULL |
17 |
5.88 |
Using where; Using index |
As of MySQL 8.0, the query can be written with a CTE:
WITH pd2 AS (
SELECT patientId, MAX(detailId) AS detailId
FROM patient_details
WHERE detailId IN (75, 151, 203)
AND detailId BETWEEN 75 AND 203
GROUP BY patientId)
SELECT pd.patientId, pd.detailId, pd.`value`
FROM patient_details AS pd
JOIN pd2
ON pd.patientId = pd2.patientId AND pd.detailId = pd2.detailId
;
LEFT JOIN
The left join is the trickiest to adapt, as naively adding the detailID
restriction will produce incorrect results most of the time. Starting from the documentation example (after renaming):
SELECT pd.patientId, pd.detailId, pd.`value`
FROM patient_details AS pd
LEFT JOIN patient_details AS pd2
ON pd.patientId = pd2.patientId
AND pd.detailId < pd2.detailId
WHERE pd2.patientId IS NULL
;
Adding conditions to the WHERE
clause will fail in two different ways:
- The
JOIN
happens before the WHERE
; by the time the latter clause is evaluated, the desired rows are likely already excluded.
pd2.detailId
will be NULL for the groupwise-maximum rows, so testing it will always fail for exactly the desired rows.
The filtering must happen before the JOIN
which means it must be done in subqueries replacing the simple table references:
SELECT pd.patientId, pd.detailId, pd.`value`
FROM (SELECT * FROM patient_details WHERE detailId IN (75, 151, 203)) AS pd
LEFT JOIN (SELECT * FROM patient_details WHERE detailId IN (75, 151, 203)) AS pd2
ON pd.patientId = pd2.patientId
AND pd.detailId < pd2.detailId
WHERE pd2.patientId IS NULL
;
Note the subquery is repeated. For MySQL 8.0 and higher, the query can be written using a CTE, which is DRYer (and thus more robust):
WITH pd AS (SELECT * FROM patient_details WHERE detailId IN (75, 151, 203))
SELECT pd.patientId, pd.detailId, pd.`value`
FROM pd
LEFT JOIN pd AS pd2
ON pd.patientId = pd2.patientId
AND pd.detailId < pd2.detailId
WHERE pd2.patientId IS NULL
;
The quantities in the estimated performance is similar to the correlated subquery:
id |
select_type |
table |
partitions |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
SIMPLE |
patient_details |
NULL |
ALL |
NULL |
NULL |
NULL |
NULL |
17 |
30 |
Using where |
1 |
SIMPLE |
patient_details |
NULL |
ref |
PRIMARY,details_of_patient |
PRIMARY |
4 |
patient_details.patientId |
4 |
100 |
Using where; Using index |
However, this does a better job filtering the rows early, resulting in only 30% of 17 rows, or about 5 rows, of pd
getting joined to pd2
.
Optimization
The optimization for the uncorrelated subquery can be applied to this query:
WITH pd AS (
SELECT *
FROM patient_details
WHERE detailId IN (75, 151, 203)
AND detailId BETWEEN 75 AND 203
)
SELECT pd.patientId, pd.detailId, pd.`value`
FROM pd
LEFT JOIN pd AS pd2
ON pd.patientId = pd2.patientId
AND pd.detailId < pd2.detailId
WHERE pd2.patientId IS NULL
;
This does an even better job in filtering out rows before the join:
id |
select_type |
table |
partitions |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
SIMPLE |
patient_details |
NULL |
ALL |
NULL |
NULL |
NULL |
NULL |
17 |
5.88 |
Using where |
1 |
SIMPLE |
patient_details |
NULL |
ref |
PRIMARY,details_of_patient |
PRIMARY |
4 |
so.patient_details.patientId |
4 |
100 |
Using where; Using index |
Note an estimated 5.88% of the 17 rows from pd
will get joined to pd2
, one of the best performances of all the options.
CTE with Window Function
The last example from the MySQL documentation uses a window function, Rank
, to assign a position to each row and filter based on that position. As with the subquery approaches, this query is easily adapted by the simple inclusion of the condition, but only works starting with MySQL 8.0 (as that's the first version to support window functions). After the usual rename and restriction addition:
WITH pd AS (
SELECT patientId, detailId, `value`,
RANK() OVER (PARTITION BY patientId
ORDER BY detailId DESC
) AS `Rank`
FROM patient_details
WHERE detailId IN (75, 151, 203)
)
SELECT patientId, detailId, `value`
FROM pd
WHERE `Rank` = 1
;
Note this is basically the query in SOS's answer, only using Rank
rather than Row_Number
. Performance is quite good, comparatively:
id |
select_type |
table |
partitions |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
PRIMARY |
|
|
ref |
<auto_key0> |
<auto_key0> |
8 |
const |
1 |
100.00 |
|
2 |
DERIVED |
patient_details |
|
ALL |
|
|
|
|
17 |
30 |
Using where; Using filesort |
Optimizations won't improve matters, as the primary select cannot be further optimized. Adding the usual optimization results in more rows being filtered out of the derived selection, but (as noted with the optimized correlated subquery) this won't impact performance, since the derived selection isn't then joined to anything else.
LIMIT
Consider Paul T's solution:
SELECT patientId, value
FROM patient_details
WHERE detailID IN (75, 151, 203)
AND patientId = 123456
ORDER BY detailID DESC
LIMIT 1;
This has great performance for a single ID, scanning only 3 rows:
id |
select_type |
table |
partitions |
type |
possible_keys |
key |
key_len |
ref |
rows |
filtered |
Extra |
1 |
SIMPLE |
patient_details |
NULL |
range |
PRIMARY,details_of_patient |
PRIMARY |
8 |
NULL |
3 |
100 |
Using where |
Applying the range restriction on detailId
optimization to this query does not improve performance.
Note this query only returns the results for a single patient ID. To get the results for multiple patients, the query would need to be run separately for each, resulting in performance comparable to the unoptimized correlated subquery (though with more round-trips to the DB).