0

I am attempting to pull the value associated with the highest detailID of a set of detailIDs that are predefined for each patientID.

here is an example of the table (also available as a fiddle):

CREATE TABLE `patient_details` (
    patientId INT UNSIGNED NOT NULL,
    detailId INT UNSIGNED NOT NULL,
    `value` VARCHAR(256),
    PRIMARY KEY (patientId, detailId),
    UNIQUE INDEX details_of_patient (patientId, detailId)
);

INSERT INTO patient_details (patientId, detailId, `value`)
  VALUES
(123, 75, '01'),
(123, 98, '02'),
(123, 151, '03 hit'),
(123, 251, '04'),

(321, 65, '05'),
(321, 75, '04'),
(321, 98, '03'),
(321, 151, '02 hit'),
(321, 180, '01'),

(123456, 75, '01'),
(123456, 89, '12/01/2022'),
(123456, 151, '03 hit'),
(123456, 215, '5681'),

(678910, 75, '01'),
(678910, 151, '03'),
(678910, 203, '12/01/2022 hit'),
(678910, 215, '56813')
;

What I need to do is pull the value of the highest detailID of 75, 151, 203.

I have tried using if function, to test the detailID but get syntax errors.

logically I am thinking I ought to be able to do something like this nested IF

select 
patientId,
table.value

if(detailid=203,set @largest_detailID=203,if(detailid=151,set @largest_detailID=151,if(detailid=75,set @largest_detailID=75,)))

from table

where detailID=@largest_detailID

What I would expect as a result

patientID value
123 03 hit
321 02 hit
123456 03 hit
678910 12/01/2022 hit

While there are a number of questions and answers on this site addressing getting non-aggregate columns from rows corresponding to a maximum or minimum value (such as "Retrieving the last record in each group", "MySQL Left Join + Min"), they don't select from a limited list of values, nor is it trivial to adapt the answers to add this restriction.

outis
  • 75,655
  • 22
  • 151
  • 221
oeohfppe
  • 3
  • 2
  • WHERE is applied before SELECT – MatBailie Apr 24 '22 at 09:34
  • Since SQL includes data definition, a [mcve] for an [SQL question](//meta.stackoverflow.com/q/333952/90527) should include [DDL](//en.wikipedia.org/wiki/Data_definition_language) statements for sample tables (rather than an ad hoc table specification) and [DML](//en.wikipedia.org/wiki/Data_manipulation_language) statements for sample data (rather than a dump or ad hoc format). Desired results don't need to be presented as sample code, as results are the output of code and not code themselves. – outis Apr 25 '22 at 01:00
  • You should not be doing any numeric comparisons of the value of an ID column. Are you using "highest detailid" to mean "most recent record"? If so, then add a timestamp to each row, and find the highest timestamp. – Andy Lester Apr 25 '22 at 21:39
  • Outis, thanks for the information I will do my best to remember that for next time, thank you for adding the additional info into my post. – oeohfppe Apr 26 '22 at 02:35
  • Andy, thanks for the information, I am reading an existing table, I do not recall there being a time stamp. – oeohfppe Apr 26 '22 at 02:37

3 Answers3

1

Instead of trying to mess with all the IF's and @ variables, why not use a descending order of the detailID to help instead, and then add a LIMIT 1 to get only the highest reference that exists based on the 3 detailID numbers from your criteria:

Try with:

SELECT patientId, value
  FROM patientInfo
 WHERE detailID IN (75, 151, 203)
 ORDER BY detailID DESC
 LIMIT 1;

... and since a detailID of 203 and 89 does not exist in the query results will get the expected entry:

enter image description here

Example dbfiddle.

Paul T.
  • 4,703
  • 11
  • 25
  • 29
0

Assuming you want to apply this logic across multiple patients, try using ROW_NUMBER() to partition the data by patientId then sort based on the highest detailId:

WITH cte AS (
    SELECT * 
           , ROW_NUMBER() OVER(
                 PARTITION BY PatientId 
                 ORDER BY DetailID DESC
            ) AS RowNum
    FROM   YourTable
    WHERE  DetailId IN (75, 151, 203)
)
SELECT * 
FROM   cte 
WHERE  RowNum = 1

Sample Data:

patientId detailId value
123456 75 01
123456 89 12/01/2022
123456 151 03
123456 215 56813
678910 75 01
678910 203 12/01/2022
678910 151 03
678910 215 56813

Results:

patientId detailId value RowNum
123456 151 03 1
678910 203 12/01/2022 1

db<>fiddle here

SOS
  • 6,430
  • 2
  • 11
  • 29
0

Queries for group-wise maximums and minimums provide a good starting point; the tricky part can be figuring out how to efficiently integrate the restriction on IDs.

Correlated Subquery

A correlated subquery is easy to adapt, but can be poor in performance. Without the restriction, the first example in the MySQL manual can be converted by a simple change of names:

SELECT patientId, detailId, `value`
FROM   patient_details AS pd
WHERE  detailId=(
           SELECT MAX(pd2.detailId)
             FROM patient_details AS pd2
             WHERE pd.patientId = pd2.patientId
       )
;

Since the sub-query is the simple maximum from the results, adding a condition to the WHERE clause will restrict the detailIds appropriately:

SELECT patientId, detailId, `value`
FROM   patient_details AS pd
WHERE  detailId=(
           SELECT MAX(pd2.detailId)
             FROM patient_details AS pd2
             WHERE pd.patientId = pd2.patientId
               AND pd2.detailId IN (75, 151, 203) -- the added restriction
       )
;

Performance-wise, this involves a full table scan for the outer query; applying EXPLAIN to the query for the sample schema & data gives the following plan:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY pd NULL ALL NULL NULL NULL NULL 17 100 Using where
2 DEPENDENT SUBQUERY pd2 NULL ref PRIMARY,details_of_patient PRIMARY 4 pd.patientId 4 30 Using where; Using index

Note the primary query joins 100% of the 17 rows to the dependent subquery.

Optimization

The uncorrelated subquery below will suggest an optimization: restricting the range of detailId. Applying the optimization to the correlated subquery gives:

SELECT patientId, detailId, `value`
FROM   patient_details AS pd
WHERE  detailId=(
           SELECT MAX(pd2.detailId)
              FROM patient_details AS pd2
              WHERE pd.patientId = pd2.patientId
                AND pd2.detailId IN (75, 151, 203)
                AND pd2.detailId BETWEEN 75 AND 203 -- optimization
       ) AND pd.detailId BETWEEN 75 AND 203 -- optimization
;

This cuts the number of rows from the primary query that get joined by 9:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY pd NULL ALL NULL NULL NULL NULL 17 11.11 Using where
2 DEPENDENT SUBQUERY pd2 NULL ref PRIMARY,details_of_patient PRIMARY 4 so.pd.patientId 4 5.88 Using where; Using index

The optimization similarly effects row filtering in the dependent subquery, but that ultimately won't affect performance as these rows aren't joined to anything following.

Uncorrelated Subquery

The uncorrelated subquery is also easily adapted in the same way: change the names and add the condition to the subquery:

SELECT pd.patientId, pd.detailId, pd.`value`
  FROM patient_details AS pd
    JOIN (
        SELECT patientId, MAX(detailId) AS detailId
          FROM patient_details
          WHERE detailId IN (75, 151, 203) -- the added restriction
          GROUP BY patientId
      ) AS pd2
      ON pd.patientId = pd2.patientId AND pd.detailId = pd2.detailId
;

However, performance is a little worse than the basic correlated subquery:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY NULL ALL NULL NULL NULL NULL 5 100 Using where
1 PRIMARY pd NULL eq_ref PRIMARY,details_of_patient PRIMARY 8 pd2.patientId,pd2.detailId 1 100 NULL
2 DERIVED patient_details NULL index PRIMARY,details_of_patient details_of_patient 8 NULL 17 30 Using where; Using index

Optimization

The poor performance is largely because of the IN condition. The plan for the unrestricted query is much more performant:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY NULL ALL NULL NULL NULL NULL 5 100 Using where
1 PRIMARY pd NULL eq_ref PRIMARY,details_of_patient PRIMARY 8 pd2.patientId,pd2.detailId 1 100 NULL
2 DERIVED patient_details NULL range PRIMARY,details_of_patient PRIMARY 4 NULL 5 100 Using index for group-by

Note that according to this plan, the query uses a range join on pd2, as opposed to the index scan shown in the plan for the restricted query. This points to a possible optimization: add a range condition on detailId:

SELECT pd.patientId, pd.detailId, pd.`value`
  FROM patient_details AS pd
    JOIN (
        SELECT patientId, MAX(detailId) AS detailId
          FROM patient_details
          WHERE detailId IN (75, 151, 203)
            AND detailId BETWEEN 75 AND 203 -- the added optimization
          GROUP BY patientId
      ) AS pd2
      ON pd.patientId = pd2.patientId AND pd.detailId = pd2.detailId
;

This results in a better plan than the unoptimized query (though still trailing the optimized queries), as it reduces the number of rows in the primary select:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY NULL ALL NULL NULL NULL NULL 2 100 Using where
1 PRIMARY pd NULL eq_ref PRIMARY,details_of_patient PRIMARY 8 pd2.patientId,pd2.detailId 1 100 NULL
2 DERIVED patient_details NULL index PRIMARY,details_of_patient details_of_patient 8 NULL 17 5.88 Using where; Using index

As of MySQL 8.0, the query can be written with a CTE:

WITH pd2 AS (
    SELECT patientId, MAX(detailId) AS detailId
      FROM patient_details
      WHERE detailId IN (75, 151, 203)
        AND detailId BETWEEN 75 AND 203
      GROUP BY patientId)
SELECT pd.patientId, pd.detailId, pd.`value`
  FROM patient_details AS pd
    JOIN pd2
      ON pd.patientId = pd2.patientId AND pd.detailId = pd2.detailId
;

LEFT JOIN

The left join is the trickiest to adapt, as naively adding the detailID restriction will produce incorrect results most of the time. Starting from the documentation example (after renaming):

SELECT pd.patientId, pd.detailId, pd.`value`
  FROM patient_details AS pd
    LEFT JOIN patient_details AS pd2
      ON     pd.patientId = pd2.patientId
         AND pd.detailId < pd2.detailId
  WHERE pd2.patientId IS NULL
;

Adding conditions to the WHERE clause will fail in two different ways:

  1. The JOIN happens before the WHERE; by the time the latter clause is evaluated, the desired rows are likely already excluded.
  2. pd2.detailId will be NULL for the groupwise-maximum rows, so testing it will always fail for exactly the desired rows.

The filtering must happen before the JOIN which means it must be done in subqueries replacing the simple table references:

SELECT pd.patientId, pd.detailId, pd.`value`
  FROM (SELECT * FROM patient_details WHERE detailId IN (75, 151, 203)) AS pd
    LEFT JOIN (SELECT * FROM patient_details WHERE detailId IN (75, 151, 203)) AS pd2
      ON     pd.patientId = pd2.patientId
         AND pd.detailId < pd2.detailId
  WHERE pd2.patientId IS NULL
;

Note the subquery is repeated. For MySQL 8.0 and higher, the query can be written using a CTE, which is DRYer (and thus more robust):

WITH pd AS (SELECT * FROM patient_details WHERE detailId IN (75, 151, 203))
SELECT pd.patientId, pd.detailId, pd.`value`
  FROM pd
    LEFT JOIN pd AS pd2
      ON     pd.patientId = pd2.patientId
         AND pd.detailId < pd2.detailId
  WHERE pd2.patientId IS NULL
;

The quantities in the estimated performance is similar to the correlated subquery:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE patient_details NULL ALL NULL NULL NULL NULL 17 30 Using where
1 SIMPLE patient_details NULL ref PRIMARY,details_of_patient PRIMARY 4 patient_details.patientId 4 100 Using where; Using index

However, this does a better job filtering the rows early, resulting in only 30% of 17 rows, or about 5 rows, of pd getting joined to pd2.

Optimization

The optimization for the uncorrelated subquery can be applied to this query:

WITH pd AS (
    SELECT * 
      FROM patient_details 
      WHERE detailId IN (75, 151, 203)
        AND detailId BETWEEN 75 AND 203
  )
SELECT pd.patientId, pd.detailId, pd.`value`
  FROM pd
    LEFT JOIN pd AS pd2
      ON     pd.patientId = pd2.patientId
         AND pd.detailId < pd2.detailId
  WHERE pd2.patientId IS NULL
;

This does an even better job in filtering out rows before the join:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE patient_details NULL ALL NULL NULL NULL NULL 17 5.88 Using where
1 SIMPLE patient_details NULL ref PRIMARY,details_of_patient PRIMARY 4 so.patient_details.patientId 4 100 Using where; Using index

Note an estimated 5.88% of the 17 rows from pd will get joined to pd2, one of the best performances of all the options.

CTE with Window Function

The last example from the MySQL documentation uses a window function, Rank, to assign a position to each row and filter based on that position. As with the subquery approaches, this query is easily adapted by the simple inclusion of the condition, but only works starting with MySQL 8.0 (as that's the first version to support window functions). After the usual rename and restriction addition:

WITH pd AS (
   SELECT patientId, detailId, `value`,
          RANK() OVER (PARTITION BY patientId
                           ORDER BY detailId DESC
                      ) AS `Rank`
     FROM patient_details
     WHERE detailId IN (75, 151, 203)
)
SELECT patientId, detailId, `value`
  FROM pd
  WHERE `Rank` = 1
;

Note this is basically the query in SOS's answer, only using Rank rather than Row_Number. Performance is quite good, comparatively:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY ref <auto_key0> <auto_key0> 8 const 1 100.00
2 DERIVED patient_details ALL 17 30 Using where; Using filesort

Optimizations won't improve matters, as the primary select cannot be further optimized. Adding the usual optimization results in more rows being filtered out of the derived selection, but (as noted with the optimized correlated subquery) this won't impact performance, since the derived selection isn't then joined to anything else.

LIMIT

Consider Paul T's solution:

SELECT patientId, value
  FROM patient_details
  WHERE detailID IN (75, 151, 203)
    AND patientId = 123456
  ORDER BY detailID DESC
  LIMIT 1;

This has great performance for a single ID, scanning only 3 rows:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE patient_details NULL range PRIMARY,details_of_patient PRIMARY 8 NULL 3 100 Using where

Applying the range restriction on detailId optimization to this query does not improve performance.

Note this query only returns the results for a single patient ID. To get the results for multiple patients, the query would need to be run separately for each, resulting in performance comparable to the unoptimized correlated subquery (though with more round-trips to the DB).

outis
  • 75,655
  • 22
  • 151
  • 221