Select distinct count across cases

Question

I have a database of patients with a VITALS table. This table contains a unique patient ID (PATID) per patient and a height variable (HT). A single patient may have >1 height recorded.

I'm trying to return a count of unique PATIDs within and across height ranges (e.g., 68-72", 72-76", etc.). Each PATID should be counted *only once*. However what I'm finding is that if a patient has multiple heights recorded, they'll be counted once within a range, but if their height crosses ranges, they'll be counted twice - once in each range.

E.g., if a patient has height recorded as 68, 72, and 73 they'll be counted once in the 68-72 range and once in the 72-76 range. I can tell this is happening because we have 3054 unique PATIDs, but the sum of the counts returned by the query is >5000.

My code is:

SELECT 
    CASE
        when "HT" >0 and "HT" <=4 then '0-4'
        when "HT" >4 and "HT" <=8 then '4-8'
        when "HT" >8 and "HT" <=12 then '8-12'
        when "HT" >12 and "HT" <=16 then '12-16'
        when "HT" >16 and "HT" <=20 then '16-20'
        when "HT" >20 and "HT" <=24 then '29-24'
        when "HT" >24 and "HT" <=28 then '24-28'
        when "HT" >28 and "HT" <=32 then '28-32'
        when "HT" >32 and "HT" <=36 then '32-36'
        when "HT" >36 and "HT" <=40 then '36-40'
        when "HT" >40 and "HT" <=44 then '40-44'
        when "HT" >44 and "HT" <=48 then '44-48'
        when "HT" >48 and "HT" <=52 then '48-52'
        when "HT" >52 and "HT" <=56 then '52-56'
        when "HT" >56 and "HT" <=60 then '56-60'
        when "HT" >60 and "HT" <=64 then '60-64'
        when "HT" >64 and "HT" <=68 then '64-68'
        when "HT" >68 and "HT" <=72 then '68-72'
        when "HT" >72 and "HT" <=76 then '72-76'
        when "HT" >76 and "HT" <=80 then '76-80'
        when "HT" >80 and "HT" <=84 then '80-84'
        when "HT" >84 and "HT" <=88 then '84-88'
        when "HT" IS NULL then 'Null'
        else '>88'    
    END AS "Height Range",            
    COUNT(DISTINCT vital."PATID") AS "Count"
FROM dbo."VITAL" vital
GROUP BY 1;

When the patient belongs to multiple HT ranges, why would one be prefered over another? It seems both the problem definition and the query are missing that rule. Maybe you want `PATID,max(HT) GROUP BY 1` and then categorize that into ranges. — Daniel Vérité, Aug 18 '15 at 13:38
`if a patient has height recorded as 68, 72, and 73 ...` Obviously, you must define which row to pick. And always provide your Postgres version. — Erwin Brandstetter, Aug 18 '15 at 13:38

score 0 · Answer 1 · answered Aug 18 '15 at 13:38

In the case when a patient has multiple records you have to choose what record you want.

One solution would be to change the source to only get the max height like this:

FROM (select "PATID", max("HT") "HT" from dbo."VITAL" GROUP BY "PATID") vital

Or maybe you could take the minimum or average of the records - the appropriate solution depends on your requirement.

score 0 · Accepted Answer · edited May 23 '17 at 12:29

You might fold duplicates in a subquery before you do the count:

SELECT CASE
        WHEN "HT" IS NULL THEN 'Null'
        WHEN "HT" <= 4  THEN '0-4'
        WHEN "HT" <= 8  THEN '4-8'
        WHEN "HT" <= 12 THEN '8-12'
        WHEN "HT" <= 16 THEN '12-16'
        WHEN "HT" <= 20 THEN '16-20'
        WHEN "HT" <= 24 THEN '29-24'
        WHEN "HT" <= 28 THEN '24-28'
        WHEN "HT" <= 32 THEN '28-32'
        WHEN "HT" <= 36 THEN '32-36'
        WHEN "HT" <= 40 THEN '36-40'
        WHEN "HT" <= 44 THEN '40-44'
        WHEN "HT" <= 48 THEN '44-48'
        WHEN "HT" <= 52 THEN '48-52'
        WHEN "HT" <= 56 THEN '52-56'
        WHEN "HT" <= 60 THEN '56-60'
        WHEN "HT" <= 64 THEN '60-64'
        WHEN "HT" <= 68 THEN '64-68'
        WHEN "HT" <= 72 THEN '68-72'
        WHEN "HT" <= 76 THEN '72-76'
        WHEN "HT" <= 80 THEN '76-80'
        WHEN "HT" <= 84 THEN '80-84'
        WHEN "HT" <= 88 THEN '84-88'
        ELSE                 '>88'    
    END AS "Height Range",            
    count(*) AS "Count"  -- DISTINCT not needed any more
FROM (
   SELECT DISTINCT ON ("PATID")  -- get greatest "HT" per patient
          "PATID", "HT"
   FROM   dbo."VITAL"
   ORDER  BY "PATID", "HT" DESC NULLS LAST
   ) sub
GROUP BY 1;

I also removed redundant checks from your CASE statement - assuming negative heights are not possible (you should have a CHECK constraint).

Detailed explanation for DISTINCT ON:

Select first row in each GROUP BY group?

Or use an aggregate in the subquery like @jpw suggested.

I was thinking a sub-query might be the way to go. I'm pretty new to SQL, so I just want to double check - in your example, we're essentially creating a table of PATIDs and HT, ordered by HT descending. Then we're selecting distinct PATIDs. The SELECT ON means that we're selecting the **first** instance of the HT associated with a PATID (aka the largest height for that PATID). Thanks! — Melanie, Aug 18 '15 at 18:51
@Melanie: The `ORDER BY` is only necessary to select the greatest `"HT"` in combination with `DISTINCT ON`. And yes, you basically got it. I added a link for detailed explanation. — Erwin Brandstetter, Aug 18 '15 at 21:30

Select distinct count across cases

2 Answers2