How to select the first part of string that has a match in list of values in PostgreSQL?

Question

I'm working with medical data and I want to find the primary diagnosis given to certain patient.

The diagnoses are all in one column as a comma separated string. For example "A10.3,B55.2, A10.1". Lets call this table patients and the column with the diagnoses diag_list. I want to create a new column that contains the first part from diag_list that has a match in different table that contains a list of diagnosis that could be considered "primary". Lets call this table ICD10 and the column with the list of primary diagnoses as primary_diag.

The reason I want just the first match is that diag_list is already sorted according to the severity of the symptoms. So I'm trying to find the diagnosis with most severe symptoms that could also be considered a primary diagnosis.

I tried to first convert the diag_list to an array with string_to_array, but could not find a way to conditionally choose the first match from this new array. How would you do this choosing? Or is there some completely different way to arrive to the same conclusion that would be simpler and/or more efficient?

Please add some data sample including requested result – David דודו Markovitz Feb 12 '17 at 21:36 — David דודו Markovitz, Feb 12 '17 at 21:36

score 2 · Accepted Answer · edited May 23 '17 at 12:25

Once you have your string_to_array you need to unnest it. Then, you need to join with ICD10, and, for each patient, take the first diagnostic.

Assume we have the following data:

CREATE TABLE patients
(
    patient_id integer PRIMARY KEY,
    diag_list text NOT NULL
) ;

INSERT INTO patients 
VALUES
    (1, 'A10.3,B55.2, A10.1') ,
    (2, 'A10.3, A10.1, C20.2') ;

CREATE TABLE ICD10
(
    primary_diag text PRIMARY KEY,
    diagnose text
) ;

INSERT INTO ICD10
VALUES
    ('B55.2', 'Something Bad'),
    ('A10.1', 'Somehitng Worse');

With this data, we can start to unnest with ordinality your data:

SELECT
    patient_id, trim(diag) AS diag, nr
FROM
    patients
    JOIN LATERAL unnest(string_to_array(diag_list, ',')) 
         WITH ORDINALITY AS a(diag, nr) ON true ;

and get

+------------+-------+----+
| patient_id | diag  | nr |
+------------+-------+----+
|          1 | A10.3 |  1 |
|          1 | B55.2 |  2 |
|          1 | A10.1 |  3 |
|          2 | A10.3 |  1 |
|          2 | A10.1 |  2 |
|          2 | C20.2 |  3 |
+------------+-------+----+

Next step: join this data with ICD10

WITH patients_and_diags AS
(
SELECT
    patient_id, trim(diag) AS diag, nr
FROM
    patients
    JOIN LATERAL unnest(string_to_array(diag_list, ',')) 
         WITH ORDINALITY AS a(diag, nr) ON true
)
SELECT
    patient_id, diag, nr, diagnose
FROM
    patients_and_diags
    JOIN ICD10 ON ICD10.primary_diag = patients_and_diags.diag ;

... and get:

+------------+-------+----+-----------------+
| patient_id | diag  | nr |    diagnose     |
+------------+-------+----+-----------------+
|          1 | B55.2 |  2 | Something Bad   |
|          1 | A10.1 |  3 | Somehitng Worse |
|          2 | A10.1 |  2 | Somehitng Worse |
+------------+-------+----+-----------------+

Now we need to take only the smallest 'nr' for each patient_id

The following query, does everything in one step

WITH patients_and_diags AS
(
SELECT
    patient_id, trim(diag) AS diag, nr
FROM
    patients
    JOIN LATERAL unnest(string_to_array(diag_list, ',')) 
         WITH ORDINALITY AS a(diag, nr) ON true
)
, patients_and_ICD10 AS
(
SELECT
    patient_id, diag, nr, diagnose
FROM
    patients_and_diags
    JOIN ICD10 ON ICD10.primary_diag = patients_and_diags.diag
)
, first_ICD10 AS
(
SELECT
    patient_id, min(nr) AS nr
FROM
    patients_and_ICD10
GROUP BY
    patient_id
)
SELECT 
    patient_id, diag, diagnose
FROM 
    first_ICD10 
    JOIN patients_and_ICD10 USING(patient_id, nr) ;

... and get you:

+------------+-------+-----------------+
| patient_id | diag  |    diagnose     |
+------------+-------+-----------------+
|          1 | B55.2 | Something Bad   |
|          2 | A10.1 | Somehitng Worse |
+------------+-------+-----------------+

You can check everything at http://rextester.com/RDYPD23700

The query could be made shorter by using some WINDOW functions; but I think this step-by-step approach is more clear.

Nice answer. I think the `min(nr)` is better done with a window function. Didn't know about rextester.com, looks much faster than sqlfiddle! — Andomar, Feb 12 '17 at 22:39
@Andomar: yeah, probably just taking the `first_value()` of a window `grouping by patient` and `ordering by nr`. But I'm not sure putting all optimizations (and too many concepts) at once is always the best idea. Let's not ["optimize prematurely"](http://wiki.c2.com/?PrematureOptimization) ;-) — joanolo, Feb 12 '17 at 22:42
Thank you! This solved my problem and your explanation was very easily readable. I was not aware of the `with ordinality` keyword and it will probably come in handy in multiple situations with the data I'm handling now. — Tuomo Kareoja, Feb 13 '17 at 09:28

score 1 · Answer 2 · edited May 23 '17 at 12:32

1

This is tricky. Starting in Postgres 9.4, you have the with ordinality keyword for unnest(). This includes a position column. This, in combination with some other stuff, will do what you need:

select distinct on (p.patientid) p.*, d.*
from patients p, later
     unnest(string_to_array(p.diag_list, ',')) with ordinality dp(code, ord) join
     diagnoses d
     on d.code = dp.code
order by p.patientid, ord asc;

Not surprisingly, Erwin Brandstetter has a discussion of this problem and how to solve it.

edited May 23 '17 at 12:32

Community

1
1

answered Feb 12 '17 at 21:34

Gordon Linoff

1,242,037
58
646
786

Unfortunately the "first' is conditional: "the first part from diag_list that has a match in different table" – Andomar Feb 12 '17 at 21:34

Andomar · Answer 3 · 2017-02-12T22:41:15.680

You can use unnest to "normalize" the diagnostics. That means you have one row per patient-diagnostic combination. Use generate_subscripts to add the position of that diagnostic in the list. (For Postgres 9.4 and later, with ordinality is better, as suggested in the other answers.) You can use the position to order the diagnistics, and filter for the highest primary diagnostic in the list:

with    normal_pat as
        (
        select  name
        ,       unnest(string_to_array(diag_list, ',')) as diag
        ,       generate_subscripts(string_to_array(diag_list, ','),1) as pos
        from    patients
        )
,       numbered_pat as
        (
        select  row_number() over (partition by name order by pos) rn
        ,       *
        from    normal_pat
        join    diagnostics d
        on      normal_pat.diag = d.primary_diag
        )
select  name
,       diag
,       pos as position_of_diagnostic_in_list
from    numbered_pat
where   rn = 1

Here is a working example at SQL Fiddle or rextester.

How to select the first part of string that has a match in list of values in PostgreSQL?

3 Answers3