0

I have a population of patients in my main query (COPD) and another population of patients in my subquery (CANC). I want to exclude the CANC PAT_ID's from the main query results but this doesn't seem to be working and runs for way too long. Is there a better way to exclude subquery results? I tried NOT EXISTS and NOT IN but don't think I was doing it correctly as patients that should have been excluded were still showing up.

SELECT DISTINCT 
        pe.PAT_ENC_CSN_ID,
        pe.PAT_ID,
        pe.CONTACT_DATE,
        vp.PAT_MRN_ID,
        vp.PAT_NAME,
        vp.SEX_NAME,
        pat.BIRTH_DATE,
        vp.AGE_YEARS,
        vp.CUR_PCP_NAME
FROM PAT_ENC pe
    INNER JOIN V_PAT_FACT vp on pe.PAT_ID=vp.PAT_ID
    INNER JOIN PATIENT pat on vp.PAT_ID=pat.PAT_ID
    INNER JOIN CLARITY_ADT adt on pe.PAT_ENC_CSN_ID=adt.PAT_ENC_CSN_ID
    LEFT OUTER JOIN PAT_ENC_DX dx on pe.PAT_ID=dx.PAT_ID
    LEFT OUTER JOIN CLARITY_EDG edg on dx.DX_ID=edg.DX_ID
    INNER JOIN GROUPER_COMPILED_RECORDS gcr on edg.DX_ID=gcr.COMPILED_REC_LIST_VALUE

------- EXCLUSION CANCER
LEFT JOIN
    (
SELECT DISTINCT pl.PAT_ID
    FROM PROBLEM_LIST pl
        LEFT OUTER JOIN CLARITY_EDG edg on pl.DX_ID=edg.DX_ID
        INNER JOIN GROUPER_COMPILED_RECORDS rec on edg.DX_ID=rec.COMPILED_REC_LIST_VALUE
    WHERE rec.GROUPER_ID in ('2100000011')
    )cx on pe.PAT_ID=cx.PAT_ID

WHERE pe.CONTACT_DATE > '2016-07-01 00:00:00.000' 
    AND pe.WEIGHT>= '1587.3'  -- 45 kg or more
    AND vp.AGE_YEARS BETWEEN '40' AND '80'  
    AND vp.SEX_C in ('1','2') --FEMALE or MALE
    AND adt.PAT_CLASS_C in ('101','103','104')  ---IP, OBS or ED
    AND vp.IS_VALID_PAT_YN = 'Y'  -- NOT TEST
    AND pat.PAT_STATUS_C <>'2' --NOT DECEASED
    AND cx.PAT_ID IS NULL
Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
  • Wouldn't be better (or more organized) to create a view with the first part of the query and then check for the exclusion? – Nick Jul 12 '17 at 20:04
  • probably but I don't have those kind of rights on our database and our dba is out of town for a bit – Krazy Klauz Jul 12 '17 at 20:10
  • 1
    can you include the query you used the `NOT IN` on. And tell us what data you were seeing that let you know a patient was in the results that shouldn't have been. (e.g. you found a patient whose SEX_C was '1' or whatever it was.) – MBurnham Jul 12 '17 at 21:25
  • I was able to use the `NOT IN` logic successfully in the example that Thorsten posted in the first answer below. – Krazy Klauz Jul 13 '17 at 13:48

3 Answers3

2

To start with: DISTINCT is often an indicator for a badly written query. It is rare to have to remove duplicates from a result in a well written query, where you avoid producing duplicates in the first place. In your main query you select data from the tables PAT_ENC, V_PAT_FACT, and PATIENT. However, you join four other tables, too, thus possibly duplicating rows. Maybe you include these tables as a means to restrict rows, i.e. you only want rows that have a match in these tables. But then, why would you try to outer join them? An outer join doesn't represent a restriction. (Besides, your outer joins become inner joins by inner joining grouper_compiled_records to them, anyway.)

As to the exclusion part: you have a failed outer join again for the same reason mentioned above. You are using an anti join which is always a bit hard to read. I don't know why NOT EXISTS and NOT IN failed for you. They are preferable over anti joins, for they are easier to read and understand.

Your query should look something like this instead:

SELECT
  pe.pat_enc_csn_id,
  pe.pat_id,
  pe.contact_date,
  vp.pat_mrn_id,
  vp.pat_name,
  vp.sex_name,
  pat.birth_date,
  vp.age_years,
  vp.cur_pcp_name
FROM pat_enc pe
JOIN v_pat_fact vp ON pe.pat_id = vp.pat_id
JOIN patient pat ON vp.pat_id = pat.pat_id
WHERE pe.pat_enc_csn_id IN 
(
  SELECT pat_enc_csn_id 
  FROM clarity_adt
  WHERE pat_class_c IN (101, 103, 104) ---IP, OBS or ED
)
AND pe.pat_id IN
(
  SELECT dx.pat_id
  FROM pat_enc_dx dx 
  JOIN clarity_edg edg on dx.dx_id = edg.dx_id
  JOIN grouper_compiled_records gcr on edg.dx_id = gcr.compiled_rec_list_value
)
AND pe.contact_date > '2016-07-01' 
AND pe.weight >= 1587.3  -- 45 kg or more
AND vp.age_years BETWEEN 40 AND 80
AND vp.sex_c IN (1, 2) -- female or male
AND vp.is_valid_pat_yn = 'Y'  -- not test
AND pat.pat_status_c <> 2 --not deceased
AND pe.pat_id NOT IN -- exclude cancer patients
(
  SELECT pl.pat_id
  FROM problem_list pl
  JOIN clarity_edg edg ON pl.dx_id = edg.dx_id
  JOIN grouper_compiled_records rec ON edg.dx_id = rec.compiled_rec_list_value
  WHERE rec.grouper_id = 2100000011
);

(This assumes that problem_list.pat_id can ever be null, as a null in the list would have NOT IN fail. You'd have to add AND pl.pat_id IS NOT NULL to your subquery in the unlikely case that the column can be null.)

The data model looks a bit strange, though. It seems a patient, identified by pat_id can have several pat_enc and several v_pat_fact. But then why would you create all their combinations for each patient? Or is there a 1:1 relation, maybe exactly one v_pat_fact per patient? But why separate tables then? And why would you store an age (which changes every year) when you know the birthday?

I cannot be sure this is exactly the query you seek, but it should come close and you should be able to adjust it according to your needs.

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
  • to be a total pedant (or a pain in ..). Add declaimer to you answer regarding `NOT IN` and `NULL` values (https://stackoverflow.com/questions/129077/not-in-clause-and-null-values). Great work! – Alex Jul 13 '17 at 07:34
  • @Alex: Thanks. I've added the disclaimer. – Thorsten Kettner Jul 13 '17 at 08:19
  • Thorsten... I appreciate the in-depth response. I appear to have picked up some bad join habits but have been following the guidance I was given from our EMR vendors. The PAT_ENC provides all Patient Encounters while the PATIENT provides all demographic data and the V_PAT_FACT is a table view that provides some of the commonly requested info from numerous tables about patients (i.e. their provider, age, test patient etc...). I've mixed and matched to get the data and didn't realize that subqueries should be used for additional join types. – Krazy Klauz Jul 14 '17 at 13:37
  • Patients can have numerous encounters (pat to pat_enc is one to many) while V_PAT_FACT and PATIENT should be a one to one with the pat_ID as the foreign key. I've tried your query and I'm still excluding patients eligible for the research study and restricting patients that should be there though your query runs much faster and I appreciate your help. I think the issue may be in the "Grouper" that lists all of the diagnosis codes ("2100000011") and I'm reaching out to our vendor for clarification. Thanks for the help! – Krazy Klauz Jul 14 '17 at 13:41
1

Join Conditions

Your join conditions don't make sense. Specifically:

FROM PROBLEM_LIST pl
        LEFT OUTER JOIN CLARITY_EDG edg on pl.DX_ID=edg.DX_ID
        INNER JOIN GROUPER_COMPILED_RECORDS rec on edg.DX_ID=rec.COMPILED_REC_LIST_VALUE

In the above your LEFT JOIN to CLARITY_EDG table and then INNER JOIN to GROUPER_COMPILED_RECORDS. INNER JOIN requires that records exist in both left table and right table, therefore transforming the previous LEFT JOIN into an INNER JOIN.

Assuming that you need all tables in your join you need to change LEFT JOIN into an INNER JOIN.

The main query also needs revision.

UPDATE (thanks to @ThorstenKettner for pointing out the problem): I have removed my example query as it was not making sense.

Query Performance

DISTINCT - negatively impacts performance as SQL Server effectively has to do a self-join on the result set to check for duplicates. Check results to see if you really need it. If you do get duplicates locate the JOIN that produces them and add more join conditions.

WHERE - mismatching data types when specifying constants e.g. AND vp.AGE_YEARS BETWEEN '40' AND '80' If your AGE_YEARS column is INT make sure your BETWEEN condition specifies INTs as well e.g. BETWEEN 40 AND 80. If data types are not the same it forces SQL server to do type casting and in the case above it will cast the whole table column to string (as opposed to the constant to int) to evaluate the condition. For large tables it would not be fast also it prevents SQL Server from using indexes (if any created) on this column.

Alex
  • 4,885
  • 3
  • 19
  • 39
  • 1
    You exclude all patients that are in the problem list, as your outer join neither remove nor add patients to that list of course. (I.e. you can remove the outer join altogether; the result is the same.) This is very unlikely to be what the OP is after. I suppose the OP wants to exclude only problem patients with grouper_id = 2100000011. Apart from that you are giving good advice in your answer. – Thorsten Kettner Jul 13 '17 at 06:42
  • @ThorstenKettner - big thanks for spotting this. Answer updated. – Alex Jul 13 '17 at 07:15
0

since I don't have data to check I'm going off the assumption that your left join(exclusion join) was correct, although there is a left join then an inner join right after it, I changed it to a left join.

   SELECT DISTINCT 
            pe.PAT_ENC_CSN_ID,
            pe.PAT_ID,
            pe.CONTACT_DATE,
            vp.PAT_MRN_ID,
            vp.PAT_NAME,
            vp.SEX_NAME,
            pat.BIRTH_DATE,
            vp.AGE_YEARS,
            vp.CUR_PCP_NAME
    FROM PAT_ENC pe
        INNER JOIN V_PAT_FACT vp on pe.PAT_ID=vp.PAT_ID
        INNER JOIN PATIENT pat on vp.PAT_ID=pat.PAT_ID
        INNER JOIN CLARITY_ADT adt on pe.PAT_ENC_CSN_ID=adt.PAT_ENC_CSN_ID
        LEFT OUTER JOIN PAT_ENC_DX dx on pe.PAT_ID=dx.PAT_ID
        LEFT OUTER JOIN CLARITY_EDG edg on dx.DX_ID=edg.DX_ID
        INNER JOIN GROUPER_COMPILED_RECORDS gcr on edg.DX_ID=gcr.COMPILED_REC_LIST_VALUE

    WHERE pe.CONTACT_DATE > '2016-07-01 00:00:00.000' 
        AND pe.WEIGHT>= '1587.3'  -- 45 kg or more
        AND vp.AGE_YEARS BETWEEN '40' AND '80'  
        AND vp.SEX_C in ('1','2') --FEMALE or MALE
        AND adt.PAT_CLASS_C in ('101','103','104')  ---IP, OBS or ED
        AND vp.IS_VALID_PAT_YN = 'Y'  -- NOT TEST
        AND pat.PAT_STATUS_C <>'2' --NOT DECEASED
        ------- EXCLUSION CANCER
        AND pe.PAT_ID not in 
                                (
                                SELECT  pl.PAT_ID
                                FROM PROBLEM_LIST pl
                                    LEFT OUTER JOIN CLARITY_EDG edg on pl.DX_ID=edg.DX_ID
                                    LEFT JOIN GROUPER_COMPILED_RECORDS rec on edg.DX_ID=rec.COMPILED_REC_LIST_VALUE 
                                              AND rec.GROUPER_ID in ('2100000011')
                                )
lookslikeanevo
  • 566
  • 1
  • 5
  • 14
  • 1
    `WHERE rec.GROUPER_ID in ('2100000011')` does not make sense on a `LEFT JOIN` – Alex Jul 12 '17 at 23:25
  • You exclude *all* patients that are in the problem list, as your outer joins neither remove nor add patients to that list of course. (I.e. you can remove the outer joins altogether; the result is the same.) This is very unlikely to be what the OP is after. I suppose the OP wants to exclude only problem patients with `grouper_id = 2100000011`. – Thorsten Kettner Jul 13 '17 at 06:39