
Say I have a simplified model in which a patient can have zero or more events. An event has a category and a date. I want to support questions like:

Find all patients that were given a medication after an operation and 
the operation happened after an admission. 

Where medication, operation and admission are all types of event categories. There are ~100 possible categories.

I'm expecting 1000s of patients and every patient has ~10 events per category.
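Concretely, the simplified model looks roughly like this (the types and constraints shown are only illustrative):

-- illustrative types and constraints; the real schema may differ
CREATE TABLE patient (
    id bigserial PRIMARY KEY
);

CREATE TABLE event (
    id         bigserial PRIMARY KEY,
    patient_id bigint NOT NULL REFERENCES patient (id),
    category   text   NOT NULL,  -- ~100 distinct values
    date       date   NOT NULL
);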

The naive solution I came up with was to keep just those two tables (patient and event), create an index on event.category, and then query using inner joins like:

SELECT COUNT(DISTINCT(patient.id)) FROM patient
INNER JOIN event AS medication
    ON  medication.patient_id = patient.id
    AND medication.category = 'medication'
INNER JOIN event AS operation
    ON  operation.patient_id = patient.id
    AND operation.category = 'operation'
INNER JOIN event AS admission
    ON  admission.patient_id = patient.id
    AND admission.category = 'admission'
WHERE medication.date > operation.date
    AND operation.date > admission.date;

However this solution does not scale well as more categories/filters are added. With 1,000 patients and 45,000 events I see the following performance behaviour:

| number of inner joins | approx. query response |
| --------------------- | ---------------------- |
| 2                     | 100ms                  |
| 3                     | 500ms                  |
| 4                     | 2000ms                 |
| 5                     | 8000ms                 | 

Explain: (link to the full EXPLAIN output)

Does anyone have any suggestions on how to optimize this query/data model?

Extra info:

  • Postgres 10.6
  • In the Explain output, project_result is equivalent to patient in the simplified model.

Advanced use case:

Find all patients that were given a medication within 30 days after an 
operation and the operation happened within 7 days after an admission.
  • thanks @ErwinBrandstetter for the heads up, added the postgres version. – shusson Jan 02 '19 at 15:15
  • What you've done with your event table is called an EAV model. It has pros and cons. I personally like it but I know what its limits are. One of the (few) cons is the perf issue you are experiencing. There's no miracle solution here: changing the model completely is one option, creating materialized views for the most queried events is another. You'll find plenty of ideas if you google *EAV perf issue*. – Thomas G Jan 02 '19 at 15:37
  • @ThomasG: My suggested rewrite should result in 3 index-only scans (under ideal read conditions) and perform like said "miracle" in comparison. – Erwin Brandstetter Jan 02 '19 at 15:39
  • @ErwinBrandstetter Your solution is nice (I upvoted), however it has its limits and you know it :) You work with 3 categories, but what will happen with 10? With 3 cats he had 500ms, with 5 he had 8000. The same will happen with your approach, but hopefully not at the same scale. – Thomas G Jan 02 '19 at 15:42
  • @ThomasG: Sure, there are limits. But this should be fast for *any* number of levels because each added level reduces the number of rows for the next step. So it should scale well in this respect - unless each level is not selective at all. It helps most if early steps are selective ... Maybe we get feedback with results from the OP. – Erwin Brandstetter Jan 02 '19 at 15:47
  • @ThomasG: On second thought, this case is probably broken by design, but I added a yet more efficient query for similar cases that work like this one probably doesn't. A `LATERAL` join to pick exactly *1* qualifying row for each additional step. Very fast with index support. Might be useful for the related case you mentioned ... – Erwin Brandstetter Jan 02 '19 at 16:56
  • Yeah sorry @ErwinBrandstetter I'm in the middle of testing your current answer, which looks promising. If I find issues I'll accept your current answer and create a new question, but it'll probably be tomorrow. – shusson Jan 02 '19 at 19:25
  • Added a proper solution for your advanced case. The fine print depends on undisclosed data types and constraints. – Erwin Brandstetter Jan 02 '19 at 19:47

2 Answers


First, if referential integrity is enforced with FK constraints, you can drop the patient table from the query completely:

SELECT COUNT(DISTINCT patient_id)  -- still not optimal
FROM   event a
JOIN   event o USING (patient_id)
JOIN   event m USING (patient_id)
WHERE  a.category = 'admission'
AND    o.category = 'operation'
AND    m.category = 'medication'
AND    m.date > o.date
AND    o.date > a.date;

Next, get rid of the repeated multiplication of rows (and the DISTINCT needed to counter it in the outer SELECT) by using EXISTS semi-joins instead:

SELECT COUNT(*)
FROM   event a
WHERE  EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date > a.date
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date > o.date
      )
   )
AND    a.category = 'admission';

Note: there can still be multiple qualifying admissions per patient, but that's probably a more fundamental problem in your data model / query design, and would need clarification as discussed in the comments.

If you indeed want to lump all cases of the same patient together for some reason, there are various ways to get the earliest admission for each patient in the initial step - and repeat a similar approach for every additional step. Probably fastest for your case (re-introducing the patient table to the query):

SELECT count(*)
FROM   patient p
CROSS  JOIN LATERAL ( -- get earliest admission
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id 
   AND    e.category = 'admission'
   ORDER  BY e.date
   LIMIT  1
   ) a
CROSS  JOIN LATERAL ( -- get earliest operation after that
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id 
   AND    e.category = 'operation'
   AND    e.date > a.date
   ORDER  BY e.date
   LIMIT  1
   ) o
WHERE EXISTS (  -- the *last* step can still be a plain EXISTS
      SELECT FROM event m
      WHERE  m.patient_id = p.id
      AND    m.category = 'medication'
      AND    m.date > o.date
      );


You might optimize your table design by shortening the lengthy (and redundant) category names. Use a lookup table and only store a small integer (or even an int2 or "char" value) as FK.
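A minimal sketch of that idea (all names here are placeholders, not from the question):

-- hypothetical lookup table; names are placeholders
CREATE TABLE category (
    category_id int2 PRIMARY KEY,
    name        text NOT NULL UNIQUE   -- 'admission', 'operation', 'medication', ...
);

-- event.category then becomes a small FK column instead of repeated text:
--   ALTER TABLE event ADD COLUMN category_id int2 REFERENCES category (category_id);
--   (backfill from the old text column, SET NOT NULL, then drop the old column)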

For best performance (and this is crucial) have a multicolumn index on (patient_id, category, date DESC) and make sure all three columns are defined NOT NULL. The order of index expressions is important. DESC is mostly optional here; Postgres can use the index with the default ASC sort order almost as efficiently in your case.
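For example (using the column names from the question; the index name is arbitrary):

-- multicolumn index covering patient, category and date
CREATE INDEX event_patient_id_category_date_idx
    ON event (patient_id, category, date DESC);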

If VACUUM (preferably in the form of autovacuum) can keep up with write operations or you have a read-only situation to begin with, you'll get very fast index-only scans out of this.



To implement your additional time frames (your "advanced use case"), build on the second query since we have to consider all events again.

You should really have case IDs or something more definitive to tie operation to admission and medication to operation etc. where relevant. (Could simply be the id of the referenced event!) Dates / timestamps alone are error-prone.
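Purely as an illustration of that idea (the column is hypothetical, not part of your model):

-- hypothetical: link each event to the event it follows (e.g. medication -> operation)
ALTER TABLE event ADD COLUMN follows_event_id bigint REFERENCES event (id);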

SELECT COUNT(*)                    -- to count cases
   --  COUNT(DISTINCT patient_id)  -- to count patients
FROM   event a
WHERE  EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date >= a.date      -- or ">"
   AND    o.date <  a.date + 7  -- based on data type "date"!
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date >= o.date       -- or ">"
      AND    m.date <  o.date + 30  -- syntax for timestamp is different
      )
   )
AND    a.category = 'admission';

About date / timestamp arithmetic:
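In short (a quick illustration; the query above assumes a plain date column):

-- date + integer adds whole days (the form used in the query above):
SELECT date '2019-01-02' + 30;                              -- 2019-02-01

-- a timestamp needs an explicit interval instead:
SELECT timestamp '2019-01-02 00:00' + interval '30 days';   -- 2019-02-01 00:00:00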

Erwin Brandstetter
  • Your final query still needs to `count(distinct patient_id)` to prevent patients with multiple `admission`s from being counted multiple times. Other than that it's an excellent answer and has my upvote. – Sentinel Jan 02 '19 at 16:00
  • @Sentinel: Good point. And true. But this might hint at a logical flaw in the data model / query to begin with. Since the same patient should only be admitted for the same operation *once*. Currently, it would seem that a treatment for a later admission (for a different operation) will create false positives. What's missing is a "case id" or some such to tie the various categories to the applicable case. – Erwin Brandstetter Jan 02 '19 at 16:11
  • @ErwinBrandstetter yes I may have oversimplified, but I do want to avoid the false negatives you have mentioned. I was hoping to use date functions like `date_trunc` to support use cases like `find patients that were given a medication after an operation but on the same day`. Ideally I want to support something like `find patients that were given a medication within x days of an operation`. I'll update the question with these extra details. – shusson Jan 02 '19 at 19:00
  • @shusson: I'd suggest to start a ***new*** question with exact details. You can add links to provide context. Changing the question substantially after answers have been given regularly creates a mess. Adding your actual table definition showing data types and constraints helps *a lot*. Also: do you want to count *patients* or *cases*? (The same patient might have been treated successfully on different occasions ...) – Erwin Brandstetter Jan 02 '19 at 19:15
  • @ErwinBrandstetter for now I would constrain the scope to just patients counts. I would like cases but I'll ask a separate question for that if I need to. – shusson Jan 02 '19 at 19:31
  • Thanks for your answer @ErwinBrandstetter, the query time for 5 categories was reduced from ~8000ms to ~100ms! The performance is still degrading slightly as more filters/categories are added, and I'll continue testing, but for now this is sufficient for my use case. – shusson Jan 02 '19 at 20:05
  • BTW: your final query can be reduced to a single `EXISTS( event e1 JOIN event e2 ...)`. No need for the nested `EXISTS(e2)`, IMHO. – wildplasser Jan 02 '19 at 20:24
  • @wildplasser: Yes. Would be a bit shorter. Probably the same query plan, not sure. – Erwin Brandstetter Jan 03 '19 at 03:23

You might find that conditional aggregation does what you want. The time component can be difficult to handle if your sequences get complicated (see below), but the basic idea is:

select e.patient_id
from events e
group by e.patient_id
having (max(e.date) filter (where e.category = 'medication') >
        min(e.date) filter (where e.category = 'operation')
       ) and
       (min(e.date) filter (where e.category = 'operation') >
        min(e.date) filter (where e.category = 'admission')
       );

This can be generalized for further categories.
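For instance, chaining a hypothetical fourth category 'discharge' (not part of the question) onto the end: the last category in the chain keeps max(), every earlier one uses min():

-- sketch only: 'discharge' is a hypothetical extra category
select e.patient_id
from events e
group by e.patient_id
having max(e.date) filter (where e.category = 'discharge') >
       min(e.date) filter (where e.category = 'medication')
   and min(e.date) filter (where e.category = 'medication') >
       min(e.date) filter (where e.category = 'operation')
   and min(e.date) filter (where e.category = 'operation') >
       min(e.date) filter (where e.category = 'admission');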

Using group by and having should have the consistent performance characteristics that you want (although for simple queries it might be slower). The trick with this -- or any approach -- is what happens when there are multiple categories for a given patient.

For instance, this or your approach will find:

admission --> operation --> admission --> medication

I suspect that you don't really want to find these records. You probably need an intermediate level, representing some sort of "episode" for a given patient.

If that is the case, you should ask another question with clearer examples of the data, the questions you might want to ask, and cases that match and do not match the conditions.

Gordon Linoff
  • Sounds like you are addressing the same design problem as this comment (almost at the same time, too): https://stackoverflow.com/questions/54008579/how-to-optimize-query-to-compute-row-dependent-datetime-relationships-in-postgre/54008917#comment94855846_54008917 – Erwin Brandstetter Jan 02 '19 at 16:16
  • In our model I don't think we can create something like an episode, i.e. the data is created outside of our control. My initial thought around this issue was to use date functions to enable things like "find medication that happened on the same day as an admission". I was trying to simplify the problem, but maybe I simplified too much. – shusson Jan 02 '19 at 18:31