Optimized way of looping through 2 tables with one being very big

Question

Basically, I'm working with data from icu patients and studying a particular disease. For every patient, I have the disease stage during all of the patient's stay in the ICU, as you can see here:

timeline table

The other table I have it's the records table:

records table

I simply want to add to the records table the stage of the disease when that measure was made, which can be made simply by records.charttime BETWEEN timeline.starts AND timeline.ends, but I want to do it in the most optimized way possible due to the records table having nearly 100 million rows, and the timeline table over 100k rows.

I read about the usage of CURSOR, but I don't know if it makes sense using CURSOR in this case, or maybe I'm just adding complexity and it can be solved easily using JOIN.

What's the best solution?

Please provide table definition and sample data as plain text, not a link to an image. And always declare your version of Postgres. Consider instructions here: https://stackoverflow.com/tags/postgresql-performance/info — Erwin Brandstetter, Apr 20 '20 at 03:29
Are you trying to add a new column to the "records" table and populate it? Or are you just trying to view the results of a join dynamically without altering the structure of your tables? — jjanes, Apr 20 '20 at 12:16

Erwin Brandstetter · Answer 1 · 2020-04-20T18:12:55.783

Looping with an explicit cursor is typically slower than a plain join. Postgres will do the looping for you, most likely with a nested loop in the query plan.

SELECT r.*, t.stage
FROM   records       r
LEFT   JOIN timeline t ON r.icustay_id = t.icustay_id -- guessing ... ?
                      AND r.charttime >= t.starts
                      AND r.charttime <  t.ends;      -- excl. upper bound

~~BETWEEN~~ would (probably incorrectly) include the upper bound.

The LEFT JOIN preserves all rows from records, even if no match is found in timeline.

I assume ranges in timeline cannot overlap? Else, you get duplicated rows. There should probably be an exclusion constraint for table timeline. See:

Preventing adjacent/overlapping entries with EXCLUDE in PostgreSQL

Since you are processing the whole table anyway, indexes won't be of much help.

However, since your records table is so big, you may not have enough memory available to process everything in RAM, so it might pay to process partitions of your records table at a time. The best way to partition depends on undisclosed details, most importantly the physical sort order of rows in both tables.

I think the join condition also needs include `timeline.icustay_id=records.icustay_id`. — jjanes, Apr 20 '20 at 12:18

Optimized way of looping through 2 tables with one being very big

1 Answers1