Combine the most recent entries from a number of tables

Question

I have a master table with a number of IDs in it:

ID  ...
0   ...
1   ...

And multiple tables (say vtbl1, vtbl2, vtbl3) with a foreign key to master, a timestamp and a value:

ID  Timestamp    Value
0   01/01/01..   2
1   01/01/02..   7
0   01/01/03..   5

I would like to get one or more entries for each ID in master with an entry (or null if no entries exist) containing the most recent entry in each v... table (grouped by timestamps):

ID  Timestamp    vtbl1.Value   vtbl2.Value   vtbl3.value
0   01/01/03..   5             2
0   01/01/01..                               4
1   01/01/02..   7             4             9

I'm sure this is fairly simple but my SQL is rusty and I've been going in circles. Any help would be appreciated.

Clarification

These values come from one or more sensors able to read one or more of the values. So the latest value in each value table for the ID is to be considered the current system state for that ID. If the timestamps match they are considered one update.

I need the minimal set of updates required for each ID to give a full data set for the current state.

Also the values can be of different types.

If you have multiple timestamps associated with the same `id` in the child tables, what determines the `max(timestamp)` to display in the results? The max in one might not be the max in the other... — sgeddes, Mar 29 '15 at 20:26
There may be multiple rows for each id. See ID 0 in the example. — Hector, Mar 29 '15 at 20:28

sgeddes · Answer 1 · 2015-03-29T20:23:06.137

1

If I understand your question correctly, one option is to use conditional aggregation and union all:

select id, timestamp, 
       max(case when tbl = 'tbl1' then value end) t1value,
       max(case when tbl = 'tbl2' then value end) t2value,
       max(case when tbl = 'tbl3' then value end) t3value
from (
    select id, timestamp, value, 'tbl1' tbl
    from tbl1
    union all
    select id, timestamp, value, 'tbl2' tbl
    from tbl2
    union all
    select id, timestamp, value, 'tbl3' tbl
    from tbl3
) t
group by id, timestamp

Or if you have multiple records per id and you want the highest value per by timestamp, you can include row_number() in your subquery:

select id, timestamp, 
       max(case when tbl = 'tbl1' then value end) t1value,
       max(case when tbl = 'tbl2' then value end) t2value,
       max(case when tbl = 'tbl3' then value end) t3value
from (
    select id, timestamp, value, 'tbl1' tbl,
        row_number() over (partition by id order by timestamp desc) rn
    from tbl1
    union all
    select id, timestamp, value, 'tbl2' tbl,
        row_number() over (partition by id order by timestamp desc) rn
    from tbl2
    union all
    select id, timestamp, value, 'tbl3' tbl,
        row_number() over (partition by id order by timestamp desc) rn
    from tbl3
) t
where rn = 1
group by id, timestamp

This can get difficult though if max(timestamp) values aren't the same in each of the child tables. Which do you join on at that point?

edited Mar 29 '15 at 20:23

answered Mar 29 '15 at 20:12

sgeddes

62,311
6
61
83

This will grab the highest `value` from each of the child tables rather than the `value` associated with the highest `timestamp` for each `id` – Brian DeMilia Mar 29 '15 at 20:18
@BrianDeMilia -- I see your point and added an alternative issue. We both might have a problem if this is the case though -- what determines the `max(timestamp)` from child tables to join on? At this point, additional data with sample results would be needed for clarification. – sgeddes Mar 29 '15 at 20:24
I'll add some clarification to the question - give me a minute. The short version is I need the smallest number of rows to give a complete most recent dataset (a complete data set being at least one value for every value table that has an entry associated with the ID). – Hector Mar 29 '15 at 20:28
Agreed, it's unclear what timestamp he wants shown as he only has one timestamp column in his expected output, when there is likely a different timestamp on each table that is the highest for the given id. I could understand if he had shown 3 timestamp columns, in which case he'd want the timestamp that was associated with each of the 3 values. – Brian DeMilia Mar 29 '15 at 20:28
Added clarification. I I want the timestamp associated with the updates. If there are multiple timestamps there should be multiple rows in the output. Cheers. – Hector Mar 29 '15 at 20:38

score 0 · Answer 2 · answered Mar 29 '15 at 20:17

select m.*, v1.value as t1_val, v2.value as t2_val, v3.value as t3_val
  from master m
  left join (select x.*
               from vtbl1 x
               join (select id, max(timestamp) as last_ts
                      from vtbl1
                     group by id) y
                 on x.id = y.id
                and x.timestamp = y.last_ts) v1
    on m.id = v1.id
  left join (select x.*
               from vtbl2 x
               join (select id, max(timestamp) as last_ts
                      from vtbl2
                     group by id) y
                 on x.id = y.id
                and x.timestamp = y.last_ts) v2
    on m.id = v2.id
  left join (select x.*
               from vtbl3 x
               join (select id, max(timestamp) as last_ts
                      from vtbl3
                     group by id) y
                 on x.id = y.id
                and x.timestamp = y.last_ts) v3
    on m.id = v3.id

score 0 · Answer 3 · edited May 23 '17 at 11:58

The fastest query technique depends on the distribution of values. DISTINCT ON would be a simple solution in Postgres, ideal for just a few values per id in each child table. But guessing from your description I expect many rows per id, so I suggest a solution with LATERAL joins. Requires Postgres 9.3+:

Optimize GROUP BY query to retrieve latest record per user

One more complication for your already-not-so-simple case:

Also the values can be of different types

Alternative 1

Cast all values to text. Every data type can be cast to text.

Base query

SELECT m.id, v.timestamp, 1 AS tbl, v.value  -- simple int as table id
FROM   master m
     , LATERAL (
   SELECT timestamp, value::text  -- cast to text
   FROM   vtbl1
   WHERE  id = m.id  -- lateral reference
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) v

UNION ALL
SELECT m.id, v.timestamp, 2 AS tbl, v.value  -- ascending without gaps
FROM   master m
     , LATERAL (
   SELECT timestamp, value::text
   FROM   vtbl2
   WHERE  id = m.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) v

UNION ALL
SELECT m.id, v.timestamp, 3 AS tbl, value
FROM  ...
;

All you need for this to be fast is an index on (id, timestamp) for each child table. Best in this form (adding value is only useful if you get index-only scans out of it):

CREATE INDEX vtbl1_combo_idx ON vtbl1 (id, timestamp DESC NULLS LAST, value)

1a. Aggregate (pseudo-crosstab)

To format as desired use aggregate functions on CASE expressions in Postgres 9.3 or older (like demonstrated by @sgeddes) or (better) the new aggregate FILTER clause in Postgres 9.4+:

How can I simplify this game statistics query?

SELECT id, timestamp
     , max(value) FILTER (WHERE tbl = 1) AS val1
     , max(value) FILTER (WHERE tbl = 2) AS val2
     , ...
FROM ( <query frm above> ) t
GROUP  BY 1, 2;

1b. Crosstab

Actual cross tabulation (also called "pivot" in other RDBMS) should be considerably faster. You need the additional module tablefunc installed, instructions below.

The special difficulty here: we have a composite "row name" (id, timestamp), but the function expects a single column as row name. So we substitute with row_number(), but do not display that surrogate key in the result:

SELECT id, timestamp, val1, val2, val3, ...
 -- normally SELECT * is enough; explicit list to filter rn
FROM  crosstab(
    $$
    SELECT row_number() OVER (ORDER BY id, timestamp DESC NULLS LAST) AS rn
         , id, timestamp, tbl, value
    FROM  ( <query from above> ) t
    ORDER  BY 1
    $$
  , 'SELECT generate_series(1,3)'  -- replace 3 with highest table nr.
    ) AS ct (
    rn int, id int, timestamp date
  , val1 text, val2 text, val3 text, ...);

Closely related:

Postgres - Transpose Rows to Columns

Relevant basics:

Alternative 2

Simple, but may be just as fast and preserves original data types:

SELECT id, timestamp
     , max(val1) AS val1, max(val2) AS val2, max(val3) AS val3, ...
FROM  (
   SELECT m.id, v.timestamp
        , v.value AS val1, NULL::int AS val2, NULL::numeric AS val3, ...   
          -- list all values with actual data type
   FROM   master m
        , LATERAL (
      SELECT timestamp, value
      FROM   vtbl1
      WHERE  id = m.id
      ORDER  BY timestamp DESC NULLS LAST
      LIMIT  1
      ) v

   UNION ALL
   SELECT m.id, v.timestamp
        , NULL, v.value, NULL, ...  -- column names & data types defined in first SELECT
   FROM   master m
        , LATERAL (
      SELECT timestamp, value
      FROM   vtbl2
      WHERE  id = m.id
      ORDER  BY timestamp DESC NULLS LAST
      LIMIT  1
      ) v

   UNION ALL
   SELECT m.id, v.timestamp
        , NULL, NULL, v.value, ...
   FROM  ...
   ) t
GROUP  BY 1, 2
ORDER  BY 1, 2;

Aside: Never use basic type names or reserved words (in standard SQL) like timestamp as identifier.

Cheers. Will look at properly and test when I get back. – Hector Mar 30 '15 at 12:11 — Hector, Mar 30 '15 at 12:11