
I have a large table with millions of rows like this:

CREATE TABLE mytable (
    row_id bigint,
    col_id bigint,
    value double precision,
    timestamp timestamp
);

Given:

  1. list_row = a list of row_ids (can be ordered if needed)
  2. list_col = a list of col_ids (again, can be ordered if needed)
  3. Both lists may be very large (maybe 10s of thousands)
  4. The table above may have many millions of entries

How do I (efficiently) return a result set where:

  1. Columns are all the col_ids present in list_col and occur in the same order that the col_ids occur in list_col
  2. Rows are all the row_ids present in list_row (they need not occur in the same order)
  3. Each field contains the value for the given row_id and col_id.
  4. We are only interested in the most recently recorded value for any row_id:col_id pair, i.e. use MAX(timestamp) or something similar as a filter.
  5. In the result, if there is no recorded value for a given row_id:col_id co-ordinate then that field should be null.

A visual example to clarify. The initial table:

+--------+--------+-------+-----------+
| row_id | col_id | value | timestamp |
+========+========+=======+===========+
|   10   |   20   |  100  | 2016-0... |
|   10   |   21   |  200  | 2015-0... |
|   11   |   20   |  300  | 2016-1... |
|   11   |   22   |  400  | 2016-0... |
+--------+--------+-------+-----------+

becomes:

                  col_id →
            +-----------------+
            | 20  | 21  | 22  |
            +=====+=====+=====+
row_id (10) | 100 | 200 |     |
   ↓   (11) | 300 |     | 400 |
            +-----+-----+-----+

I suspect that the correct answer is to start by creating a temporary table with the target col_ids as columns and then do some sort of join. I cannot work out how to do this efficiently. Is it possible to do this without needing a temporary table for each row_id?


1 Answer

crosstab() from the additional module tablefunc would work for regular cases.
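For illustration, a minimal sketch with a small, fixed column list, using the three col_ids from your example (the CREATE EXTENSION call and the ct column names are placeholders of my own):

CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM   crosstab(
          $$
          SELECT DISTINCT ON (row_id, col_id)
                 row_id, col_id, value
          FROM   mytable
          ORDER  BY row_id, col_id, timestamp DESC
          $$
        , $$VALUES (20), (21), (22)$$  -- the target col_ids, in list order
       ) AS ct (row_id bigint, col_20 float8, col_21 float8, col_22 float8);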

But not in your case, because of your point 3:

  3. Both lists may be very large (maybe 10s of thousands)

That's too many columns for Postgres. The manual:

There is a limit on how many columns a table can contain. Depending on the column types, it is between 250 and 1600. However, defining a table with anywhere near this many columns is highly unusual and often a questionable design.

I suggest returning arrays instead. Something like this, which works in any modern Postgres version:

SELECT row_id
     , array_agg(col_id) AS cols
     , array_agg(value)  AS vals
FROM  (
   SELECT DISTINCT ON (row_id, col_id)  -- most recent value per (row_id, col_id) pair
          row_id, col_id, value
   FROM   mytable
   WHERE  row_id IN (<long list>)
   AND    col_id IN (<long list>)
   ORDER  BY row_id, col_id, timestamp DESC
   ) sub
GROUP   BY 1;
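With the example data from your question, this returns something like:

 row_id |  cols   |   vals
--------+---------+-----------
     10 | {20,21} | {100,200}
     11 | {20,22} | {300,400}

cols and vals line up pairwise because both aggregates see the input rows in the same order. If you need a guaranteed element order, add ORDER BY col_id inside array_agg().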

DISTINCT ON (row_id, col_id) keeps only the first row for every distinct (row_id, col_id) combination; with ORDER BY row_id, col_id, timestamp DESC, that first row is the most recent one.
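For millions of rows, an index matching that sort order helps the DISTINCT ON step; a sketch, with an index name of my own choosing:

CREATE INDEX mytable_row_col_ts_idx ON mytable (row_id, col_id, timestamp DESC);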

A couple of alternative ways to return the data:

SELECT json_agg(json_build_object('col_id', col_id
                                , 'value' , value)) AS col_values1  -- requires pg 9.4+
     , json_agg(json_build_object(col_id, value))   AS col_values2  -- requires pg 9.4+
     , array_agg(ARRAY[col_id, value])              AS col_values3  -- requires pg 9.5+
     , array_agg(hstore(col_id::text, value::text)) AS col_values4  -- requires pg 8.4+ and hstore
FROM  ...  -- same as above

The last one requires the additional module hstore, installed once per database with CREATE EXTENSION hstore.
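Note that all of these variants only include pairs that actually exist, sorted by col_id rather than by list_col. If the result must follow the order of list_col and pad missing coordinates with NULL (your points 1 and 5), here is one possible sketch, assuming Postgres 9.4+ for WITH ORDINALITY and using literal arrays as stand-ins for your two lists:

SELECT r.row_id
     , array_agg(s.value ORDER BY c.ord) AS vals  -- NULL where no value is recorded
FROM   unnest('{10,11}'::bigint[]) AS r(row_id)   -- list_row
CROSS  JOIN unnest('{20,21,22}'::bigint[]) WITH ORDINALITY AS c(col_id, ord)  -- list_col, order kept
LEFT   JOIN (
   SELECT DISTINCT ON (row_id, col_id)
          row_id, col_id, value
   FROM   mytable
   WHERE  row_id IN (10, 11)
   AND    col_id IN (20, 21, 22)
   ORDER  BY row_id, col_id, timestamp DESC
   ) s USING (row_id, col_id)
GROUP  BY r.row_id;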
