Time series querying in Postgres

Question

This is a follow-up question to @Erwin's answer to Efficient time series querying in Postgres.

To keep things simple, I'll use the same table structure as that question:

id | widget_id | for_date | score |
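
For anyone who wants to reproduce the examples, a minimal sketch of that table follows. The column types and constraints are assumptions; the original question only gives the column names.

```sql
-- Hypothetical DDL: types and NOT NULL constraints are assumed,
-- only the column names come from the question.
CREATE TABLE score (
    id        integer PRIMARY KEY,
    widget_id integer NOT NULL,
    for_date  date    NOT NULL,
    score     integer NOT NULL
);
```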

The original question was how to get the score for each widget for every date in a range. If there was no entry for a widget on a date, the query should show the score from the previous entry for that widget. The solution using a cross join and a window function worked well as long as all the data was contained in the range being queried. My problem is that I want the previous score even if it lies outside the date range we are looking at.

Example data:

INSERT INTO score (id, widget_id, for_date, score) values
(1, 1337, '2012-04-07', 52),
(2, 2222, '2012-05-05', 99),
(3, 1337, '2012-05-07', 112),
(4, 2222, '2012-05-07', 101);

When I query for the range May 5th to May 10th 2012 (i.e. generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')) I would like to get the following:

DAY          WIDGET_ID  SCORE
May, 05 2012    1337    52
May, 05 2012    2222    99
May, 06 2012    1337    52
May, 06 2012    2222    99
May, 07 2012    1337    112
May, 07 2012    2222    101
May, 08 2012    1337    112
May, 08 2012    2222    101
May, 09 2012    1337    112
May, 09 2012    2222    101
May, 10 2012    1337    112
May, 10 2012    2222    101

The best solution so far (also by @Erwin) is:

SELECT a.day, a.widget_id, s.score
FROM  (
   SELECT d.day, w.widget_id
         ,max(s.for_date) OVER (PARTITION BY w.widget_id ORDER BY d.day) AS effective_date
   FROM  (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
   CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
   LEFT   JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
   ) a
LEFT JOIN  score s ON s.for_date = a.effective_date AND s.widget_id = a.widget_id
ORDER BY a.day, a.widget_id;

But as you can see in this SQL Fiddle it produces null scores for widget 1337 on the first two days. I would like to see the earlier score of 52 from row 1 in its place.

Is it possible to do this in an efficient way?

score 1 · Answer 1 · answered Oct 18 '13 at 06:25

As you wrote, you should find the matching score, and if there is a gap, fill it with the nearest earlier score. In SQL:

SELECT d.day, w.widget_id,
       COALESCE(s.score, (SELECT s2.score
                          FROM   score s2
                          WHERE  s2.for_date < d.day AND s2.widget_id = w.widget_id
                          ORDER  BY s2.for_date DESC
                          LIMIT  1)) AS score
FROM  (SELECT DISTINCT widget_id FROM score) AS w
CROSS  JOIN (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
LEFT   JOIN score s ON (s.for_date = d.day AND s.widget_id = w.widget_id)
ORDER  BY d.day, w.widget_id;

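COALESCE returns its first non-null argument, so in this case it means "if there is a gap": the subquery's result is used only when there is no score for that exact day.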
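
As a possible variation (a sketch, not part of the original answer): if score is never NULL and there is at most one row per widget per date, both of which are assumptions about the data, the gap check can be folded into the subquery by using <= instead of <. The outer LEFT JOIN and the COALESCE are then no longer needed.

```sql
-- Sketch under assumptions: score is NOT NULL and (widget_id, for_date) is unique.
-- The correlated subquery picks the latest score on or before each day.
SELECT d.day, w.widget_id,
       (SELECT s2.score
        FROM   score s2
        WHERE  s2.widget_id = w.widget_id
        AND    s2.for_date <= d.day
        ORDER  BY s2.for_date DESC
        LIMIT  1) AS score
FROM  (SELECT DISTINCT widget_id FROM score) AS w
CROSS  JOIN (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date,
                                    interval '1 day')::date AS day) d
ORDER  BY d.day, w.widget_id;
```

Whether this is faster than the COALESCE form depends on the data and indexes, so it would need to be tested.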

Great solution thanks, this appears to be the fastest so far for large data sets — bpaul, Oct 19 '13 at 02:22

Roman Pekar · Answer 2 · 2013-10-19T07:34:08.443

You can use the DISTINCT ON syntax in PostgreSQL:

with cte_d as (
    select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
), cte_w as (
    select distinct widget_id from score
)
select distinct on (d.day, w.widget_id)
    d.day, w.widget_id, s.score
from cte_d as d
    cross join cte_w as w
    left outer join score as s on s.widget_id = w.widget_id and s.for_date <= d.day
order by d.day, w.widget_id, s.for_date desc;

Or get the max date with a subquery:

with cte_d as (
    select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
), cte_w as (
    select distinct widget_id from score
)
select
    d.day, w.widget_id, s.score
from cte_d as d
    cross join cte_w as w
    left outer join score as s on s.widget_id = w.widget_id
where
    exists (
        select 1
        from score as tt
        where tt.widget_id = w.widget_id and tt.for_date <= d.day
        having max(tt.for_date) = s.for_date
    )
order by d.day, w.widget_id;

Performance really depends on the indexes you have on your table (a unique index on widget_id, for_date if possible). I think that if you have many rows for each widget_id, the second query would be more efficient, but you have to test it on your data.
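
A unique index of the kind mentioned above might look like the following. The index name is hypothetical, and the statement only succeeds if the table really holds at most one row per widget per date.

```sql
-- Hypothetical index name; assumes (widget_id, for_date) pairs are unique in score.
CREATE UNIQUE INDEX score_widget_id_for_date_uidx
    ON score (widget_id, for_date);
```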

>> sql fiddle demo <<

Thanks for the answer. Select distinct seems to be the way to go but I think @Erwin's solution is cleaner and more efficient. – bpaul, Oct 18 '13 at 22:50

Erwin Brandstetter · Accepted Answer · 2017-06-02T02:08:30.137

As @Roman mentioned, DISTINCT ON can solve this. Details in this related answer:

Select first row in each GROUP BY group?

Subqueries are generally a bit faster than CTEs, though:

SELECT DISTINCT ON (d.day, w.widget_id)
       d.day, w.widget_id, s.score
FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') d(day)
CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
LEFT   JOIN score s ON s.widget_id = w.widget_id AND s.for_date <= d.day
ORDER  BY d.day, w.widget_id, s.for_date DESC;

You can use a set returning function like a table in the FROM list.
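
A minimal illustration of that point: generate_series() sits in the FROM list with a table alias and a column alias, just like a table. With date arguments and an interval step it returns a timestamp type rather than date, so the sketch casts the result back to date.

```sql
-- generate_series() used like a table; d is the table alias, day the column alias.
-- The cast to date is added here because the function returns a timestamp type.
SELECT d.day::date AS day
FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, interval '1 day') AS d(day);
```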

SQL Fiddle

One multicolumn index should be the key to performance:

CREATE INDEX score_multi_idx ON score (widget_id, for_date, score);

The third column score is only included to make it a covering index in Postgres 9.2 or later. You would not include it in earlier versions.

Of course, if you have many widgets and a wide range of days, the CROSS JOIN produces a lot of rows, which has a price-tag. Only select the widgets and days you actually need.
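
One way to limit the cross join, sketched here with the two widget IDs from the example data standing in for "the widgets you actually need", is to replace the SELECT DISTINCT over the whole table with an explicit list:

```sql
-- Sketch: cross join only against an explicit widget list instead of every widget.
-- The IDs 1337 and 2222 are taken from the example data in the question.
SELECT DISTINCT ON (d.day, w.widget_id)
       d.day, w.widget_id, s.score
FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, interval '1 day') d(day)
CROSS  JOIN (VALUES (1337), (2222)) AS w(widget_id)
LEFT   JOIN score s ON s.widget_id = w.widget_id AND s.for_date <= d.day
ORDER  BY d.day, w.widget_id, s.for_date DESC;
```

This also avoids scanning score just to collect the distinct widget IDs.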

edited Jun 02 '17 at 02:08

answered Oct 18 '13 at 15:00

This works but seems to really slow down as the number of rows increases. I have 40-50k rows and it takes over 2 mins to complete. Is it the number of records in the cross join that is slowing it down? – bpaul Oct 19 '13 at 02:01
@bpaul do you have indexes on your table? – Roman Pekar Oct 19 '13 at 07:34
@bpaul: In particular a (possibly covering) multicolumn index. I added some details. – Erwin Brandstetter Oct 19 '13 at 11:28
@RomanPekar, @Erwin currently I index on widget_id and for_date separately. I'll add the multicolumn index and report back. I'm on Postgres 9.1.10 so I'll do `widget_id, for_date`. – bpaul Oct 19 '13 at 17:49
The multicolumn index didn't help very much. I'm now caching values for the bigger queries in aggregate tables – bpaul Oct 23 '13 at 23:06
