Efficient time series querying in Postgres

Question

I have a table in my PG db that looks somewhat like this:

id | widget_id | for_date | score |

Each referenced widget has a lot of these items. It's always 1 per day per widget, but there are gaps.

What I want to get is a result that contains all the widgets for each date since X. The dates are brought in via generate series:

 SELECT date.date::date
   FROM generate_series('2012-01-01'::timestamp with time zone,'now'::text::date::timestamp with time zone, '1 day') date(date)
 ORDER BY date.date DESC;

If there is no entry for a date for a given widget_id, I want to use the previous one. So say widget 1337 doesn't have an entry on 2012-05-10, but on 2012-05-08, then I want the resultset to show the 2012-05-08 entry on 2012-05-10 as well:

Actual data:
widget_id | for_date   | score
1312      | 2012-05-07 | 20
1337      | 2012-05-07 | 12
1337      | 2012-05-08 | 41
1337      | 2012-05-11 | 500

Desired output based on generate series:
widget_id | for_date   | score
1336      | 2012-05-07 | 20
1337      | 2012-05-07 | 12
1336      | 2012-05-08 | 20
1337      | 2012-05-08 | 41
1336      | 2012-05-09 | 20
1337      | 2012-05-09 | 41
1336      | 2012-05-10 | 20
1337      | 2012-05-10 | 41
1336      | 2012-05-11 | 20
1337      | 2012-05-11 | 500

Eventually I want to boil this down into a view so I have consistent data sets per day that I can query easily.

Edit: Made the sample data and expected resultset clearer

Please clarify: do you want to list rows for *all* widget_id's at once (like your text at the top implies) or for *a given* `widget_id` (like your example implies)? — Erwin Brandstetter, Feb 14 '13 at 15:44
Why is the widget_id=1312 missing in the desired output? And where does the widget_id=1336 come from ? — wildplasser, Feb 14 '13 at 19:32
Follow-up with solution for missing values: http://stackoverflow.com/questions/19442362/time-series-querying-in-postgres/19452830 — Erwin Brandstetter, Oct 18 '13 at 15:02

Clodoaldo Neto · Accepted Answer · 2013-02-14T13:39:18.930

SQL Fiddle

select
    widget_id,
    for_date,
    case
        when score is not null then score
        else first_value(score) over (partition by widget_id, c order by for_date)
        end score
from (
    select
        a.widget_id,
        a.for_date,
        s.score,
        count(score) over(partition by a.widget_id order by a.for_date) c
    from (
        select widget_id, g.d::date for_date
        from (
            select distinct widget_id
            from score
            ) s
            cross join
            generate_series(
                (select min(for_date) from score),
                (select max(for_date) from score),
                '1 day'
            ) g(d)
        ) a
        left join
        score s on a.widget_id = s.widget_id and a.for_date = s.for_date
) s
order by widget_id, for_date

score 7 · Answer 2 · edited Jun 20 '20 at 09:12

7

First of all, you can have a much simpler generate_series() table expression. Equivalent to yours (except for descending order, that contradicts the rest of your question anyways):

SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date

The type date is coerced to timestamptz automatically on input. The return type is timestamptz either way. I use a subquery below, so I can cast to the output to date right away.

Next, max() as window function returns exactly what you need: the highest value since frame start ignoring NULL values. Building on that, you get a radically simple query.

For a given widget_id

Most likely faster than involving CROSS JOIN or WITH RECURSIVE:

SELECT a.day, s.*
FROM  (
   SELECT d.day
         ,max(s.for_date) OVER (ORDER BY d.day) AS effective_date
   FROM  (
      SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date
      ) d(day)
   LEFT   JOIN score s ON s.for_date = d.day
                      AND s.widget_id = 1337 -- "for a given widget_id"
   ) a
LEFT   JOIN score s ON s.for_date = a.effective_date
                   AND s.widget_id = 1337
ORDER  BY a.day;

->sqlfiddle

With this query you can put any column from score you like into the final SELECT list. I put s.* for simplicity. Pick your columns.

If you want to start your output with the first day that actually has a score, simply replace the last LEFT JOIN with JOIN.

Generic form for all widget_id's

Here I use a CROSS JOIN to produce a row for every widget on every date ..

SELECT a.day, a.widget_id, s.score
FROM  (
   SELECT d.day, w.widget_id
         ,max(s.for_date) OVER (PARTITION BY w.widget_id
                                ORDER BY d.day) AS effective_date
   FROM  (SELECT generate_series('2012-05-05'::date
                                ,'2012-05-15'::date, '1d')::date AS day) d
   CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
   LEFT   JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
   ) a
JOIN  score s ON s.for_date = a.effective_date
             AND s.widget_id = a.widget_id  -- instead of LEFT JOIN
ORDER BY a.day, a.widget_id;

->sqlfiddle

edited Jun 20 '20 at 09:12

Community

1
1

answered Feb 14 '13 at 14:51

Erwin Brandstetter

605,456
145
1,078
1,228

1

Would your cross join less answer work for multiple widget_ids? [SQL Fiddle](http://www.sqlfiddle.com/#!12/686f3/1). From the question: _What I want to get is a result that contains all the widgets for each date since X_ – Clodoaldo Neto Feb 14 '13 at 15:25
@Clodoaldo: For multiple `widget_id`'s at once, you would have to adapt. I'll have a look at a more generic query. But, also from the question: *for a given widget_id*. – Erwin Brandstetter Feb 14 '13 at 15:39
I like this approach best so far, I was fiddling in a similar direction but failed to put the window functions and joins together. @Clodoaldo is right though, I would like this to also be queryable as a whole, so for example I'd like to ask "what was the average (or max, min, ...) score on date X", which would require something that does not depend on explicit widget ids – TheDeadSerious Feb 14 '13 at 17:00
@TheDeadSerious: I added a generic solution to show a row for every `(day, widget)`. Your question could have been more clear on that (your test case, particularly). For aggregates like `max()` or `min()` there is probably a more elegant way! Depends on what you need exactly and is a different question that shall go into a different question. – Erwin Brandstetter Feb 14 '13 at 17:04
And @Clodoaldo is right, of course, that a `CROSS JOIN` is the way to go to produce the raster for the generic solution. – Erwin Brandstetter Feb 14 '13 at 17:06
Sorry, I see that I should have been more specific about this. I thought the suffix "Eventually I want to boil this down into a view so I have consistent data sets per day that I can query easily." would make it clear I'd like a full data set in the results. Thanks for your effort! – TheDeadSerious Feb 14 '13 at 18:38
Great solution. I've been using a variation on this for a project but have encountered an issue when the "previous" value falls before the time period you are looking at. Check out this [SQL Fiddle](http://www.sqlfiddle.com/#!12/d7a20/1/0) for the simplest example. Is there a way to adjust for this? Thanks! – bpaul Oct 17 '13 at 00:39
@bpaul: Like in my first example, you can use a `LEFT JOIN` ([SQLfiddle](http://www.sqlfiddle.com/#!12/d7a20/3)). If you want something else I suppose you start a *new question* where you provide all details necessary. You can always refer to this one for context. – Erwin Brandstetter Oct 17 '13 at 12:00
@ErwinBrandstetter: Thanks for the reply I posted a new question http://stackoverflow.com/questions/19442362/time-series-querying-in-postgres can you take a look? – bpaul Oct 18 '13 at 05:40

score 2 · Answer 3 · answered Feb 14 '13 at 13:23

Using your table structure, I created the following Recursive CTE which starts with your MIN(For_Date) and increments until it reaches the MAX(For_Date). Not sure if there is a more efficient way, but this appears to work well:

WITH RECURSIVE nodes_cte(widgetid, for_date, score) AS (
-- First Widget Using Min Date
 SELECT 
    w.widgetId, 
    w.for_date, 
    w.score
 FROM widgets w 
  INNER JOIN ( 
      SELECT widgetId, Min(for_date) min_for_date
      FROM widgets
      GROUP BY widgetId
   ) minW ON w.widgetId = minW.widgetid 
        AND w.for_date = minW.min_for_date
UNION ALL
 SELECT 
    n.widgetId,
    n.for_date + 1 for_date,
    coalesce(w.score,n.score) score
 FROM nodes_cte n
  INNER JOIN (
      SELECT widgetId, Max(for_date) max_for_date
      FROM widgets 
      GROUP BY widgetId
   ) maxW ON n.widgetId = maxW.widgetId
  LEFT JOIN widgets w ON n.widgetid = w.widgetid 
    AND n.for_date + 1 = w.for_date
  WHERE n.for_date + 1 <= maxW.max_for_date
)
SELECT * 
FROM nodes_cte 
ORDER BY for_date

Here is the SQL Fiddle.

And the returned results (format the date however you'd like):

WIDGETID   FOR_DATE                     SCORE
1337       May, 07 2012 00:00:00+0000   12
1337       May, 08 2012 00:00:00+0000   41
1337       May, 09 2012 00:00:00+0000   41
1337       May, 10 2012 00:00:00+0000   41
1337       May, 11 2012 00:00:00+0000   500

Please note, this assumes your For_Date field is a Date -- if it includes a Time -- then you may need to use Interval '1 day' in the query above instead.

Hope this helps.

score 0 · Answer 4 · answered Feb 14 '13 at 19:42

The data:

DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;

CREATE TABLE widget
        ( widget_id INTEGER NOT NULL
        , for_date DATE NOT NULL
        , score INTEGER
         , PRIMARY KEY (widget_id,for_date)
        );
INSERT INTO widget(widget_id , for_date , score) VALUES
 (1312, '2012-05-07', 20)
, (1337, '2012-05-07', 12)
, (1337, '2012-05-08', 41)
, (1337, '2012-05-11', 500)
        ;

The query:

SELECT w.widget_id AS widget_id
        , cal::date AS for_date
        -- , w.for_date AS org_date
        , w.score AS score
FROM generate_series( '2012-05-07'::timestamp , '2012-05-11'::timestamp
                 , '1day'::interval) AS cal
        -- "half cartesian" Join;
        -- will be restricted by the NOT EXISTS() below
LEFT JOIN widget w ON w.for_date <= cal
WHERE NOT EXISTS (
        SELECT * FROM widget nx
        WHERE nx.widget_id = w.widget_id
        AND nx.for_date <= cal
        AND nx.for_date > w.for_date
        )
ORDER BY cal, w.widget_id
        ;

The result:

 widget_id |  for_date  | score 
-----------+------------+-------
      1312 | 2012-05-07 |    20
      1337 | 2012-05-07 |    12
      1312 | 2012-05-08 |    20
      1337 | 2012-05-08 |    41
      1312 | 2012-05-09 |    20
      1337 | 2012-05-09 |    41
      1312 | 2012-05-10 |    20
      1337 | 2012-05-10 |    41
      1312 | 2012-05-11 |    20
      1337 | 2012-05-11 |   500
(10 rows)

Efficient time series querying in Postgres

4 Answers4

For a given widget_id

Generic form for all widget_id's

Linked