yet another date gap-fill SQL puzzle

Question

I'm using Vertica, which precludes me from using CROSS APPLY, unfortunately. And apparently there's no such thing as CTEs in Vertica.

Here's what I've got:

t:
    day    | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 |  1 | 10     | 10
2011-12-03 |  1 | 12     | 2
2011-12-04 |  1 | 15     | 3

Note that on the first day, the delta is equal to the metric value. I'd like to fill in the gaps, like this:

t_fill:
    day    | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 |  1 | 10     | 10
2011-12-02 |  1 | 10     | 0 -- a delta of 0
2011-12-03 |  1 | 12     | 2
2011-12-04 |  1 | 15     | 3

I've thought of a way to do this day by day, but what I'd really like is a solution that works in one go.

I think I could get something working with LAST_VALUE, but I can't come up with the right JOIN statements that will let me properly partition and order on each id's day-by-day history.

edit: assume I have a table like this:

calendar:
    day 
------------
2011-01-01
2011-01-02
   ...

that can be involved with joins. My intent would be to maintain the date range in calendar to match the date range in t.

edit: A few more notes on what I'm looking for, just to be specific:

In generating t_fill, I'd like to exactly cover the date range in t, as well as any dates that are missing in between. So a correct t_fill will start on the same date and end on the same date as t. t_fill has two properties:

1) once an id appears on some date, it will always have a row for each later date. This is the gap-filling implied in the original question.

2) Should no row for an id ever appear again after some date, the t_fill solution should merrily generate rows with the same metric value (and 0 delta) from the date of that last data point up to the end date of t.

A solution might backfill earlier dates up to the start of the date range in t. That is, for any id that appears after the first date in t, rows between the first date in t and the first date for the id will be filled with metric=0 and d_metric=0. I don't prefer this kind of solution, since it has a higher growth factor for each id that enters the system. But I could easily deal with it by selecting into a new table only rows where metric!=0 and d_metric!=0.

I know nothing about Vertica, but, as they say on [their site](http://www.vertica.com/the-analytics-platform/advanced-in-database-analytics/), *‘Some of the Vertica Advanced Analytics features include: Native gap filling, …’*. So, if I were you, I should probably have started investigating in that direction. — Andriy M, Dec 27 '11 at 11:55
We have tried their GFI feature in the past, but it has a very specific bent that doesn't suit our needs here. — kimbo305, Dec 27 '11 at 22:29
What specifically doesn't meet your needs? It seems that gap-filling with constant interpolation would be the way to go. — geoffrobinson, Dec 31 '11 at 05:52
@geoff I clarified why I thought gap-filling that was focused on interpolation _between_ data points would not be sufficient in my comments here: http://stackoverflow.com/a/8714804/1116963 — kimbo305, Jan 04 '12 at 00:12

score 1 · Answer 1 · answered Jan 03 '12 at 16:56

This about what Jonathan Leffler proposed, but into old-fashioned low-level SQL (without fancy CTE's or window functions or aggregating subqueries):

SET search_path='tmp'
DROP TABLE ttable CASCADE;
CREATE TABLE ttable
        ( zday date NOT NULL
        , id INTEGER NOT NULL
        , metric INTEGER NOT NULL
        , d_metric INTEGER NOT NULL
        , PRIMARY KEY (id,zday)
        );
INSERT INTO ttable(zday,id,metric,d_metric) VALUES
 ('2011-12-01',1,10,10)
,('2011-12-03',1,12,2)
,('2011-12-04',1,15,3)
        ;

DROP TABLE ctable CASCADE;
CREATE TABLE ctable
        ( zday date NOT NULL
        , PRIMARY KEY (zday)
        );
INSERT INTO ctable(zday) VALUES
 ('2011-12-01')
,('2011-12-02')
,('2011-12-03')
,('2011-12-04')
        ;

CREATE VIEW v_cte AS (
        SELECT t.zday,t.id,t.metric,t.d_metric
        FROM ttable t
        JOIN ctable c ON c.zday = t.zday
        UNION
        SELECT c.zday,t.id,t.metric, 0
        FROM ctable c, ttable t
        WHERE t.zday < c.zday
        AND NOT EXISTS ( SELECT *
                FROM ttable nx
                WHERE nx.id = t.id
                AND nx.zday = c.zday
                )
        AND NOT EXISTS ( SELECT *
                FROM ttable nx
                WHERE nx.id = t.id
                AND nx.zday < c.zday
                AND nx.zday > t.zday
                )
        )
        ;
SELECT * FROM v_cte;

The results:

    zday    | id | metric | d_metric 
------------+----+--------+----------
 2011-12-01 |  1 |     10 |       10
 2011-12-02 |  1 |     10 |        0
 2011-12-03 |  1 |     12 |        2
 2011-12-04 |  1 |     15 |        3
(4 rows)

I finally got a chance to implement this. I believe it works, but the scale is not great. The source table t has about 500mil rows in it, and all those SELECTs in NOT EXISTs are a big pain. — kimbo305, Jan 07 '12 at 05:00
Don't speak too early: `not exists` subqueries are usually handled surprisingly efficient (probably because yhe construct is rather old). But I don't know about your platform/implementation. — wildplasser, Jan 07 '12 at 09:44
I'll take up the query plan presented to me with Vertica Support. — kimbo305, Jan 11 '12 at 07:45

score 0 · Answer 2 · edited May 23 '17 at 12:12

0

I am not Vertica user, but if you do not want to use their native support for GAP fillings, here you can find a more generic SQL-only solution to do so.

edited May 23 '17 at 12:12

Community

1
1

answered Jan 02 '12 at 18:35

simonecampora

397
2
8

I saw that one and tried to adapt it before starting this answer. The problem that I ran into is that the LEFT JOIN doesn't help me fill out the id values. The table in that link basically hardcodes for a single id, so the LEFT JOIN is still viable. – kimbo305 Jan 03 '12 at 05:41

score 0 · Answer 3 · answered Jan 03 '12 at 15:49

0

If you want to use something like a CTE, how about using a temporary table? Essentially, a CTE is a view for a particular query.

Depending on your needs you can make the temporary table transaction or session-scoped.

I'm still curious to know why gap-filling with constant-interpolation wouldn't work here.

answered Jan 03 '12 at 15:49

geoffrobinson

1,580
3
15
23

I think I'll answer this first, as a first step toward answering Jonathan's question. There's two kinds of date ranges with which we are concerned. – kimbo305 Jan 03 '12 at 23:55
e (goofed the comment): The first is the date range of the query against the new table we're seeking to build. The second is the date range for each id in the original metric table **t**. An id's metric history can start well after the first day for which we have any data. Sometimes, it can also stop. And of course, as we've seen, it can have missing days in between. The date range of a query can miss out entirely on some ids. When this happens, gap-filling as implemented as a TIMESERIES in Vertica doesn't do anything. Since there's no data to work with in the captured date range. – kimbo305 Jan 03 '12 at 23:59

score 0 · Answer 4 · answered Jan 03 '12 at 16:00

0

Given the complete calendar table, it is doable, though not exactly trivial. Without the calendar table, it would be a lot harder.

Your query needs to be stated moderately precisely, which is usually half the battle in any issue with 'how to write the query'. I think you are looking for:

For each date in Calendar between the minimum and maximum dates represented in T (or other stipulated range),
For each distinct ID represented in T,
Find the metric for the given ID for the most recent record in T on or before the date.

This gives you a complete list of dates with metrics.

You then need to self-join two copies of that list with dates one day apart to form the deltas.

Note that if some ID values don't appear at the start of the date range, they won't show up.

With that as guidance, you should be able get going, I believe.

answered Jan 03 '12 at 16:00

Jonathan Leffler

730,956
141
904
1,278

> Note that if some ID values don't appear at the start of the date range, they won't show up. This is exactly why we're looking to create the new table from **t**. – kimbo305 Jan 04 '12 at 00:02
The problem is that you've got nothing to work on for an ID that first appears 3 days after the start. Do you want to have zeroes for such entries? – Jonathan Leffler Jan 04 '12 at 00:03
e: dang. So bad at not hitting enter in the comments. The goal of this gap-filling isn't merely the gaps, but also to populate each id's history from the last row/date for which it has data down to the last date for the query: http://stackoverflow.com/questions/8640626/yet-another-date-gap-fill-sql-puzzle/8714804#8714804 As you can see, with a naive query against **t**, any ids whose history ends before the start date of the query won't show up at all. That's the problem with using the Vertica TIMESERIES solution as well as the problem with (my interpretation of) the caveat I quoted above. – kimbo305 Jan 04 '12 at 00:09
For sure, not all of the ids are present for the first date in **t**. I can't square the bit I quoted against your procedure, though :(. I just am not equipped to think in SQL. – kimbo305 Jan 04 '12 at 00:09
I'm not quite sure about my procedure this afternoon; it was written in modest haste this morning. What I'm not sure about is that your table **t** seems to have *d_metric* values for the first row in an ID. You can extend beyond (later than) the days in **t** easily enough, replicating the *metric* value and setting *d_metric* to zero for each day. Extending backwards (earlier) is not so obvious; probably you want *metric = 0* and *d_metric = 0*. Unfortunately, your question didn't say all this - and there are multiple possible ways to do this stuff, and if you don't tell us what you want... – Jonathan Leffler Jan 04 '12 at 00:14
1

Yes, I definitely need to be more precise. Let me try: In generating **t_fill**, I'd like to duplicate all available dates in **t**. So **t_fill** will start on the same date and end on the same date. **t_fill** has two properties. 1) once an id appears on some date, it will always have a row for each later date. This is the gap-filling implied in the original question. 2) Should no row for an id ever appear again after some date, **t_fill** should merrily generate rows with the same metric value (and 0 delta) from the date of that last data point to the end date of **t**. Continued... – kimbo305 Jan 04 '12 at 00:22
To answer your original point, yes, in our table, on the row of the first date for an id, metric and d_metric are equal. If an id's history starts on date d1, then yes, logically, its values for dates – kimbo305 Jan 04 '12 at 00:29
Also, I edited more or less the same stuff back into the original question, in hopes that future readers will get a better picture off the bat. – kimbo305 Jan 04 '12 at 00:38

yet another date gap-fill SQL puzzle

4 Answers4