How to set a max range condition with timescale time_bucket_gapfill() in order to not fill real missing values?

Question

I'd like some advices to know if what I need to do is achievable with timescale functions.

I've just found out I can use time_bucket_gapfill() to complete missing data, which is amazing! I need data each 5 minutes but I can receive 10 minutes, 30 minutes or 1 hour data. So the function helps me to complete the missing points in order to have only 5 minutes points. Also, I use locf() to set the gapfilled value with last value found.

My question is: can I set a max range when I set the last value found with locf() in order to never overpass 1 hour ?

Example: If the last value found is older than 1 hour ago I don't want to fill gaps, I need to leave it empty to say we have real missing values here.

I think I'm close to something with this but apparently I'm not allowed to use locf() in the same case.

ERROR: multiple interpolate/locf function calls per resultset column not supported

Somebody have an idea how I can resolve that?

How to reproduce:

Create table powers

CREATE table powers (
    delivery_point_id BIGINT NOT NULL,
    at timestamp NOT NULL,
    value BIGINT NOT NULL
);

Create hypertable

SELECT create_hypertable('powers', 'at');

Create indexes

CREATE UNIQUE INDEX idx_dpid_at ON powers(delivery_point_id, at);
CREATE INDEX index_at ON powers(at);

Insert data for one day, one delivery point, point 10 minutes

INSERT INTO powers SELECT 1, at, round(random()*10000) FROM generate_series(TIMESTAMP '2021-01-01 00:00:00', TIMESTAMP '2022-01-02 00:00:00', INTERVAL '10 minutes') AS at;

Remove three hours of data from 4am to 7am

DELETE FROM powers WHERE delivery_point_id = 1  AND at < '2021-01-1 07:00:00'  AND at > '2021-01-01 04:00:00';

The query that need to be fixed

SELECT
    time_bucket_gapfill('5 minutes', at) AS point_five,
    avg(value) AS avg,
    CASE
        WHEN (locf(at) - at) > interval '1 hour' THEN null
        ELSE locf(avg(value))
        END AS gapfilled
FROM powers
GROUP BY point_five, at
ORDER BY point_five;

Actual: ERROR: multiple interpolate/locf function calls per resultset column not supported

Expected: Gapfilled values each 5 minutes except between 4am and 7 am (real missing values).

score 2 · Answer 1 · answered Jan 16 '23 at 16:26

This is a great question! I'm going to provide a workaround for how to do this with the current stuff, but I think it'd be great if you'd open a Github issue as well, because there might be a way to add an option for this that doesn't require a workaround like this.

I also think your attempt was a good approach and just requires a few tweaks to get it right!

The error that you're seeing is that we can't have multiple locf calls in a single column, this is a limitation that's pretty easy to work around as we can just shift both of them into a subquery, but that's not enough. The other thing that we need to change is that locf only works on aggregates, right now, you’re trying to use it on a column (at) that isn’t aggregated, which isn’t going to work, because it wouldn’t know which of the values of at in a time_bucket to “pull forward” for the gapfill.

Now you said you want to fill data as long as the previous point wasn’t more than one hour ago, so, we can take the last value of at in the bucket by using last(at, at) this is also the max(at) so either of those aggregates would work. So we put that into a CTE (common table expression or WITH query) and then we do the case statement outside like so:

WITH filled as (SELECT
                    time_bucket_gapfill('5 minutes', at) AS point_five,
                    avg(value) AS avg,
                    locf(last(at, at)) as filled_from,
                    locf(avg(value)) as filled_avg

                FROM powers
                WHERE at BETWEEN '2021-01-01 01:30:00' AND '2021-01-01 08:30:00'
                  AND delivery_point_id = 1
                GROUP BY point_five
                ORDER BY point_five)
SELECT point_five,
       avg,
       filled_from,
       CASE WHEN point_five - filled_from > '1 hour'::interval THEN NULL
           ELSE filled_avg
           END as gapfilled
FROM filled;

Note that I’ve tried to name my CTE expressively so that it’s a little easier to read!

Also, I wanted to point out a couple other hyperfunctions that you might think about using:

heartbeat_agg is a new/experimental one that will help you determine periods when your system is up or down, so if you're expecting points at least every hour, you can use it to find the periods where the delivery point was down or the like.
When you have more irregular sampling or want to deal with different data frequencies from different delivery points, I’d take a look a the time_weight family of functions. They can be more efficient than using something like gapfill to upsample, by instead letting you treat all the different sample rates similarly, without having to create more points and more work to do so. Even if you want to, for instance, compare sums of values, you’d use something like integral to get the time weighted sum over a period based on the locf interpolation.

Anyway, hope all that is helpful!

Thanks a lot @davidk for your precious help and for proponing various solutions. The query using `last()` works! Maybe an offset on the `time_bucket_gapfill()` function would be even more appropriate. I saw there are already a few issues about that point on GitHub. I will keep an eye on it if the tsdb team plan to implement it in the next months. — CamSim, Jan 17 '23 at 10:36

How to set a max range condition with timescale time_bucket_gapfill() in order to not fill real missing values?

1 Answers1