
I have timeseries data with up to millisecond accuracy. Some timestamps can coincide exactly, in which case a database id column can be used to sort out which row is the latest.

I am trying to use Timescale to get the latest value per second. Here is an example of the data I'm looking at:

time                    | db_id | value
2020-01-01 08:39:23.293 | 4460  | 136.01
2020-01-01 08:39:23.393 | 4461  | 197.95
2020-01-01 08:40:38.973 | 4462  |  57.95
2020-01-01 08:43:01.223 | 4463  | 156.00
2020-01-01 08:43:26.577 | 4464  | 253.43
2020-01-01 08:43:26.577 | 4465  |  53.68
2020-01-01 08:43:26.577 | 4466  | 160.00

When obtaining the latest price per second, my results should look like this:

time                | value
2020-01-01 08:39:23 | 197.95
2020-01-01 08:39:24 | 197.95
...
2020-01-01 08:40:37 | 197.95
2020-01-01 08:40:38 |  57.95
2020-01-01 08:40:39 |  57.95
...
2020-01-01 08:43:25 |  57.95
2020-01-01 08:43:26 | 160.00
2020-01-01 08:43:27 | 160.00
...

I've successfully obtained the latest value per second using the Timescale time_bucket function:

SELECT last(value, db_id), time_bucket('1 second', time) AS per_second
FROM timeseries
GROUP BY per_second
ORDER BY per_second;

but it leaves holes in the time column.

per_second          | last
2020-01-01 08:39:23 | 197.95
2020-01-01 08:40:38 |  57.95
2020-01-01 08:43:26 | 160.00

The solution I thought up is to create a table with per-second timestamps and null values, migrate the data from the previous result into it, and then replace the null values with the last occurring value, but that seems like a lot of intermediate steps.

I'd like to know if there is a better approach to this issue of finding the "latest value" per second, minute, hour, etc. I originally tried solving it in Python, as it seemed like a simple problem, but it took up a lot of computing time.

Kafkaesque

  • Hello, thanks for the detailed question (for transparency, I work for Timescale). I don't have the direct answer to the above, but wanted to share this very recent video where one of our developer advocates addresses finding the latest value and reviews a number of options: https://www.youtube.com/watch?v=HwJrmYJoIw4 – greenweeds Dec 22 '21 at 10:06
  • Hello again! A colleague just mentioned that they think time_bucket_gapfill might be one of the functions you'd find valuable in this scenario; it would be worth exploring that in the Timescale docs. – greenweeds Dec 22 '21 at 13:20
  • Thanks for your response @greenweeds! Glad to know that my problem is not as trivial as I thought :) I'll definitely check out the video and time_bucket_gapfill(). – Kafkaesque Dec 23 '21 at 02:31

1 Answer


Found a nice working solution to my problem. It involves four main steps:

  1. Getting the latest values:

    select
        time_bucket('1 second', time + interval '1 second') as interval,
        last(value, db_id) as last_value
    from timeseries
    where time > <date_start> and time < <date_end>
    group by interval
    order by interval;

This will produce a table that has the latest value for each occupied second. last takes a second argument (db_id here) so another level of sorting breaks ties between rows with identical timestamps. e.g.

interval            | last_value
2020-01-01 08:39:23 | 197.95
2020-01-01 08:40:38 |  57.95
2020-01-01 08:43:26 | 160.00

Note that I shift the time by one second with + interval '1 second', since I only want data from before a particular second; without this, on-the-second data would be considered part of the last price.
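To see the shift in isolation, here is a minimal self-contained check (the timestamp is just a made-up tick borrowed from the sample data) comparing the shifted and unshifted buckets:

    -- hypothetical one-row check of the one-second shift
    select
        time_bucket('1 second', t) as unshifted,
        time_bucket('1 second', t + interval '1 second') as shifted
    from (values (timestamp '2020-01-01 08:43:26.577')) as s(t);
    -- unshifted = 2020-01-01 08:43:26, shifted = 2020-01-01 08:43:27

The tick at 08:43:26.577 lands in the 08:43:27 bucket after the shift, i.e. bucket T holds the last value from strictly before second T.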

  2. Creating a table with per-second timestamps:

    select
        time_bucket_gapfill('1 second', time) as per_second
    from timeseries
    where time > <date_start> and time < <date_end>
    group by per_second
    order by per_second;

Here I produce a table where each row is a per-second timestamp. Note that time_bucket_gapfill needs a bounded time range, supplied either through its optional start/finish arguments or, as here, through the WHERE clause.

e.g.

per_second
2020-01-01 00:00:00.000
2020-01-01 00:00:01.000
2020-01-01 00:00:02.000
2020-01-01 00:00:03.000
2020-01-01 00:00:04.000
2020-01-01 00:00:05.000
  3. Joining them together and adding a value_partition column:
select
    per_second,
    last_value,
    sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
from
    (
        select
            time_bucket('1 second', time + interval '1 second') as interval,
            last(value, db_id) as last_value
        from timeseries
        where time > <date_start> and time < <date_end>
        group by interval
    ) a
right join
    (
        select
            time_bucket_gapfill('1 second', time) as per_second
        from timeseries
        where time > <date_start> and time < <date_end>
        group by per_second
    ) b
on a.interval = b.per_second;

Inspired by this answer, the goal is to have a counter (value_partition) that increments only if the value is not null.

e.g.

per_second              | last_value | value_partition
2020-01-01 00:00:00.000 | NULL       | 0
2020-01-01 00:00:01.000 | 15.82      | 1
2020-01-01 00:00:02.000 | NULL       | 1
2020-01-01 00:00:03.000 | NULL       | 1
2020-01-01 00:00:04.000 | NULL       | 1
2020-01-01 00:00:05.000 | NULL       | 1
2020-01-01 00:00:06.000 | NULL       | 1
2020-01-01 00:00:07.000 | NULL       | 1
2020-01-01 00:00:08.000 | NULL       | 1
2020-01-01 00:00:09.000 | NULL       | 1
2020-01-01 00:00:10.000 | 15.72      | 2
2020-01-01 00:00:10.000 | 14.67      | 3
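This counter, combined with the first_value window function used in the next step, is plain PostgreSQL and can be tested in isolation. A minimal sketch with made-up values, independent of TimescaleDB:

select
    ts,
    first_value(v) over (partition by value_partition order by ts) as filled
from (
    select
        ts,
        v,
        -- running count of non-null values: each non-null row opens a new partition
        sum(case when v is null then 0 else 1 end) over (order by ts) as value_partition
    from (values (1, 10.0), (2, null), (3, null), (4, 20.0), (5, null)) as sparse(ts, v)
) q
order by ts;
-- filled: 10.0, 10.0, 10.0, 20.0, 20.0

Because the counter increments exactly at each non-null row, every partition starts with a non-null value, and first_value copies it forward over the trailing nulls.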

  4. Filling in the null values:
select
    per_second,
    first_value(last_value) over (partition by value_partition order by per_second) as latest_value
from
    (
        select
            per_second,
            last_value,
            sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
        from
            (
                select
                    time_bucket('1 second', time + interval '1 second') as interval,
                    last(value, db_id) as last_value
                from timeseries
                where time > <date_start> and time < <date_end>
                group by interval
            ) a
        right join
            (
                select
                    time_bucket_gapfill('1 second', time) as per_second
                from timeseries
                where time > <date_start> and time < <date_end>
                group by per_second
            ) b
        on a.interval = b.per_second
    ) as q
order by per_second;

This final step brings everything together: it takes advantage of the value_partition column and overwrites the null values with the most recent non-null value.

e.g.

per_second              | latest_value
2020-01-01 00:00:00.000 | NULL
2020-01-01 00:00:01.000 | 15.82
2020-01-01 00:00:02.000 | 15.82
2020-01-01 00:00:03.000 | 15.82
2020-01-01 00:00:04.000 | 15.82
2020-01-01 00:00:05.000 | 15.82
2020-01-01 00:00:06.000 | 15.82
2020-01-01 00:00:07.000 | 15.82
2020-01-01 00:00:08.000 | 15.82
2020-01-01 00:00:09.000 | 15.82
2020-01-01 00:00:10.000 | 15.72
2020-01-01 00:00:10.000 | 14.67
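As an aside: TimescaleDB also ships a locf() ("last observation carried forward") helper that works together with time_bucket_gapfill, which may collapse the join and window functions above into a single query. A sketch of that pattern, untested against the schema above (and the one-second shift from step 1 would still need to be applied):

select
    time_bucket_gapfill('1 second', time) as per_second,
    locf(last(value, db_id)) as latest_value
from timeseries
where time > <date_start> and time < <date_end>
group by per_second
order by per_second;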

Kafkaesque