5

I have some OHLCV data stored in TimescaleDB which contains missing data during certain time ranges. This data needs to be resampled to a different time period (i.e. 1 day) and contain contiguous, ordered time buckets.

TimescaleDB provides the function time_bucket_gapfill to do this. My current query is:

SELECT 
    time_bucket_gapfill(
        '1 day', 
        "timestamp",
        '2017-07-25 00:00', 
        '2018-01-01 00:00'
    ) as date,
    FIRST(open, "timestamp") as open,
    MAX(high) as high,
    MIN(low) as low,
    LAST(close, "timestamp") as close,
    SUM(volume) as volume
FROM ohlcv
WHERE "timestamp" > '2017-07-25'
GROUP BY date ORDER BY date ASC LIMIT 10

which gives as the results

date                    open        high        low         close       volume
2017-07-25 00:00:00+00                  
2017-07-26 00:00:00+00                  
2017-07-27 00:00:00+00  0.00992     0.010184    0.009679    0.010039    65553.5299999999
2017-07-28 00:00:00+00  0.00999     0.010059    0.009225    0.009248    43049.93
2017-07-29 00:00:00+00  
2017-07-30 00:00:00+00  0.009518    0.0098      0.009286    0.009457    40510.0599999999

...

Question: It looks like only the date column has been gapfilled. By modifying the SQL statement, is it possible to also gap-fill the columns open, high, low, close and volume such that we obtain the results:

date                    open        high        low         close       volume
2017-07-25 00:00:00+00  0           0           0           0           0               
2017-07-26 00:00:00+00  0           0           0           0           0               
2017-07-27 00:00:00+00  0.00992     0.010184    0.009679    0.010039    65553.5299999999
2017-07-28 00:00:00+00  0.00999     0.010059    0.009225    0.009248    43049.93
2017-07-29 00:00:00+00  0.009248    0.009248    0.009248    0.009248    0   
2017-07-30 00:00:00+00  0.009518    0.0098      0.009286    0.009457    40510.0599999999

...

Or is it recommended to perform this data inputation after receiving the query result, such as in Python/Nodejs?


Example of how it can be done using Python/pandas

Prefer to perform this gapfilling/inputation using TimescaleDB instead of using my Nodejs app because... using Nodejs to do this will be much slower and I do not want to introduce Python into the app just to do this processing

import pandas as pd

# Building the test dataset simulating missing values after time_bucket
data = [
    (pd.Timestamp('2020-01-01'), None, None, None, None, None),
    (pd.Timestamp('2020-01-02'), 100, 110, 90, 95, 3),
    (pd.Timestamp('2020-01-03'), None, None, None, None, None),
    (pd.Timestamp('2020-01-04'), 98, 150, 100, 100, 4),
]
df = pd.DataFrame(data, columns=['date', 'open' , 'high', 'low', 'close', 'volume']).set_index('date')

#              open   high    low  close  volume
# date                                          
# 2020-01-01    NaN    NaN    NaN    NaN     NaN
# 2020-01-02  100.0  110.0   90.0   95.0     3.0
# 2020-01-03    NaN    NaN    NaN    NaN     NaN
# 2020-01-04   98.0  150.0  100.0  100.0     4.0


# Perform gap filling
df.close = df.close.fillna(method='ffill')
df.volume = df.volume.fillna(0)                 # fill missing volume with 0
df['open'] = df['open'].fillna(df['close'])     # fill missing open by forward-filling close
df['high'] = df['high'].fillna(df['close'])     # fill missing high by forward-filling close
df['low'] = df['low'].fillna(df['close'])       # fill missing low by forward-filling close
df = df.fillna(0)                               # fill missing OHLC with 0 if no previous values available

#               open   high    low  close  volume
# date                                          
# 2020-01-01    0.0    0.0    0.0    0.0     0.0
# 2020-01-02  100.0  110.0   90.0   95.0     3.0
# 2020-01-03   95.0   95.0   95.0   95.0     0.0
# 2020-01-04   98.0  150.0  100.0  100.0     4.0
Brian Burns
  • 20,575
  • 8
  • 83
  • 77
Nyxynyx
  • 61,411
  • 155
  • 482
  • 830

2 Answers2

5
SELECT "tickerId",
       "ts",
       coalesce("open", "close")  "open",
       coalesce("high", "close")  "high",
       coalesce("low", "close")   "low",
       coalesce("close", "close") "close",
       coalesce("volume", 0)      "volume",
       coalesce("count", 0)       "count"

FROM (
     SELECT "tickerId",
            time_bucket_gapfill('1 hour', at)   "ts",
            first(price, "eId")                 "open",
            MAX(price)                          "high",
            MIN(price)                          "low",
            locf(last(price, "eId"))            "close",
            SUM(volume)                         "volume",
            COUNT(1)                            "count"
     FROM "PublicTrades"
     WHERE at >= date_trunc('day', now() - INTERVAL '1 year')
       AND at < NOW()
     GROUP BY "tickerId", "ts"
     ORDER BY "tickerId", "ts" DESC
     LIMIT 100
 ) AS P

Notice: eId is Exchange Public Trade ID

Brian Burns
  • 20,575
  • 8
  • 83
  • 77
Anton Korneychuk
  • 632
  • 1
  • 7
  • 12
  • 3
    One more notice: I found using `NOW()` makes queries to hyper-tables much slower. It'll work faster if you passing the pre-generated dates as strings to that kind of queries. I didn't investigate the reasons for it. Tested on Postgres 12 & TimsScaleDB 1.7.0 – Anton Korneychuk May 28 '20 at 00:46
  • This is the only thing that worked perfectly for me. Thank you @anton. – Rajesh Jul 30 '21 at 07:45
2

You need to specify in each column how to perform the gapfilling. My guess is that you probably want to use locf. See:

https://docs.timescale.com/latest/api#time_bucket_gapfill https://docs.timescale.com/latest/api#locf

Mike Freedman
  • 1,692
  • 9
  • 9
  • 1
    Is it possible to fill all the 4 columns `open`, `high`, `low`, `close` in the rows that have been 'gapfilled' by using a single value from the `close` column in the previous row? Also the `volume` column should always be filled with `0`. Currently I can only figure out how to use `locf` to forward fill the missing values using previous values from the same column and not from a different column, nor using a fixed value like `0`. – Nyxynyx Feb 17 '20 at 05:32
  • If using Python's `pandas` to describe what I am trying to achieve, we can do `df.volume.fillna(0)` followed by `df.open = df.open.fillna(df.close)`. Updated question with this example – Nyxynyx Feb 17 '20 at 05:57