
I'm trying to build a pandas DataFrame from a query against a PostgreSQL TimescaleDB. The query below works, but the problem is that it returns an enormous result set for the one point I am after, /5231/REGCHG total_power. The data has been recording for a few years at 1-minute intervals, and I can get it all ordered nicely with the ORDER BY minute, but I don't need that much data. I only need to go back 3 days from the current day.

I'm trying to average the data into 15-minute increments, so 3 days in 15-minute increments is the last 288 rows of the query below. Would a quick hack be just returning the tail(288) of this query, if that is possible in SQL? Not a lot of wisdom here; any tips appreciated.

I was also trying the ORDER BY minute LIMIT 3000 that I have commented out below, but it would only limit to the first 3000 entries, from when the data first started recording. I need the last entries, not the first; hopefully this makes sense.

import datetime

import pandas as pd
from sqlalchemy import create_engine


SQL_PASS = 'asdf'
SQL_USER = 'asf'
SQL_HOST = '10.10.10.10'
dsn = f'postgresql://{SQL_USER}:{SQL_PASS}@{SQL_HOST}:5432/my_db'
dbengine = create_engine(dsn)

qry = '''
SELECT
  time_bucket('15 minute', "time") AS minute,
  metric,
  avg(value)
FROM slipstream_volttron
WHERE metric LIKE '/5231/REGCHG total_power' 
GROUP BY minute, metric
ORDER BY minute
'''
# ORDER BY minute LIMIT 3000  (commented out: it limits the oldest rows, not the newest)

dbconn = dbengine.connect()  # pandas read_sql prefers a SQLAlchemy connection over raw psycopg2


t1 = datetime.datetime.utcnow()
df = pd.read_sql(qry, dbconn, index_col='minute', parse_dates=['minute'])
t2 = datetime.datetime.utcnow()

duration = t2 - t1
seconds = duration.total_seconds()

hours = int(seconds // 3600)
minutes = int((seconds % 3600) // 60)
seconds = seconds % 60

elapsed_time = f'db retrieval time is {hours} hours, {minutes} minutes, {seconds:.1f} seconds'
print(elapsed_time)

print(df.columns)
print(df.head())
print(df.describe())

df.to_csv('main_meter_data.csv')
dbconn.close()
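
For reference, the tail(288) hack mentioned above can be done on the pandas side after the query returns, though it still transfers the full result set from the database first:

# 3 days of 15-minute buckets = 3 * 24 * 4 = 288 rows;
# this trims after the fact -- the full query result is still fetched
df_recent = df.tail(288)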

EDIT

In pandas, when I print the DataFrame, I am noticing that the SQL query isn't returning the most recent data. For example, if I add a print of the time and the df to my script:

import time  # needed for time.ctime()

df = pd.read_sql(qry, dbconn, index_col='minute', parse_dates=['minute'])
print(time.ctime())
print("df: ", df)

This returns a current time of 9:13 AM, but the data drops off at 2023-01-27 15:00:00  34898.357143, which is way out into the future relative to my local clock...is this UTC time?

Fri Jan 27 09:13:34 2023
df:                                avg
minute                           
2023-01-24 15:15:00  35075.933333
2023-01-24 15:30:00  33908.866667
2023-01-24 15:45:00  33375.666667
2023-01-24 16:00:00  34529.133333
2023-01-24 16:15:00  34748.533333
...                           ...
2023-01-27 14:00:00  35977.200000
2023-01-27 14:15:00  35458.333333
2023-01-27 14:30:00  37695.933333
2023-01-27 14:45:00  36792.000000
2023-01-27 15:00:00  34898.357143
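
If the timestamps do turn out to be naive UTC, a minimal pandas sketch for converting them to a local zone (the zone name here is just an example):

# assumes the index is a naive DatetimeIndex holding UTC times
df.index = df.index.tz_localize('UTC').tz_convert('America/Chicago')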
bbartling

2 Answers


I need the last entries not the first

So use DESCENDING sort order:

SELECT time_bucket('15 minute', "time") AS minute
     , avg(value)
FROM  (
   SELECT time, value
   FROM   slipstream_volttron
   WHERE  metric = '/5231/REGCHG total_power' 
   ORDER  BY time DESC  -- !!!
   LIMIT  4320  -- 24 * 60 * 3
   ) sub
GROUP  BY minute
ORDER  BY minute DESC;

And limit in a subquery before you aggregate. This way, an index on (metric, time) gives you a (bitmap) index scan. Append INCLUDE (value) to the index if that gives you index-only scans.
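
A sketch of such an index (the index name is made up for illustration; INCLUDE requires Postgres 11 or later):

CREATE INDEX slipstream_metric_time_idx
    ON slipstream_volttron (metric, "time")
    INCLUDE (value);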

And maybe add something like

WHERE time > '2023-1-1'

to the index to make it a very small partial index. And mirror the same filter in the query (logically redundantly) to match the index.
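
A sketch under those assumptions (index name and cutoff date are illustrative):

CREATE INDEX slipstream_metric_time_recent_idx
    ON slipstream_volttron (metric, "time")
    INCLUDE (value)
    WHERE "time" > '2023-1-1';

-- and mirror the filter in the subquery, logically redundant but matching the index:
WHERE  metric = '/5231/REGCHG total_power'
AND    "time" > '2023-1-1'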

Since Postgres 14 you can also use date_bin() from core Postgres instead of time_bucket().
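
For illustration, the two bucket expressions side by side (the origin timestamp in date_bin() is arbitrary; it only anchors the bucket boundaries):

time_bucket('15 minute', "time")                                  -- TimescaleDB
date_bin(interval '15 minutes', "time", timestamp '2023-01-01')   -- Postgres 14+ core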

Erwin Brandstetter
  • I made an edit to my post asking whether the SQL time series data is in UTC time. Any chance for a comment on the `EDIT`? Greatly appreciated – bbartling Jan 27 '23 at 15:19
  • @bbartling: Hard to comment on that. Depends on actual datatypes in use, actual data in your table, and possibly the time zone setting of your session. Start a new question disclosing relevant details. Basics: https://stackoverflow.com/a/9576170/939860 – Erwin Brandstetter Jan 30 '23 at 15:31

As suggested above, include a WHERE clause on the time column:

WHERE time > now() - interval '3 days'

TimescaleDB will be able to use this predicate to prune chunks more efficiently, so it only plans and executes against chunks that can contain matching rows.
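
Combined with the aggregation from the question, a sketch of the full three-day query:

SELECT time_bucket('15 minute', "time") AS minute
     , avg(value)
FROM   slipstream_volttron
WHERE  metric = '/5231/REGCHG total_power'
AND    "time" > now() - interval '3 days'
GROUP  BY minute
ORDER  BY minute;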

Mike Freedman