3

I'm running Postgres 9.2, and have a table of temperatures and timestamps, one timestamp per minute in millisecond epoch time:

weather=# \d weather_data
      Table "public.weather_data"
   Column    |     Type     | Modifiers 
-------------+--------------+-----------
 timestamp   | bigint       | not null
 sensor_id   | integer      | not null
 temperature | numeric(4,1) | 
 humidity    | integer      | 
 date        | date         | not null
Indexes:
    "weather_data_pkey" PRIMARY KEY, btree ("timestamp", sensor_id)
    "weather_data_date_idx" btree (date)
    "weather_data_humidity_idx" btree (humidity)
    "weather_data_sensor_id_idx" btree (sensor_id)
    "weather_data_temperature_idx" btree (temperature)
    "weather_data_time_idx" btree ("timestamp")
Foreign-key constraints:
    "weather_data_sensor_id_fkey" FOREIGN KEY (sensor_id) REFERENCES weather_sensors(sensor_id)

weather=# select * from weather_data order by timestamp desc;
   timestamp   | sensor_id | temperature | humidity |    date    
---------------+-----------+-------------+----------+------------
 1483272420000 |         2 |        22.3 |       57 | 2017-01-01
 1483272420000 |         1 |        24.9 |       53 | 2017-01-01
 1483272360000 |         2 |        22.3 |       57 | 2017-01-01
 1483272360000 |         1 |        24.9 |       58 | 2017-01-01
 1483272300000 |         2 |        22.4 |       57 | 2017-01-01
 1483272300000 |         1 |        24.9 |       57 | 2017-01-01
[...]

I have this existing query that gets the highs and lows of each day, but not the specific time at which each high or low occurred:

WITH t AS (
    SELECT date, highest, lowest
    FROM (
        SELECT date, max(temperature) AS highest
        FROM weather_data
        WHERE sensor_id = (SELECT sensor_id FROM weather_sensors WHERE sensor_name = 'outdoor')
        GROUP BY date
        ORDER BY date ASC
    ) h
    INNER JOIN (
        SELECT date, min(temperature) AS lowest
        FROM weather_data
        WHERE sensor_id = (SELECT sensor_id FROM weather_sensors WHERE sensor_name = 'outdoor')
        GROUP BY date
        ORDER BY date ASC
    ) l
    USING (date)
    ORDER BY date DESC
)
SELECT * from t ORDER BY date ASC;

There are a bit over two million rows in the table and the query takes ~1.2 seconds to run, which isn't too bad. I now want to also get the specific time at which each high or low occurred. I came up with the following using window functions; it does work, but takes ~5.6 seconds:

SELECT h.date, high_time, high_temp, low_time, low_temp FROM (
    SELECT date, high_temp, high_time FROM (
        SELECT date, temperature AS high_temp, timestamp AS high_time, row_number()
        OVER (PARTITION BY date ORDER BY temperature DESC, timestamp DESC)
        FROM weather_data
        WHERE sensor_id = (SELECT sensor_id FROM weather_sensors WHERE sensor_name = 'outdoor')
    ) highs
    WHERE row_number = 1
) h
INNER JOIN (
    SELECT * FROM (
        SELECT date, temperature AS low_temp, timestamp AS low_time, row_number()
        OVER (PARTITION BY date ORDER BY temperature ASC, timestamp DESC)
        FROM weather_data
        WHERE sensor_id = (SELECT sensor_id FROM weather_sensors WHERE sensor_name = 'outdoor')
    ) lows
    WHERE row_number = 1
) l
ON h.date = l.date
ORDER BY h.date ASC;

Is there some relatively simple addition to the first query that I can make that won't add a large amount of execution time? I assume there is, but I think I'm at the point where I've been looking at the problem for too long!

VirtualWolf
  • Possible duplicate of [PostgreSQL - fetch the row which has the Max value for a column](http://stackoverflow.com/questions/586781/postgresql-fetch-the-row-which-has-the-max-value-for-a-column) – Joe Jan 01 '17 at 12:52
  • Unrelated, but: the `order by` in the derived tables in the first query is useless – a_horse_with_no_name Jan 01 '17 at 12:56
  • @a_horse_with_no_name Noted, thanks! – VirtualWolf Jan 02 '17 at 06:53

2 Answers

3
SELECT  
        DISTINCT ON (zdate) zdate
        , first_value(ztimestamp) OVER www AS stamp_at_min
        , first_value(temperature) OVER www AS tmin
        , last_value(ztimestamp) OVER www AS stamp_at_max
        , last_value(temperature) OVER www AS tmax
FROM weather_data
WHERE sensor_id = 2
WINDOW www AS (PARTITION BY zdate ORDER BY temperature, ztimestamp
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
        ;

  • I prefixed the column names as zdate and ztimestamp (date and timestamp are keywords); a version using the question's own column names is sketched below
  • I added ztimestamp to the ordering as a tie-breaker
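
Using the question's own column names (which PostgreSQL accepts unquoted, as the queries in the question show) and the sensor_name lookup instead of a hard-coded sensor_id, the same approach would look roughly like this; treat it as an untested sketch:

SELECT
        DISTINCT ON (date) date
        , first_value(timestamp)   OVER www AS stamp_at_min   -- time of the day's low
        , first_value(temperature) OVER www AS tmin            -- the day's low
        , last_value(timestamp)    OVER www AS stamp_at_max   -- time of the day's high
        , last_value(temperature)  OVER www AS tmax            -- the day's high
FROM weather_data
WHERE sensor_id = (SELECT sensor_id FROM weather_sensors WHERE sensor_name = 'outdoor')
WINDOW www AS (PARTITION BY date ORDER BY temperature, timestamp
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
ORDER BY date;
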
wildplasser
  • That works nicely, thanks! Is there any additional index-related trickery that can be done to speed things up (it takes ~3.7 seconds to run), or is it at the point where there's not much that can be optimised for this sort of thing? – VirtualWolf Jan 01 '17 at 23:52
  • Your table basically has two candidate keys: your PK and possibly the {zdate, sensor_id, temperature, ...} which is not exactly unique. In any case, I think you should get rid of the single-column indexes. And zdate *could* be functionally dependent on ztimestamp (which *could* be a timestamp instead of an int) – wildplasser Jan 01 '17 at 23:58
  • Getting _rid_ of the single-column indexes? Interesting. I have a number of other (simpler) unrelated queries that I run over this table, which I'm guessing would end up being dramatically slower without indexes, no? – VirtualWolf Jan 02 '17 at 00:18
  • I cannot know your other queries... For this particular query, I would choose {sensor_id, zdate, temperature, ...} – wildplasser Jan 02 '17 at 00:22
  • I gave that a go and it didn't make any appreciable difference. Might just have to live with it taking a while. :) Thanks again! – VirtualWolf Jan 02 '17 at 00:28
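
For illustration, the {sensor_id, zdate, temperature, ...} index suggested in the comments would be created roughly like this against the question's actual table (the index name is just an example):

-- composite index as suggested above; the name is illustrative
CREATE INDEX weather_data_sensor_date_temp_idx
    ON weather_data (sensor_id, date, temperature);

As the last comment notes, whether the planner actually benefits from it for this query is best checked with EXPLAIN.
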
2

This does the same as your second query, but only needs a single scan over the weather_data table:

select date, 
       max(case when high_rn = 1 then timestamp end) as high_time, 
       max(case when high_rn = 1 then temperature end) as high_temp, 
       max(case when low_rn = 1 then timestamp end) as low_time, 
       max(case when low_rn = 1 then temperature end) as low_temp
from (
  select timestamp, temperature, date, 
         row_number() OVER (PARTITION BY date ORDER BY temperature DESC, timestamp DESC) as high_rn,
         row_number() OVER (PARTITION BY date ORDER BY temperature ASC, timestamp DESC) as low_rn
  from weather_data
  where sensor_id = ...
) t
where (high_rn = 1 or low_rn = 1)
group by date;   

It uses conditional aggregation to do a crosstab (a.k.a. "pivot") over the filtered result, which contains only the rows holding each day's minimum and maximum temperature.
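
In isolation, that pivot pattern looks like this (a throwaway example on a VALUES list, not the weather table):

-- each group has one 'hi' and one 'lo' row; the CASE picks the matching row per output column
SELECT grp,
       max(CASE WHEN tag = 'hi' THEN val END) AS hi_val,
       max(CASE WHEN tag = 'lo' THEN val END) AS lo_val
FROM (VALUES ('a', 'hi', 10), ('a', 'lo', 1),
             ('b', 'hi', 20), ('b', 'lo', 2)) AS t(grp, tag, val)
GROUP BY grp;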


Unrelated, but: date and timestamp are horrible names for columns. For one thing because they are keywords, but more importantly because they don't document the actual meaning of the column. Is it a "due date"? A "reading date"? A "processing date"?
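
If you did want to rename them, it's a one-off ALTER TABLE per column; the new names here are only examples:

-- example names only; any existing queries would need updating to match
ALTER TABLE weather_data RENAME COLUMN "timestamp" TO reading_time;
ALTER TABLE weather_data RENAME COLUMN "date" TO reading_date;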

  • Thanks! This one takes ~5.2 seconds to run, versus 3.7 seconds for the one above. The column names are all the time and date that that particular temperature reading was taken, so I guess reading date and reading time. It's a personal project and is just me working on it (simply keeping the current temperature for inside and outside of my house). :) – VirtualWolf Jan 01 '17 at 22:47
  • Hah, I just remembered I needed to add a `temperature != 21.8` because the temperature sensors occasionally go weird and send a value of 21.8 to my application. After adding a subquery for the window function to run over to @wildplasser's query, and adding the simple `where temperature != 21.8` to yours, they're both now within about 100ms of each other! – VirtualWolf Jan 02 '17 at 07:09
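
For reference, that filter can also be added straight to the WHERE clause of the first answer's query, since WHERE is applied before the window functions run; a rough sketch (not necessarily the exact change described in the comment above):

SELECT
        DISTINCT ON (zdate) zdate
        , first_value(ztimestamp) OVER www AS stamp_at_min
        , first_value(temperature) OVER www AS tmin
        , last_value(ztimestamp) OVER www AS stamp_at_max
        , last_value(temperature) OVER www AS tmax
FROM weather_data
WHERE sensor_id = 2
  AND temperature <> 21.8   -- ignore the occasional bogus 21.8 readings
WINDOW www AS (PARTITION BY zdate ORDER BY temperature, ztimestamp
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                )
        ;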