750 million rows. You need an efficient index. Consider this function and the index based on it.
Assuming a table weather with a date column date:
CREATE FUNCTION f_mmdd(date) -- or timestamp input?
RETURNS int LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int';
CREATE INDEX weather_mmdd_idx ON weather(f_mmdd(date));
This index helps to quickly identify all rows for a particular day of the year.
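For illustration, the function simply encodes month and day in one integer (arbitrary example dates):

SELECT f_mmdd(date '2000-01-01');  -- 101
SELECT f_mmdd(date '1999-12-31');  -- 1231; the year is ignored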
See the manual about EXTRACT.
The above expression proved fastest for various reasons. I just re-ran some performance tests in Postgres 13, and nothing changed. Details are in a closely related answer.
There is also EXTRACT(doy FROM date) to extract the day of the year (1–365/366), which is even faster. But there is an obvious catch: in the Gregorian calendar, every date after Feb 29 in a leap year yields a doy one higher than the same calendar day in a common year, so the same doy does not identify the same day across years.
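A minimal demonstration of the shift (2023 is a common year, 2024 a leap year):

SELECT EXTRACT(doy FROM date '2023-03-01');  -- 60
SELECT EXTRACT(doy FROM date '2024-03-01');  -- 61; same calendar day, shifted by the leap day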
Then the query for Jan 01 can be:
SELECT date_trunc('day', date) -- if it's a timestamp column
-- date -- if it's really a date column (which I find hard to believe)
, avg(temperature) AS avg_temperature
, avg(rain) AS avg_rain
-- , ...
FROM weather
WHERE f_mmdd(date) = f_mmdd('2000-01-01') -- or just 101 for Jan 01
GROUP BY 1;
The year in f_mmdd('2000-01-01') is arbitrary. Or just use the integer 101 for Jan 01.
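To verify that the expression index is actually used, check the query plan; an index or bitmap index scan on weather_mmdd_idx should show up for a selective filter like this:

EXPLAIN
SELECT count(*)
FROM   weather
WHERE  f_mmdd(date) = 101;  -- Jan 01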
You might be able to optimize further with multicolumn indexes for particular dimensions (temperature, rain, ...). But that depends on undisclosed details.
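For example, if queries only ever aggregate temperature, a multicolumn expression index might enable index-only scans. A hypothetical sketch, to be adapted to the actual workload:

CREATE INDEX weather_mmdd_temp_idx ON weather (f_mmdd(date), temperature);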
Sounds like the dataset isn't going to change. So a MATERIALIZED VIEW with readily computed aggregates per day might be a better alternative in the long run.
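A minimal sketch, assuming the columns temperature and rain from the query above (the view name weather_mmdd_avg is made up), with one row per day of the year:

CREATE MATERIALIZED VIEW weather_mmdd_avg AS
SELECT f_mmdd(date) AS mmdd
     , avg(temperature) AS avg_temperature
     , avg(rain)        AS avg_rain
FROM   weather
GROUP  BY 1
ORDER  BY 1;

A query like SELECT * FROM weather_mmdd_avg WHERE mmdd = 101; then reads from (at most) 366 precomputed rows instead of scanning the big table.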
A word of warning: computed averages are only correct if the measurements are spread evenly across each day. Otherwise, the computed numbers are just averages of the given readings, not true average values for each day.
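If readings are not evenly spaced, one possible remedy is a time-weighted average, weighting each reading by the time until the next one. A rough sketch, assuming date is in fact a timestamp column with multiple readings per day (the last reading of each day has no successor and is dropped):

SELECT day
     , sum(temperature * w) / sum(w) AS avg_temperature  -- weighted mean
FROM  (
   SELECT date::date AS day, temperature
        , EXTRACT(epoch FROM lead(date) OVER (PARTITION BY date::date ORDER BY date) - date) AS w  -- seconds until the next reading of the same day
   FROM   weather
   ) sub
WHERE  w IS NOT NULL  -- drops each day's last reading
GROUP  BY day;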