
I have a SQL table of daily weather data, with columns including date, type (the kind of measurement: temperature, rain, wind, etc.) and value. The dataset spans 20 years.

How can I calculate averages for each calendar day and measurement type, averaging the values for that day across all 20 years? So e.g. I want to see the average temperature for 1 Jan (the average of the temperatures for 1 Jan 2020, 1 Jan 2019, etc.).

Given that there are 750 million rows in total, should I create a materialised view of the calculations, or what is the best way to cache the results?

Villahousut

4 Answers


It depends on which SQL database you use, but in general you extract the month and the day from the date (on Microsoft SQL Server that is the DATEPART function), then group by those and calculate the averages.

SELECT DATEPART(month, date_col) AS Month, 
   DATEPART(day, date_col) AS Day,
   AVG(temp) AS Temp,
   AVG(rain) AS Rain,
   ...
FROM table
GROUP BY DATEPART(month, date_col), DATEPART(day, date_col)
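
Note the query above assumes one column per measurement. With the narrow schema described in the question (date, type and value columns; names assumed), a sketch grouped by type could look like this:

SELECT DATEPART(month, date_col) AS Month,
   DATEPART(day, date_col) AS Day,
   type,
   AVG(value) AS AvgValue
FROM table
GROUP BY DATEPART(month, date_col), DATEPART(day, date_col), type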
Z .

You need to extract the month and day from the date. The standard SQL function for this is extract():

select extract(month from date) as month, extract(day from date) as day,
       avg(temperature), avg(rain), . . .
from t
group by extract(month from date), extract(day from date);

Not all databases support these standard functions so you may need to use the functions specific to your (unspecified) database.
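
For example, MySQL also offers month() and day() directly (a sketch reusing the table and column names from above):

select month(date) as month, day(date) as day,
       avg(temperature), avg(rain), . . .
from t
group by month(date), day(date);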

Gordon Linoff

There is an extension to PostgreSQL called TimescaleDB that makes it easier to query this type of data. Be aware that it makes changes to the PostgreSQL database that require changes to backup routines. And if the current database is partitioned, it will require a dump and restore.

A query can look like this:

-- By month
select
  extract(year from created_at) as year,
  extract(month from time_bucket('1 day', created_at)) as month,
  min(temp) as temp
from
  readings
where
  created_at > '2019-01-01' and created_at < '2020-01-01'
group by
  year,
  month
order by
  year,
  month;
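
For the daily averages in the question, a TimescaleDB continuous aggregate can pre-compute per-day roll-ups, so the 20-year per-calendar-day averages run against a much smaller table. A sketch, assuming readings is already a hypertable and has the question's type and value columns:

create materialized view daily_readings
with (timescaledb.continuous) as
select
  time_bucket('1 day', created_at) as bucket,
  type,
  avg(value) as avg_value
from
  readings
group by
  bucket,
  type;

-- average per calendar day across all years, from the roll-up
select
  extract(month from bucket) as month,
  extract(day from bucket) as day,
  type,
  avg(avg_value) as avg_value
from
  daily_readings
group by
  month,
  day,
  type;

Note that an average of daily averages only equals the average of the raw values when each day has the same number of readings.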
kometen

750 million rows. You need an efficient index. Consider this function and an index based on it.
Assuming a table weather with a date column date:

CREATE FUNCTION f_mmdd(date)  -- or timestamp input?
   RETURNS int LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int';

CREATE INDEX weather_mmdd_idx ON weather(f_mmdd(date));

This index helps to quickly identify all rows for a particular day of the year.

The manual about EXTRACT.

The above expression proved fastest for various reasons. Just re-ran some performance tests in Postgres 13, and nothing changed.
Details are in this closely related answer.

There is also EXTRACT(doy FROM date) to extract the day of the year (1–365/366), which is even faster. But, obviously, there is an off-by-one error for dates past Feb 29 in leap years in the Gregorian calendar.

Then the query for Jan 01 can be:

SELECT date_trunc('day', date) --  if it's a timestamp column
    -- date  --  if it's really a date column (which I find hard to believe) 
     , avg(temperature) AS avg_temperature
     , avg(rain) AS avg_rain
  -- , ...
FROM   weather
WHERE  f_mmdd(date) = f_mmdd('2000-01-01')  -- or just 101 for Jan 01
GROUP  BY 1;

The year in f_mmdd('2000-01-01') is arbitrary. Or just use the integer 101 for Jan 01.

You might be able to optimize further with multicolumn indexes for particular dimensions (temperature, rain, ...). But that depends on undisclosed details.
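
For instance, if the narrow schema from the question applies (a type column next to value), a multicolumn expression index could cover both the day filter and the measurement type; this is an assumption, since the actual schema isn't disclosed:

CREATE INDEX weather_mmdd_type_idx ON weather (f_mmdd(date), type);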

Sounds like the dataset isn't going to change. So a MATERIALIZED VIEW with readily computed aggregates per day might be a better alternative in the long run.
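
A minimal sketch of such a view, assuming the narrow schema from the question (date, type, value; names are illustrative):

CREATE MATERIALIZED VIEW weather_daily_avg AS
SELECT f_mmdd(date) AS mmdd  -- calendar day as integer, e.g. 101 for Jan 01
     , type
     , avg(value) AS avg_value
FROM   weather
GROUP  BY 1, 2;

CREATE UNIQUE INDEX ON weather_daily_avg (mmdd, type);

Run REFRESH MATERIALIZED VIEW weather_daily_avg; after new data arrives; the unique index also allows REFRESH MATERIALIZED VIEW CONCURRENTLY.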

A word of warning: Computed averages are only correct if the measurements are spread out evenly across each day. Else, computed numbers are just averages of the given numbers, not actual average values for each day.

Erwin Brandstetter