How to aggregate data from multiple years on MM-DD, ignoring year

Question

Postgres version 9.4.18, PostGIS Version 2.2.

Here are the tables I'm working with (and can unlikely make significant changes to the table structure):

Table ltg_data (spans 1988 to 2018):

 Column   |           Type           | Modifiers 
----------+--------------------------+-----------
intensity | integer                  | not null
time      | timestamp with time zone | not null
lon       | numeric(9,6)             | not null
lat       | numeric(8,6)             | not null
ltg_geom  | geometry(Point,4269)     | 
Indexes:
"ltg_data2_ltg_geom_idx" gist (ltg_geom)
"ltg_data2_time_idx" btree ("time")

Size of ltg_data (~800M rows):

ltg=# select pg_relation_size('ltg_data');
pg_relation_size 
------------------
 149729288192

Table counties:

  Column   |            Type             |                       Modifiers                      
-----------+-----------------------------+--------------------------------- -----------------------
gid        | integer                     | not null default nextval('counties_gid_seq'::regclass)
objectid_1 | integer                     | 
objectid   | integer                     | 
state      | character varying(2)        | 
cwa        | character varying(9)        | 
countyname | character varying(24)       | 
fips       | character varying(5)        | 
time_zone  | character varying(2)        | 
fe_area    | character varying(2)        | 
lon        | double precision            | 
lat        | double precision            | 
the_geom   | geometry(MultiPolygon,4269) | 
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_gix" gist (the_geom)
"county_cwa_idx" btree (cwa)
"countyname_cwa_idx" btree (countyname)

Desired result: I want a time series with one row for every day of the year in format 'MM-DD' ignoring the year: 01-01, 01-02, 01-03, ..., 12-31. And the count of rows in table ltg_data for each day of the year. I also eventually want the same thing for every hour of every day of the year ('MM-DD-HH').

A group by statement should accomplish this, but I'm having a hard time joining the "big" table with the days generated with generate_series().

MM-DD  | total_count   
-------+------------
12-22  |       9
12-23  |       0
12-24  |       0
12-25  |       0
12-26  |      23
12-27  |       0
12-28  |       5
12-29  |       0
12-30  |       0
12-31  |       0

Some of my many attempted queries:

SELECT date_trunc('day', d),
   count(a.lat) AS strikes
FROM generate_series('2017-01-01', '2018-12-31', interval '1 day') AS d
LEFT JOIN
(SELECT date_trunc('day', TIME) AS day_of_year,
      ltg_data.lat
 FROM ltg_data
 JOIN counties ON ST_contains(counties.the_geom, ltg_data.ltg_geom)
 WHERE cwa = 'MFR' ) AS a ON d = day_of_year
GROUP BY d
ORDER BY d ASC;

But this doesn't ignore year. I shouldn't be surprised because the "day" in date_trunc is still considering the year I guess.

2017-12-27 00:00:00-08 |       0
2017-12-28 00:00:00-08 |       0
2017-12-29 00:00:00-08 |       0
2017-12-30 00:00:00-08 |       0
2017-12-31 00:00:00-08 |       0
2018-01-01 00:00:00-08 |       0
2018-01-02 00:00:00-08 |       12
2018-01-03 00:00:00-08 |       0

And this query, in which I'm trying to convert the data from generate_series() to text in 'DD-MM' format to join to the ltg_data table in text format. Says the data types don't match. I've tried extract as well, since that could provide "doy" and "hour", which would work, but I can't seem to match data types in that query either. It's hard to make that "generate_series" a double precision.

SELECT to_char(d, 'MM-DD') AS DAY,
   count(a.lat) AS strikes
FROM
(SELECT generate_series('2017-01-01', '2018-12-31', interval '1 day') AS d) 
AS f
LEFT JOIN
(SELECT to_char(TIME, 'MM-DD') AS day_of_year,
      ltg_data.lat
FROM ltg_data
JOIN counties ON ST_contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR' ) AS a ON f = day_of_year
GROUP BY d
ORDER BY d ASC;

Result:

ERROR:  operator does not exist: record = text
LINE 4: ON f = day_of_year group by d order by d asc;
         ^
HINT:  No operator matches the given name and argument type(s). You might 
need to add explicit type casts.

Conclusion: I'm aiming at getting daily and hourly total counts that span many years but group by 'MM-DD' and 'MM-DD-HH' (ignoring year), with the query results showing all days/hours even if they are zero.

Later I'll also try to find averages and percentiles for days and hours, so if you have any advice on that, I'm all ears. But my current problem is focused on just getting a complete result for totals.

`My desired result is to left join ...`. No. Please explain your desired *result* in plain English first. Don't start with the techniques you have in mind to get there. Define your time frame (which rows from the table are considered) and which days are supposed to be in the result. — Erwin Brandstetter, May 16 '18 at 16:15
Sorry about that. I added version now to the top of the question. I tried a lot of queries for generate_series. All I want generate_series to do is to generate a series of months and days (and months, days, and hours for the next query I try)...to left join to the data I have that spans 30 years. I shouldn't have put the "2 years" of generate_series data in there because that's a little misleading. It seems I'm having a hard time generating a series that doesn't correspond to a specific time. I want to join on month and day (day of year) and eventually hour...while ignoring year. — user1610717, May 16 '18 at 16:16

Erwin Brandstetter · Accepted Answer · 2018-05-20T01:23:27.583

Basically, to cut off the year, to_char(time, 'MMDD') like you already tried does the job. You just forgot to also apply it to the timestamps generated with generate_series()before joining. And some other minor details.

To simplify and for performance and convenience I suggest this simple function to calculate an integer from the pattern 'MMDD' of a given timestamp.

CREATE FUNCTION f_mmdd(date) RETURNS int LANGUAGE sql IMMUTABLE AS
'SELECT (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int';

I used to_char(time, 'MMDD') at first, but switched to the above expression that turned out to be fastest in various tests.

db<>fiddle here

It can be used in expression indexes since it's defined IMMUTABLE. And it still allows function inlining because it only uses EXTRACT (xyz FROM date) - which is implemented with the IMMUTABLE function date_part(text, date) internally. (Note that datepart(text, timestamptz) is only STABLE).

Then this kind of query does the job:

SELECT d.mmdd, COALESCE(ct.ct, 0) AS total_count
FROM  (
   SELECT f_mmdd(d::date) AS mmdd  -- ignoring the year
   FROM   generate_series(timestamp '2018-01-01'  -- any dummy year
                        , timestamp '2018-12-31'
                        , interval '1 day') d
   ) d
LEFT  JOIN (
   SELECT f_mmdd(time::date) AS mmdd, count(*) AS ct
   FROM   counties c
   JOIN   ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
   WHERE  cwa = 'MFR'
   GROUP  BY 1
   ) ct USING (mmdd)
ORDER  BY 1;

Since time (I would use a different column name) is data type timestamptz the cast time::date depends on the time zone setting of your current session. ("Days" are defined by the time zone you are in.) To get immutable (but slower) results use the AT TIME ZONE construct with a time zone name like:

SELECT f_mmdd((time AT TIME ZONE 'Europe/Vienna')::date) ...

Details:

Ignoring time zones altogether in Rails and PostgreSQL

Format mmdd any way you like for display.

The cast to integer is optional for the purpose of this particular query. But since you plan to do all kinds of queries, you'll end up wanting an index on the expression:

CREATE INDEX ltg_data_mmdd_idx ON event(f_mmdd(time));

(Not needed for this query.)
integer is a bit faster for this purpose. ~~And you need the (otherwise optional) function wrapper for this since to_char() is only defined STABLE, but we need IMMUTABLE for the index.~~ The updated expression (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int is IMMUTABLE, but the function wrapper is still convenient.

Thanks so much Erwin. It mostly makes sense...will have read up some. I just used my first function! I know, it's horrible to say that. Is it too much to ask how I'm supposed to create a function that does the same thing, but for hours? I created a function using "CREATE FUNCTION f_mmddhh(date) RETURNS int LANGUAGE sql IMMUTABLE as $$SELECT to_char($1, 'MMDDHH')::int$$ but it is returning toals for the day still and just 12s for the hour. I read your whole post but couldn't glean how I'm supposed to use it for hours. Thanks for any help. I can post a new question if this is too unrelated. — user1610717, May 16 '18 at 19:01
I changed the ::date to ::timestamp in the function and the query, and now I'm getting reasonable results, but it's only showing 12 hours and duplicating hours. Working on it. — user1610717, May 16 '18 at 19:12
Okay, think I have it...in case this helps anyone...I needed to use HH24 instead of HH to get the full 24 hour period. — user1610717, May 16 '18 at 19:16
@user1610717: It's not horrible at all to ask a question. This is a Q&A site! But do it in another *question*, comments are not the place. You can always link to this one for context and leave a comment here to link back (and get my attention). — Erwin Brandstetter, May 16 '18 at 21:15
I figured it out for the HH24. Not only did I have to use HH24, but I needed to use f_mmddhh(timestamp) because timestamp works with hours and date did not. Probably obvious, but in case anyone needs this. Thanks again! For any additional questions, I'll post as a full new question. — user1610717, May 19 '18 at 18:08
@user1610717: I found a function based on `(EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int` to perform better since it allows function inlining. Consider the update. — Erwin Brandstetter, May 20 '18 at 01:24

How to aggregate data from multiple years on MM-DD, ignoring year

1 Answers1

Linked