5

Postgres version 9.4.18, PostGIS Version 2.2.

Here are the tables I'm working with (and can unlikely make significant changes to the table structure):

Table ltg_data (spans 1988 to 2018):

Column   |           Type           | Modifiers 
----------+--------------------------+-----------
intensity | integer                  | not null
time      | timestamp with time zone | not null
lon       | numeric(9,6)             | not null
lat       | numeric(8,6)             | not null
ltg_geom  | geometry(Point,4269)     | 
Indexes:
"ltg_data2_ltg_geom_idx" gist (ltg_geom)
"ltg_data2_time_idx" btree ("time")

Size of ltg_data (~800M rows):

ltg=# select pg_relation_size('ltg_data');
pg_relation_size 
------------------
149729288192

Table counties:

 Column   |            Type             |                       Modifiers                      
-----------+-----------------------------+---------------------------------        -----------------------
gid        | integer                     | not null default        
nextval('counties_gid_seq'::regclass)
objectid_1 | integer                     | 
objectid   | integer                     | 
state      | character varying(2)        | 
cwa        | character varying(9)        | 
countyname | character varying(24)       | 
fips       | character varying(5)        | 
time_zone  | character varying(2)        | 
fe_area    | character varying(2)        | 
lon        | double precision            | 
lat        | double precision            | 
the_geom   | geometry(MultiPolygon,4269) | 
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_gix" gist (the_geom)
"county_cwa_idx" btree (cwa)
"countyname_cwa_idx" btree (countyname)

I have a query that calculates the total number of rows per day of the year (month-day) spanning the 30 years. With the help of Stackoverflow, the query to get these counts is working fine. Here's the query and results, using the following function.

Function:

CREATE FUNCTION f_mmdd(date) RETURNS int LANGUAGE sql IMMUTABLE AS
$$SELECT to_char($1, 'MMDD')::int$$;

Query:

SELECT d.mmdd, COALESCE(ct.ct, 0) AS total_count
FROM  (
SELECT f_mmdd(d::date) AS mmdd  -- ignoring the year
FROM   generate_series(timestamp '2018-01-01'  -- any dummy year
                    , timestamp '2018-12-31'
                    , interval '1 day') d
) d
LEFT  JOIN (
SELECT f_mmdd(time::date) AS mmdd, count(*) AS ct
FROM   counties c
JOIN   ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE  cwa = 'MFR'
GROUP  BY 1
) ct USING (mmdd)
ORDER  BY 1;

Results:

mmdd       total_count
725 |        2126
726 |         558
727 |           2
728 |           2
729 |           2
730 |           0
731 |           0
801 |           0
802 |          10

Desired Results: I'm trying to find other statistical information about the counts for the days of the year. For instance, I know on July 25 (725 in the table below) that the total count over the many years that are in the table is 2126. What I'm looking for is the max daily count for July 25 (725), percent of years that that day is not zero, the min, percent years where count(*) is not zero, percentiles (10th percentile, 25th percentile, 50th percentile, 75th percentile, 90th percentile, and stdev would be useful too). It would be good to see what year the max_daily occurred. I guess if there haven't been any counts for that day in all the years, the year_max_daily would be blank or zero.

mmdd       total_count  max daily  year_max_daily   percent_years_count_not_zero  10th percentile_daily   90th percentile_daily
725 |        2126         1000          1990                 30                          15                   900
726 |         558          120          1992                 20                          10                   80
727 |           2            1          1991                 2                            0                   1
728 |           2            1          1990                 2                            0                   1
729 |           2            1          1989                 2                            0                   1
730 |           0            0                               0                            0                   0 
731 |           0            0                               0                            0                   0 
801 |           0            0                               0                            0                   0
802 |          10           10          1990                 0                            1                   8

What I've tried thus far just isn't working. It returns the same results as total. I think it's because I'm just trying to get an avg after the totals have already been calculated, so I'm not really looking at the counts for each day of each year and finding the average.

Attempt:

SELECT AVG(CAST(total_count as FLOAT)), day
FROM
(
SELECT d.mmdd as day, COALESCE(ct.ct, 0) as total_count
FROM (
SELECT f_mmdd(d::date) AS mmdd
FROM generate_series(timestamp '2018-01-01', timestamp '2018-12-31',     interval '1 day') d
) d
LEFT JOIN (

SELECT mmdd, avg(q.ct) FROM (

SELECT f_mmdd((time at time zone 'utc+12')::date) as mmdd, count(*) as ct
FROM counties c
JOIN ltg_data d on ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY 1
) 

) as q

ct USING (mmdd)
ORDER BY 1

Thanks for any help!

Vladimir Baranov
  • 31,799
  • 5
  • 53
  • 90
user1610717
  • 471
  • 5
  • 16

1 Answers1

5

I haven't included calculations for all requested stats - there is too much in one question, but I hope that you'd be able to extend the query below and add extra stats that you need.

I'm using CTE below to make to query readable. If you want, you can put it all in one huge query. I'd recommend to run the query step-by-step, CTE-by-CTE and examine intermediate results to understand how it works.

CTE_Dates is a simple list of all possible dates for 30 years.

CTE_DailyCounts is a list of basic counts for each day for 30 years (I took your existing query for that).

CTE_FullStats is again a list of all dates together with some stats calculated for each (month,day) using window functions with partitioning by month,day. ROW_NUMBER there is used to get a date where the count was the largest for each year.

Final query selects only one row with the largest count for the year along with the rest of the information.

I didn't try to run the query, because the question doesn't have sample data, so there may be some typos.

WITH
CTE_Dates
AS
(
    SELECT
        d::date AS dt
        ,EXTRACT(MONTH FROM d::date) AS dtMonth
        ,EXTRACT(DAY FROM d::date) AS dtDay
        ,EXTRACT(YEAR FROM d::date) AS dtYear
    FROM
        generate_series(timestamp '1988-01-01', timestamp '2018-12-31', interval '1 day') AS d
        -- full range of possible dates
)
,CTE_DailyCounts
AS
(
    SELECT
        time::date AS dt
        ,count(*) AS ct
    FROM
        counties c
        INNER JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
    WHERE cwa = 'MFR'
    GROUP BY time::date
)
,CTE_FullStats
AS
(
    SELECT
        CTE_Dates.dt
        ,CTE_Dates.dtMonth
        ,CTE_Dates.dtDay
        ,CTE_Dates.dtYear
        ,CTE_DailyCounts.ct
        ,SUM(CTE_DailyCounts.ct) OVER (PARTITION BY dtMonth, dtDay) AS total_count
        ,MAX(CTE_DailyCounts.ct) OVER (PARTITION BY dtMonth, dtDay) AS max_daily
        ,SUM(CASE WHEN CTE_DailyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY dtMonth, dtDay) AS nonzero_day_count
        ,COUNT(*) OVER (PARTITION BY dtMonth, dtDay) AS years_count
        ,100.0 * SUM(CASE WHEN CTE_DailyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY dtMonth, dtDay) 
        / COUNT(*) OVER (PARTITION BY dtMonth, dtDay) AS percent_years_count_not_zero
        ,ROW_NUMBER() OVER (PARTITION BY dtMonth, dtDay ORDER BY CTE_DailyCounts.ct DESC) AS rn
    FROM
        CTE_Dates
        LEFT JOIN CTE_DailyCounts ON CTE_DailyCounts.dt = CTE_Dates.dt
)
SELECT
    dtMonth
    ,dtDay
    ,total_count
    ,max_daily
    ,dtYear AS year_max_daily
    ,percent_years_count_not_zero
FROM
    CTE_FullStats
WHERE
    rn = 1
ORDER BY
    dtMonth
    ,dtDay
;
Vladimir Baranov
  • 31,799
  • 5
  • 53
  • 90
  • Thank you SO much. I'm having fun going through and figuring exactly how you achieved the results. Thanks for your time on this. – user1610717 May 25 '18 at 15:30
  • @user1610717, you are welcome. I'd recommend to run each step (CTE) of the query and examine each intermediate result to understand what is going on. – Vladimir Baranov May 28 '18 at 11:05
  • I'm really struggling trying to use percentile_cont within the CTE_FullStats section of the query. When I add percentile_cont to the query, all I get is a blank column. I've read a good bit about percentile_cont, but I can't figure out how to get it to work. I though perhaps the "where rn=1" was preventing it from working right. However, why would the max() and sum() portions work? Do you have any insight into how I can get percentiles to work under this query framework? Thanks! – user1610717 Jun 02 '18 at 17:36
  • @user1610717, Yes, `where rn=1` is likely to disrupt other calculations. It is not clear for me in the question how these percentiles should be calculated (that's why I didn't attempt to put them in the answer). You can always write a second query that calculates just percentiles you need for month/day and then join two results together on month,day. Try to put extra calculations in `CTE_FullStats` and don't use the last bit with `where rn=1` at first. See if you can get results you expect in `CTE_FullStats`. You can also ask another **simpler** question which focuses on percentiles only. – Vladimir Baranov Jun 02 '18 at 23:55
  • I've removed rn=1 and couldn't decipher the results. I thought since the query was successful at doing the sum statistics, it would be able to return a percentile. I have submitted another question, focusing on percentiles. I left the other portions of the query in, but if percentiles can't be incorporated into this already long query, I understand. Thanks again for your help. – user1610717 Jun 04 '18 at 06:26
  • @user1610717, I guess, you need to read up the docs and understand what `OVER (PARTITION BY ...)` does. Understanding this construct should help you decipher/understand the results. Without this knowledge you can't be sure that answers that you get have queries that produce correct results. – Vladimir Baranov Jun 04 '18 at 06:37