Postgres: select query with group by clause on a range of dates

Question

My table contains answers from repeatable questionnaires that can be filled in a range of 30 days and are scheduled every 60 days. Therefore, the answers from a single instance of a questionnaire are spread in a range of date that is always smaller tha 30 days and the first answer to the following repeatable questionnaire is at least 31 days after the last answer of the previous one. How do I create a view that calculate a score (which is basically the sum of the answers of a single questionnaire) among the values whose dates are within 30 days from the start date (min date)?

Table raw_data
------------------------------------------------
user_name | question_id | answer | answer_date |
------------------------------------------------
user001   |      1      |   2    | 2019-02-04  |
user001   |      2      |   1    | 2019-02-04  |
user001   |      3      |   2    | 2019-02-05  |
user001   |      4      |   2    | 2019-02-05  |
user001   |      5      |   2    | 2019-02-09  |
user002   |      1      |   2    | 2019-01-09  |
user002   |      2      |   2    | 2019-01-10  |
user002   |      3      |   1    | 2019-02-01  |
user002   |      4      |   2    | 2019-02-01  |
user002   |      5      |   1    | 2019-02-01  |
user002   |      1      |   2    | 2019-03-11  |
user002   |      2      |   2    | 2019-03-11  |
user002   |      3      |   1    | 2019-03-12  |
user002   |      4      |   1    | 2019-03-13  |
user002   |      5      |   1    | 2019-03-14  |


Expected result
------------------------------
user_name | sum | start_date |
------------------------------
user001   |  9  | 2019-02-04 | 
user002   |  8  | 2019-01-09 |
user002   |  7  | 2019-03-11 |

The solution I tried works for the first group only:

SELECT user_name, SUM(answer::int),
CASE 
WHEN answer_date - MIN(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) < 30 
THEN MIN(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) 
ELSE answer_date END AS start_date,
FROM public.raw_data
GROUP BY user_name, answer_date

https://stackoverflow.com/search?q=group+by+date+range+postgresql — Ben, Mar 26 '19 at 14:20
Use a calendar table or a subexspression which generates a calendar table. — Ben, Mar 26 '19 at 14:20
Thank you Ben, I read these answers but I couldn't find a solution to my issue. The problem is that I don't have any start date: only ranges that are greater or less than 30 days, that can start at any date and have a variable number of records. — ecalanchi, Mar 26 '19 at 14:26
yes, and even more... I searched that keywords in almost any possible combination before posting my question! — ecalanchi, Mar 26 '19 at 14:45
I was wrong. No need to pay anyone, three people have done it for you without so much as an upvote. — Ben, Mar 26 '19 at 20:13

Erwin Brandstetter · Answer 1 · 2019-03-26T17:08:35.980

1

It's a classical gaps-and-islands problem. You'll find a lot under the tag I added.

An optimized query for your case could look like:

SELECT user_name
     , sum(answer)
     , min(answer_date) AS start_date 
FROM  (
   SELECT user_name, answer, answer_date
        , count(*) FILTER (WHERE step) OVER (PARTITION BY user_name ORDER BY answer_date) AS grp
   FROM  (
      SELECT user_name, answer, answer_date
           , lag(answer_date) OVER (PARTITION BY user_name ORDER BY answer_date) < answer_date - 30 AS step
      FROM   raw_data
      ) sub1
   ) sub2
GROUP  BY user_name, grp
ORDER  BY user_name, start_date;  -- ORDER BY optional

db<>fiddle here

Closely related, with more explanation:

How to group timestamps into islands (based on arbitrary gap)?

edited Mar 26 '19 at 17:08

answered Mar 26 '19 at 17:00

Erwin Brandstetter

605,456
145
1,078
1,228

Thank you @Erwin, I didn't know the problem in these terms. However, I would point out that this solution doesn't seem to work on a more realistic dataset. I added also max(date) to the query and I found that a user has a single group (span of 3 months) and onother one has 2 groups (span of 12 and 1 day) that should be merged in a single one. – ecalanchi Mar 27 '19 at 10:40
@ecalanchi: Well, the questions defines: `the answers from a single instance of a questionnaire are spread in a range of date that is always smaller than 30 days and the first answer to the following repeatable questionnaire is at least 31 days after the last answer of the previous one` I only enforce a minimum gap of "at least 31 days", as the first condition can create contradicting states: What about a row 10 days from the previous one, spreading the range to more than 30 days? Other than that, you'd have to present a test case where this query fails - in a fiddle ideally. – Erwin Brandstetter Mar 28 '19 at 00:55

score 0 · Answer 2 · answered Mar 26 '19 at 15:33

Use lag() to find the gaps. Then a cumulative sum to assign a "question period" and then summarize:

select userid, min(answer_date) as start_date, sum(answer)
from (select rd.*,
             count(*) filter (where prev_ad is null or prev_ad < answer_date - interval '30 day') over (partition by user_id) as period
      from (select rd.*,
                   lag(answer_date) over (partition by user_id order by answer_date) as prev_ad
            from raw_data rd
           ) rd
     )
group by userid, period;

score 0 · Answer 3 · answered Mar 26 '19 at 16:29

Thanks to @Gordon and to this answer I eventually found the missing step to determine my groups on a date range basis.

I will use the following query to create a view and SUM answers grouping by grp2

WITH query AS (
SELECT r.*,
SUM(CASE WHEN answer_date < prev_date + 30 THEN 0 ELSE 1 END) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) AS grp
  FROM (SELECT r.*,
    LAG(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) AS prev_date
    FROM raw_data r 
  ) r
)
SELECT user_name, question_id, answer_date, answer, DENSE_RANK() OVER (ORDER BY user_name, grp) AS grp2
FROM query

Barbaros Özhan · Answer 4 · 2019-03-26T20:55:20.537

You can use the query with row_number() window analytic function as below

with raw_data( user_name, question_id, answer, answer_date ) as
(
 select  'user001',1,2, '2019-02-04' union all
 select  'user001',2,1, '2019-02-04' union all
 select  'user001',3,2, '2019-02-05' union all
 select  'user001',4,2, '2019-02-05' union all
 select  'user001',5,2, '2019-02-09' union all
 select  'user002',1,2, '2019-01-09' union all
 select  'user002',2,2, '2019-01-10' union all
 select  'user002',3,1, '2019-02-01' union all
 select  'user002',4,2, '2019-02-01' union all
 select  'user002',5,1, '2019-02-01' union all
 select  'user002',1,2, '2019-03-11' union all
 select  'user002',2,2, '2019-03-11' union all
 select  'user002',3,1, '2019-03-12' union all
 select  'user002',4,1, '2019-03-13' union all
 select  'user002',5,1, '2019-03-14'
)    
select user_name, sum(answer) as sum, min(answer_date) as start_date
  from 
  (
   select row_number() over (partition by question_id order by user_name, answer_date) as rn,
          t.*
     from raw_data t
   ) t
  group by user_name, rn
  order by rn;

user_name   sum   start_date
---------   ---   ----------
user001     9     2019-02-04
user002     8     2019-01-09
user002     7     2019-03-11

Demo

Postgres: select query with group by clause on a range of dates

4 Answers4