Select data within one month prior to each user's last record

Question

Assume I have a table called "Diary" like this:

| id | user_id |        recorded_at       | record |
|----|---------|--------------------------|--------|
| 20 |  50245  |2017-10-01 23:00:14.765366|   89   |
| 21 |  50245  |2017-12-05 10:00:33.135331|   97   |
| 22 |  50245  |2017-12-31 11:50:23.965134|   80   |
| 23 |  76766  |2015-10-06 11:00:14.902452|   70   |
| 24 |  76766  |2015-10-07 22:40:59.124553|   81   |

For each user I want to retrieve the latest row and all rows within one month prior to that.

In other words, for user_id 50245, I want the his/her data from "2017-12-01 11:50:23.965134" to "2017-12-31 11:50:23.965134"; for user_id 76766, I want his/her data from "2015-09-07 22:40:59.124553" to "2015-10-07 22:40:59.124553".

Hence the desired result looks like this:

| id | user_id |        recorded_at       | record |
|----|---------|--------------------------|--------|
| 21 |  50245  |2017-12-05 10:00:33.135331|   97   |
| 22 |  50245  |2017-12-31 11:50:23.965134|   80   |
| 23 |  76766  |2015-10-06 11:00:14.902452|   70   |
| 24 |  76766  |2015-10-07 22:40:59.124553|   81   |

Please note that the record of id 20 is not included because it is more than one month prior to user_id 50245's last record.

Is there any way I can write an SQL query to achieve this?

@ThorstenKettner I really have no idea. If you have hints or key words to search, please kindly give some. Sorry for being a newbie in SQL. — ytu, Jan 19 '18 at 16:41
The best query technique depends on the details of your setup: Postgres version (`SELECT version();`), table definition (are `user_id` and `recorded_at` defined `NOT NULL`?), cardinalities (how many rows?), value frequencies (how many rows per user / month; min/max/avg). And is there a separate table `users` with exactly 1 row for (at least) every relevant user? — Erwin Brandstetter, Jan 19 '18 at 17:59
@ErwinBrandstetter apologies for late replies. Your posts are awesome. Please see my comment below your answer and enlighten me. Thanks. — ytu, Jan 23 '18 at 04:41

score 2 · Answer 1 · answered Jan 19 '18 at 16:54

2

I would be inclined to use window functions:

select d.*
from (select d.*, max(d.recorded_at) over (partition by d.user_id) as max_recorded_at
      from diary d
     ) d
where recorded_at >= max_recorded_at - interval '1 month';

answered Jan 19 '18 at 16:54

Gordon Linoff

1,242,037
58
646
786

score 1 · Answer 2 · answered Jan 19 '18 at 16:54

The straightforward way is to use a subquery to get the max recorded_at for each user_id and then join:

select d.*
  from diary d
       join ( select user_id, max(recorderd_at) mra
                from diary
               group by user_id ) m on d.user_id = m.user_id
 where m.mra <= d.recorded_at + interval '1 month'

this has the drawback of accessing the table twice (may be different in different RDBMS - use explain to see the execution plan).

A better alternative is to use window functions to do everything in one pass:

select id, user_id, recorderd_at
  from ( select *, max(recorderd_at) over (partition by user_id) as mra
           from diary ) x
 where mra <= recorderd_at + interval '1 months'

Disclaimer I did not test the queries above, but you should get the idea anyway - see http://sqlfiddle.com/#!17/e90000/9 for a working example w/ similar schema

score 0 · Answer 3 · answered Jan 19 '18 at 16:39

0

Not tested but something like this should work.

I would use a subquery to get the last_record then filter out those at the date and the previous month like for example :

select d.* from diary d,
(select max(recorded_at) l from diary group by user_id) as last_record 
where  d.recorded_at = last_record.l
or
  ( 
   d.recorded_at  >= date_trunc('month', last_record.l - interval '1' month)
   and d.recorded_at  < last_record.l
  )

answered Jan 19 '18 at 16:39

Laurent B

2,200
19
29

Thanks! I will definitely try that in these days. One question though: for the `,` following `select d.* from diary d`, is that a way to create a subquery and use the result later? I thought subqueries can be written only in `**** JOIN` clauses. – ytu Jan 19 '18 at 16:56

Erwin Brandstetter · Accepted Answer · 2018-01-23T23:29:03.827

0

For small tables, any (valid) query technique is good.

For big tables, details matter. Assuming:

There is also a users table with user_id as PK containing all relevant users (or possibly a few more). This is the typical setup.
You have (or can create) an index on diary (user_id, recorded_at DESC NULLS LAST). NULLS LAST is optional if recorded_at is defined NOT NULL. But make sure the query matches the index.
More than a few rows per user - the typical use case.

This should be among the fastest options:

SELECT d.*
FROM   users u
CROSS  JOIN LATERAL (
   SELECT recorded_at
   FROM   diary
   WHERE  user_id = u.user_id
   ORDER  BY recorded_at DESC NULLS LAST
   LIMIT 1
   ) d1
JOIN   diary d ON d.user_id = u.user_id
              AND d.recorded_at >= d1.recorded_at - interval '1 month'
ORDER  BY d.user_id, d.recorded_at;

Produces your desired result exactly.

For only few rows per user, max() or DISTINCT ON () in a subquery are typically faster.

Related (with detailed explanation):

About the FROM clause:

edited Jan 23 '18 at 23:29

answered Jan 19 '18 at 19:39

Erwin Brandstetter

605,456
145
1,078
1,228

I have some following questions: First, why does `SELECT d.* FROM users u` in the first two lines work? Usually I would do something like `SELECT * FROM diary d INNER JOIN users u ON d.user_id = u.user_id`. Second, why do you use `CROSS JOIN LATERAL` instead of `LEFT JOIN LATERAL`? Is it because the former doesn't need an `ON` statement? Third and finally, what kind of join is the last `JOIN` in your demonstration? – ytu Jan 23 '18 at 04:38
1

@ytu: `d` is the table alias for `diary` which we join in later. `CROSS JOIN` is an unconditional join that only requires a row on each side - as opposed to `LEFT [OUTER] JOIN` which requires a join condition, but retains all rows on the left side, even without match on the right side. `INNER` and `OUTER` are optional noise words. More explanation in the added links and basics in the manual: https://www.postgresql.org/docs/current/static/sql-select.html#SQL-FROM. – Erwin Brandstetter Jan 23 '18 at 12:13
I am aware of the alias setting `d` for `diary`. However *selecting* columns in `diary` *from* `users` is not intuitive to me, and I found little information about how `FROM` clause works in this situation. Doesn't the order of `FROM` clause matter? How is it better than `SELECT d.* FROM diary d CROSS JOIN LATERAL` ... `JOIN users u ON d.user_id = u.user_id`? – ytu Jan 23 '18 at 14:07
1

@ytu: Columns from all table expressions listed in the `FROM` clause (optionally joined with explicit `JOIN` syntax) are visible in the `SELECT` list. Typically, the order of items in the `FROM` clause does not matter. But there are various exceptions, `LATERAL` joins being a notable one. I added some more pointers above. – Erwin Brandstetter Jan 23 '18 at 23:28

Select data within one month prior to each user's last record

4 Answers4

Linked