Plain SQL
As Bulba has said appropriate way is to join a subquery with group by.
JPA, JPQL
The problem is that you can't join a subquery.
Here is a workaround.
Lets see what you get in the subquery with group by. You get a list of pairs (attendee_id, max(meeting_date))
.
This pair is like a new unique id for row with max date you want to join on.
Then note that each row in the table forms a pair (attendee_id, meeting_date)
.
So every row has an id as a pair (attendee_id, meeting_date)
.
Lets take a row if only it forms an id that belongs to list received in the subquery.
For simplicity lets represent this id-pair as a concatenation of attendee_id
and meeting_date
: concat(attendee_id, meeting_date)
.
Then the query in SQL(similarly for JPQL and JPA CriteriaBuilder) would be as follows:
SELECT * FROM meetings
WHERE concat(attendee_id, meeting_date) IN
(SELECT concat(attendee_id, max(meeting_date)) FROM meetings GROUP BY attendee_id)
Note that there is only one subquery per query, not one subquery for each row like in some answers.
Afraid of comparing strings?
We have a special offer for you!
Lets encode that id-pair to number.
It will be a sum of attendee_id
and meeting_date
but with modifications to ensure uniqueness of code. We can take number representation of date as Unix time.
We will fix the value of max date that our code can capture because final code has max value limit (e.g. bigint(int8)<263). Lets take for convenience max date as 2149-06-07 03:00:00. It equals 5662310400 in seconds and 65536 in days.
I will assume here that we need precision for date in days(so we ignore hours and below).
To construct unique code we can interpret it as a number in a numerical system with base of 65536. The last symbol(number from 0 to 216-1) in or code in such numerical system is number of days. Other symbols will capture attendee_id
. In such interpretation code looks like XXXX
, where each X is in range [0,216-1] (to be more accurate, first X is in range [0,215-1] because of 1 bit for sign), first three X represents attendee_id
and last X represents meeting_date
.
So the max value of attendee_id
our code can capture is 247-1.
The code can be computed as attendee_id
*65536 + "date in days".
In postgresql it will be:
attendee_id*65536 + date_part('epoch', meeting_date)/(60*60*24)
Where date_part
returns date in seconds which we convert to days by dividing on constant.
And final query to get the latest meetings for all attendees:
SELECT * FROM meetings
WHERE attendee_id*65536 + date_part('epoch', meeting_date)/(60*60*24)
IN (SELECT attendee_id*65536 + date_part('epoch', max(meeting_date))/(60*60*24) from meetings GROUP BY attendee_id);
Benchmarking
I have created a table with stucture as in the question and populated it with 100000 rows randomly selecting attendee_id
from [1, 10000] and random date from range [1970-01-01, 2017-09-16]. I have benchmarked (with EXPLAIN ANALYZE) queries with the following techniques:
Correlated subquery
SELECT * FROM meetings m1 WHERE m1.meeting_date=
(SELECT max(m2.meeting_date) FROM meetings m2 WHERE m2.attendee_id=m1.attendee_id);
Execution time: 873260.878 ms
Join subquery with group by
SELECT * FROM meetings m
JOIN (SELECT attendee_id, max(meeting_date) from meetings GROUP BY attendee_id) attendee_max_date
ON attendee_max_date.attendee_id = m.attendee_id;</code>
Execution time: 103.427 ms
Use pair (attendee_id, date)
as a key
Concat attendee_id
and meeting_date
as strings
SELECT * FROM meetings WHERE concat(attendee_id, meeting_date) IN
(SELECT concat(attendee_id, max(meeting_date)) from meetings GROUP BY attendee_id);
Execution time: 207.720 ms
Encode attendee_id
and meeting_date
to a single number(code)
SELECT * FROM meetings
WHERE attendee_id*65536 + date_part('epoch',meeting_date)/(60*60*24)
IN (SELECT attendee_id*65536 + date_part('epoch',max(meeting_date))/(60*60*24) from meetings GROUP BY attendee_id);
Execution time: 127.595 ms
Here is a git with table scheme, table data (as csv), code for populating table, and queries.