The real problem is with some huge product catalogs. In order to learn how to SQL this properly, I'm testing with this dataset. The jumbled up order of things is intended to help ensure that the logic will be sound.
The following is a listing of the test data.
select * from test;
id | title | time |
---|---|---|
234 | 234_3 | 3 |
999 | 999_11 | 11 |
123 | 123_4 | 4 |
234 | 234_9 | 9 |
123 | 123_2.3 | 2.3 |
999 | 999_1.2 | 1.2 |
999 | 999_5 | 5 |
99 | 99_7 | 7 |
999 | 999_6 | 6 |
10023 | shoe | 3 |
10023 | shoe | 4 |
10023 | NOT A SHOE | 5 |
10023 | NOT A SHOE | 6 |
10024 | shirt | 2 |
10024 | garbage | 3 |
10024 | shirt | 4 |
234 | 234_1 | 1 |
As you can see the title here are partially named to help correlate with the other values.
The desired output is the following:
id | t_start | t_finish | title_start | title_finish |
---|---|---|---|---|
99 | 7 | 7 | 99_7 | 99_7 |
123 | 2.3 | 4 | 123_2.3 | 123_4 |
234 | 1 | 9 | 234_1 | 234_9 |
999 | 1.2 | 11 | 999_1.2 | 999_11 |
10023 | 3 | 6 | shoe | NOT A SHOE |
10024 | 2 | 4 | shirt | shirt |
At first I tried using window functions. I started with this to understand how the window function works by using an aggregate function to display the contents of the "window".
select string_agg(title, ', ') over (partition by id order by time) from test;
string_agg |
---|
99_7 |
123_2.3 |
123_2.3, 123_4 |
234_1 |
234_1, 234_3 |
234_1, 234_3, 234_9 |
999_1.2 |
999_1.2, 999_5 |
999_1.2, 999_5, 999_6 |
999_1.2, 999_5, 999_6, 999_11 |
shoe |
shoe, shoe |
shoe, shoe, NOT A SHOE |
shoe, shoe, NOT A SHOE, NOT A SHOE |
shirt |
shirt, garbage |
shirt, garbage, shirt |
This indicates the desired application of grouping by id and ordering by time. I have gone into this result table and bolded the information that I actually need.
Unfortunately I am at a complete loss as to how to:
- only choose the last of these partition windows
- choose only the first and last items
Furthermore:
- prefer avoiding stitching strings only to split them back out: real titles can have commas, or indeed any separator I might otherwise end up choosing, inside.
A second approach I went ahead and took is the below:
select
d.id,
d.title as title_start,
test.title as title_finish,
start,
finish
from
(
select
c.id,
title,
start,
finish
from (
(
select min(time) as start, id from test group by id
) a
inner join
(
select max(time) as finish, id from test group by id
) b
using (id)
) c
left join test
on c.id = test.id
where c.start = test.time
) d
left join test
on d.id = test.id
and d.finish = test.time
order by id;
id | title_start | title_finish | start | finish |
---|---|---|---|---|
99 | 99_7 | 99_7 | 7 | 7 |
123 | 123_2.3 | 123_4 | 2.3 | 4 |
234 | 234_1 | 234_9 | 1 | 9 |
999 | 999_1.2 | 999_11 | 1.2 | 11 |
10023 | shoe | NOT A SHOE | 3 | 6 |
10024 | shirt | shirt | 2 | 4 |
Here I was able to get all the way there to my desired result by brute forcing joins. But this SQL code is extremely difficult to follow. If I saw this presented to me as a solution I would feel pretty strongly about rejecting it. I would really like it if a SQL expert could review this and tell me the actual right way to code up this problem. Surely three joins are not needed?