An added requirement complicates the question. See below
I have a table with 2 columns time
and location
.
╔════════╦══════════╗
║ time ║ location ║
╠════════╬══════════╣
║ 9:10 ║ 1 ║
║ 9:20 ║ 3 ║
║ 9:35 ║ 3 ║
║ 9:40 ║ 3 ║
║ 10:10 ║ 2 ║
║ 10:20 ║ 2 ║
║ 10:40 ║ 2 ║
║ 11:30 ║ 7 ║
║ 11:50 ║ 9 ║
╚════════╩══════════╝
I want to turn it into a new table depicting 'trips', where a trip
is ended if there is no activity
for an hour
(= there is no other row with time
that is within the next hour
)
That is, there will be a row for each trip
with some values calculated (total walking distance, total trip time, etc.)
Like that:
╔═════════╦════════════════════════╗
║ trip_id ║ total_walking_distance ║
╠═════════╬════════════════════════╣
║ 1 ║ 3 (=|1-3|+|3-2|) ║
║ 2 ║ 7(=|2-7|+|7-9|) ║
╚═════════╩════════════════════════╝
So I want to 'group by time
', but not by distinct values of time but by a more complicated condition. Is it possible without resorting to procedural languages?
Sorting by time seems to make the problem easier, because we only need to know when consecutive rows are more than an hour apart, but after sorting it's still not clear how to make this into a query.
In a programming language it would be just to sort by time and then go sequentially over the rows, and when the time difference is >1 we look from the start of the current trip up to the current row and do our cacluations, and reinitialize the start to the next row. This is not possible, as far as I know, in SQL languages.
Elaboration
If some time-consecutive rows have the same location, I want to only count them once: if one stays in the same place for many consecutive rows only the first one should be in the result.
This was done by first adding the lag() columns of time and location, and removing time-consecutive rows of the same location or the same time, and then partitioning naively.
(Like this query:)
select time,location
into cleaned_from_duplicate_time_loc
from (select info.*,
time - lag(time) over (partition by id order by time) as diff_time,
loc- lag(loc) over (partition by id order by time) as diff_loc
from info)
with_consecutive_differences_of_location_and_time
where (diff_loc is null or diff_loc<>0) and (diff_time>interval '0 hour' or diff_time is null)
select with_trip_start_boolean.*, sum(is_start_of_trip) over(order by id, time) as trip_id
into with_trip_id
from
(SELECT auxiliary_table_with_lag_diffs.*,
case when diff_time> interval '1 hour' or diff_time is null then 1 else 0 end as is_start_of_trip
FROM
( --adding time_diffs for each id separately
select cleaned_from_duplicate_time_loc.*,
time - lag(time) over (partition by id order by time) as diff_time
from cleaned_from_duplicate_time_loc
)
auxiliary_table_with_lag_diffs
ORDER BY id, time)
with_trip_start_boolean
However, there is a complication that makes it wrong: if one stays in the same location
for some time and then continues to move and there is less than an hour
from the last row in which he was stationary to the next row, then we should count the stationary row twice in the result, both as the end of the previous trip and as the start of the next trip. That makes it impossible to do the cleaning-first approach altogether, because we lose the necessary information of the last 'duplicated' row in case it is less than an hour away from its successor.
What should I do to treat correctly the case where duplicate successive locations both start and end a trip?