SQL query to group by maximal sets of a column having inner consecutive distances below a threshold

Question

An added requirement complicates the question. See below

I have a table with 2 columns time and location.

╔════════╦══════════╗
║ time   ║ location ║
╠════════╬══════════╣
║ 9:10   ║  1       ║
║ 9:20   ║ 3        ║
║ 9:35   ║ 3        ║
║ 9:40   ║ 3        ║
║ 10:10  ║  2       ║
║ 10:20  ║  2       ║
║ 10:40  ║  2       ║
║ 11:30  ║ 7        ║
║ 11:50  ║  9       ║
╚════════╩══════════╝

I want to turn it into a new table depicting 'trips', where a trip is ended if there is no activity for an hour (= there is no other row with time that is within the next hour)

That is, there will be a row for each trip with some values calculated (total walking distance, total trip time, etc.)

Like that:

╔═════════╦════════════════════════╗
║ trip_id ║ total_walking_distance ║
╠═════════╬════════════════════════╣
║ 1       ║ 3 (=|1-3|+|3-2|)       ║
║ 2       ║ 7(=|2-7|+|7-9|)        ║
╚═════════╩════════════════════════╝

So I want to 'group by time', but not by distinct values of time but by a more complicated condition. Is it possible without resorting to procedural languages?

Sorting by time seems to make the problem easier, because we only need to know when consecutive rows are more than an hour apart, but after sorting it's still not clear how to make this into a query.

In a programming language it would be just to sort by time and then go sequentially over the rows, and when the time difference is >1 we look from the start of the current trip up to the current row and do our cacluations, and reinitialize the start to the next row. This is not possible, as far as I know, in SQL languages.

Elaboration

If some time-consecutive rows have the same location, I want to only count them once: if one stays in the same place for many consecutive rows only the first one should be in the result.

This was done by first adding the lag() columns of time and location, and removing time-consecutive rows of the same location or the same time, and then partitioning naively.

(Like this query:)

select      time,location
into cleaned_from_duplicate_time_loc
from            (select info.*,
                time - lag(time) over (partition by id order by time)    as diff_time,
                loc- lag(loc) over (partition by id order by time)       as diff_loc
                from info)  
                with_consecutive_differences_of_location_and_time
where (diff_loc is null or diff_loc<>0) and (diff_time>interval '0 hour' or diff_time is null)


select with_trip_start_boolean.*, sum(is_start_of_trip) over(order by id, time) as trip_id
 into with_trip_id
from
    (SELECT auxiliary_table_with_lag_diffs.*,
            case when diff_time> interval '1 hour' or diff_time is null then 1 else 0 end as is_start_of_trip
    FROM 
        (   --adding time_diffs for each id separately
          select cleaned_from_duplicate_time_loc.*,
            time - lag(time) over (partition by id order by time) as diff_time
          from cleaned_from_duplicate_time_loc
        )
     auxiliary_table_with_lag_diffs

    ORDER BY id, time) 
    with_trip_start_boolean

However, there is a complication that makes it wrong: if one stays in the same location for some time and then continues to move and there is less than an hour from the last row in which he was stationary to the next row, then we should count the stationary row twice in the result, both as the end of the previous trip and as the start of the next trip. That makes it impossible to do the cleaning-first approach altogether, because we lose the necessary information of the last 'duplicated' row in case it is less than an hour away from its successor.

What should I do to treat correctly the case where duplicate successive locations both start and end a trip?

Show some sample data and expected result. It would be easier to understand.. Also, ideally, you would post your attempts at resolving this. — Radu Gheorghiu, Aug 17 '15 at 13:29
If I have `LOC1/10:00 - LOC2/10:30 - LOC3/11:00` it means that there was no rest at LOC2 and the trip was immediately continued, yes? If I have `LOC1/10:00 - LOC2/10:30 - LOC2/10:40 - LOC3/11:00` it means that there was a rest of ten minutes at LOC2 which must be subtracted from the total walking time. Yes? And if there is a rest of over an hour, I need bridge records in order to show that the tript wasn't ended, i.e. instead of `LOC1/10:00 - LOC2/10:30 - LOC2/11:40 - LOC3/12:00` which would be 2 trips, I need `LOC1/10:00 - LOC2/10:30 - LOC2/11:00 - LOC2/11:40 - LOC3/12:00`. Correct? — Thorsten Kettner, Aug 26 '15 at 12:42
1&2 correct, third question: on both cases there are 2 trips, and the middle 11:00 `loc2` should not appear in the result. because it entered `loc2` in `10:30` and left it only in `11:40`, more than an hour later. I added a clarifying example in the first table. The `total walking time` is not so important, `total trip time` is just as good and avoids that difficulty. — Keldeo, Aug 26 '15 at 13:10

score 1 · Accepted Answer · answered Aug 17 '15 at 13:40

Basically, what you want is the lag() function with some date arithmetic. Specifically, you want a cumulative sum of the condition that a new trip starts:

select row_number() over (order by min(time)) as trip_num,
       min(time) as trip_start, max(time) as trip_end,
       count(*) as num_stops,
       count(distinct location) as num_locations
from (select sum(case when time > prev_time + interval '1 hour' then 1 else 0 end) over
                 (order by time) as grp
      from (select t.*,
                   lag(time) over (order by time) as prev_time
            from table t
           ) t
     ) t
group by grp;

This query answers the original phrasing, but does not address the added complication of deleting duplicate successive locations while maintaining the reasonable interpretation of a trip. — Keldeo, Sep 02 '15 at 07:15

SQL query to group by maximal sets of a column having inner consecutive distances below a threshold

Elaboration

1 Answers1

Linked