I have a dataset that looks like this:
id1 = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2)
id2 = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
lat = c(-62.81559, -62.82330, -62.78693, -62.70136, -62.76476, -62.48157, -62.49064, -62.45838, 42.06258, 42.06310)
lon = c(-61.15518, -61.14885, -61.17801, -61.00363, -59.14270, -59.22009, -59.32967, -59.04125, 154.70579, 154.70625)
start_date = as.POSIXct(c('2016-03-24 15:30:00', '2016-03-24 15:30:00', '2016-03-24 23:40:00', '2016-03-25 12:50:00', '2016-03-29 18:20:00', '2016-06-01 02:40:00', '2016-06-01 08:00:00', '2016-06-01 16:30:00', '2016-07-29 20:20:00', '2016-07-29 20:20:00'), tz = 'UTC')
end_date = as.POSIXct(c('2016-03-24 23:40:00', '2016-03-24 18:50:00', '2016-03-25 03:00:00', '2016-03-25 19:20:00', '2016-04-01 03:30:00', '2016-06-02 01:40:00', '2016-06-01 14:50:00', '2016-06-02 01:40:00', '2016-07-30 07:00:00', '2016-07-30 07:00:00'), tz = 'UTC')
speed = c(2.9299398, 2.9437502, 0.0220565, 0.0798409, 1.2824859, 1.8685429, 3.7927680, 1.8549291, 0.8140249, 0.8287073)
df = data.frame(id1, id2, lat, lon, start_date, end_date, speed)
id1 id2 lat lon start_date end_date speed
1 1 3 -62.81559 -61.15518 2016-03-24 15:30:00 2016-03-24 23:40:00 2.9299398
2 1 3 -62.82330 -61.14885 2016-03-24 15:30:00 2016-03-24 18:50:00 2.9437502
3 1 3 -62.78693 -61.17801 2016-03-24 23:40:00 2016-03-25 03:00:00 0.0220565
4 1 3 -62.70136 -61.00363 2016-03-25 12:50:00 2016-03-25 19:20:00 0.0798409
5 1 3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859
6 1 3 -62.48157 -59.22009 2016-06-01 02:40:00 2016-06-02 01:40:00 1.8685429
7 1 3 -62.49064 -59.32967 2016-06-01 08:00:00 2016-06-01 14:50:00 3.7927680
8 1 3 -62.45838 -59.04125 2016-06-01 16:30:00 2016-06-02 01:40:00 1.8549291
9 2 3 42.06258 154.70579 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8140249
10 2 3 42.06310 154.70625 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8287073
The actual dataset is much larger. What I would like to do is consolidate this dataset by date range, grouped by id1 and id2: if the date/time range on one row is within 12 hours of the next row's range (roughly ABS(end_date[1] - start_date[2]) < 12 hours), the rows should be consolidated, with the new start_date being the earliest start and the new end_date being the latest end. All other values (lat, lon, speed) should be averaged. This is in some sense a 'deduping' effort, since rows that fall within 12 hours of each other actually represent the same 'event'. For the example above the final result would be:
id1 id2 lat lon start_date end_date speed
1 1 3 -62.7818 -61.12142 2016-03-24 15:30:00 2016-03-25 19:20:00 1.493897
2 1 3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859
3 1 3 -62.47686 -59.197 2016-06-01 02:40:00 2016-06-02 01:40:00 2.505413
4 2 3 42.06284 154.706 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8213661
Here rows 1-4 are consolidated into row 1, row 5 is left on its own as row 2, rows 6-8 are consolidated into row 3, and rows 9-10 are consolidated into row 4.
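To make the rule concrete, here is a rough sketch of how I have been trying to flag where one event ends and the next begins (the gap_hrs and event columns are just names I made up, and the handling of the 12-hour cutoff is my own guess). Note that for my expected output to come out right, a row whose range sits entirely inside the previous row's range (like rows 6 and 7) has to count as the same event, so I compare each start_date against the running maximum of the earlier end_dates rather than taking a literal ABS() of adjacent rows:

library(dplyr)

df_events <- df %>%
  arrange(id1, id2, start_date) %>%
  group_by(id1, id2) %>%
  mutate(
    # hours between this row's start and the latest end_date seen so far in
    # the group; a negative value means the two date ranges overlap
    gap_hrs = (as.numeric(start_date) - lag(cummax(as.numeric(end_date)))) / 3600,
    # the first row of a group, or a gap of 12+ hours, starts a new event
    event = cumsum(is.na(gap_hrs) | gap_hrs >= 12)
  )

As far as I can tell this puts rows 1-4, row 5, rows 6-8, and rows 9-10 into separate events within their id1/id2 groups, which is the grouping my expected output implies.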
I have been trying to do this with dplyr's group_by and summarize, but I can't seem to get the date ranges to come out correctly. The collapse step I have been attempting looks roughly like the sketch below.
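This builds on the event column sketched above, and assumes min/max for the dates and plain means for everything else, as in my expected output:

df_events %>%
  group_by(id1, id2, event) %>%
  summarise(
    lat = mean(lat),
    lon = mean(lon),
    start_date = min(start_date),
    end_date = max(end_date),
    speed = mean(speed)
  ) %>%
  ungroup() %>%
  select(-event)

I'm not confident the cummax() comparison is the right way to handle overlapping ranges in general, which is part of why I'm asking.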
Hopefully someone can suggest a simple way to solve this. Extra points if you know how to do it in SQL ;-) so I can dedupe the data before it even gets pulled into R.