
I have a dataset that looks like this.

id1 = c(1,1,1,1,1,1,1,1,2,2)
id2 = c(3,3,3,3,3,3,3,3,3,3)
lat = c(-62.81559,-62.82330, -62.78693,-62.70136, -62.76476,-62.48157,-62.49064,-62.45838,42.06258,42.06310)
lon = c(-61.15518, -61.14885,-61.17801,-61.00363, -59.14270, -59.22009, -59.32967, -59.04125 ,154.70579, 154.70625)
start_date= as.POSIXct(c('2016-03-24 15:30:00', '2016-03-24 15:30:00','2016-03-24 23:40:00','2016-03-25 12:50:00','2016-03-29 18:20:00','2016-06-01 02:40:00','2016-06-01 08:00:00','2016-06-01 16:30:00','2016-07-29 20:20:00','2016-07-29 20:20:00'), tz = 'UTC')
end_date = as.POSIXct(c('2016-03-24 23:40:00', '2016-03-24 18:50:00','2016-03-25 03:00:00','2016-03-25 19:20:00','2016-04-01 03:30:00','2016-06-02 01:40:00','2016-06-01 14:50:00','2016-06-02 01:40:00','2016-07-30 07:00:00','2016-07-30 07:00:00'),tz = 'UTC')
speed = c(2.9299398, 2.9437502, 0.0220565, 0.0798409, 1.2824859, 1.8685429, 3.7927680, 1.8549291, 0.8140249,0.8287073)
df = data.frame(id1, id2, lat, lon, start_date, end_date, speed)

id1 id2       lat       lon          start_date            end_date     speed
1    1   3 -62.81559 -61.15518 2016-03-24 15:30:00 2016-03-24 23:40:00 2.9299398
2    1   3 -62.82330 -61.14885 2016-03-24 15:30:00 2016-03-24 18:50:00 2.9437502
3    1   3 -62.78693 -61.17801 2016-03-24 23:40:00 2016-03-25 03:00:00 0.0220565
4    1   3 -62.70136 -61.00363 2016-03-25 12:50:00 2016-03-25 19:20:00 0.0798409
5    1   3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859
6    1   3 -62.48157 -59.22009 2016-06-01 02:40:00 2016-06-02 01:40:00 1.8685429
7    1   3 -62.49064 -59.32967 2016-06-01 08:00:00 2016-06-01 14:50:00 3.7927680
8    1   3 -62.45838 -59.04125 2016-06-01 16:30:00 2016-06-02 01:40:00 1.8549291
9    2   3  42.06258 154.70579 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8140249
10   2   3  42.06310 154.70625 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8287073

The actual dataset is larger. What I would like to do is consolidate it based on date ranges, grouped by id1 and id2: if the date/time range on one row is within 12 hours of the next row's range, i.e. 'abs(end_date[1] - start_date[2]) < 12 hours', the rows should be consolidated, with the new start_date being the earliest date and the new end_date being the latest. All other values (lat, lon, speed) are averaged. This is in some sense a 'deduping' effort, as rows within 12 hours of each other actually represent the same 'event'. For the example above, the final result would be

id1 id2       lat       lon          start_date            end_date     speed
1    1   3 -62.7818  -61.12142 2016-03-24 15:30:00 2016-03-25 19:20:00 1.493897
2    1   3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859
3    1   3 -62.47686 -59.197   2016-06-01 02:40:00 2016-06-02 01:40:00 2.505413
4    2   3  42.06284 154.706   2016-07-29 20:20:00 2016-07-30 07:00:00 0.8213661

With the first four rows consolidated (into row 1), row 5 left alone (row 2), rows 6-8 consolidated (row 3), and rows 9-10 consolidated (row 4).
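
As a check, each consolidated row is just the column means of its input rows, with the date range spanning them; for the first output row:

colMeans(df[1:4, c('lat', 'lon', 'speed')])
# lat: -62.7818, lon: -61.1214, speed: 1.4939 (approx.)
min(df$start_date[1:4])  # 2016-03-24 15:30:00 UTC
max(df$end_date[1:4])    # 2016-03-25 19:20:00 UTC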

I have been trying to do this with dplyr's group_by and summarize, but I can't seem to get the date ranges to come out correctly.
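
For illustration, a plain group_by()/summarize() collapses each id1/id2 pair into a single row and ignores the 12-hour rule entirely, so what seems to be missing is a grouping variable that identifies the events:

library(dplyr)

df |>
  group_by(id1, id2) |>
  summarize(start_date = min(start_date), end_date = max(end_date))
# 2 rows: one per id pair rather than one per event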

Hopefully someone can come up with a simple means of solving the problem. Extra points if you know how to do it in SQL ;-) so I can dedupe before even pulling the data into R.

Nate Miller
[Collapse rows with overlapping ranges](http://stackoverflow.com/questions/41747742/collapse-rows-with-overlapping-ranges), [collapse intersecting regions in R](http://stackoverflow.com/questions/16957293/collapse-intersecting-regions-in-r) should get you going. – Henrik Feb 07 '17 at 07:29

2 Answers


Here is a first, very naive base R implementation. Warning: it is slow and not pretty. Note that it expects the rows to be ordered by id and by start date; if that's not the case in your data set, sort it first in R or SQL. Sorry that I can't think of an SQL solution; I'd still like to see one, if anyone has an idea. A rough dplyr sketch of the same grouping is added after the output below.
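
Sorting first could look like this, for example:

df <- df[order(df$id1, df$id2, df$start_date), ]  # sort by ids, then start date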

dedupe <- function(df) {
  # expects df to be ordered by id1, id2 and start_date
  counter <- 1                                     # rows in the current group
  sums <- unlist(df[1, c("lat", "lon", "speed")])  # running sums to average later
  group_start <- df$start_date[1]
  group_end <- df$end_date[1]
  summarized_df <- df[0, ]                         # empty result with the same columns
  summarized_df$counter <- integer(0)
  for (i in 2:nrow(df)) {
    same_ids <- df$id1[i] == df$id1[i - 1] && df$id2[i] == df$id2[i - 1]
    close_in_time <-
      abs(difftime(df$start_date[i], df$end_date[i - 1], units = "hours")) < 12 ||
      abs(difftime(df$start_date[i], df$start_date[i - 1], units = "hours")) < 12
    if (same_ids && close_in_time) {
      # same event because the ids match and the time ranges are close:
      # accumulate the sums and extend the date range
      sums <- sums + unlist(df[i, c("lat", "lon", "speed")])
      group_end <- max(group_end, df$end_date[i])
      counter <- counter + 1
    } else {
      # new event: write out the finished group, then reset the accumulators
      summarized_df[nrow(summarized_df) + 1, c("id1", "id2")] <- df[i - 1, c("id1", "id2")]
      summarized_df[nrow(summarized_df), c("lat", "lon", "speed")] <- sums / counter
      summarized_df$start_date[nrow(summarized_df)] <- group_start
      summarized_df$end_date[nrow(summarized_df)] <- group_end
      summarized_df$counter[nrow(summarized_df)] <- counter
      counter <- 1
      sums <- unlist(df[i, c("lat", "lon", "speed")])
      group_start <- df$start_date[i]
      group_end <- df$end_date[i]
    }
  }
  # write out the last group
  summarized_df[nrow(summarized_df) + 1, c("id1", "id2")] <- df[nrow(df), c("id1", "id2")]
  summarized_df[nrow(summarized_df), c("lat", "lon", "speed")] <- sums / counter
  summarized_df$start_date[nrow(summarized_df)] <- group_start
  summarized_df$end_date[nrow(summarized_df)] <- group_end
  summarized_df$counter[nrow(summarized_df)] <- counter
  summarized_df
}

Function call

> dedupe(df)
  id1 id2       lat       lon          start_date            end_date     speed counter
1   1   3 -62.78179 -61.12142 2016-03-24 15:30:00 2016-03-25 19:20:00 1.4938968       4
2   1   3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859       1
3   1   3 -62.47686 -59.19700 2016-06-01 02:40:00 2016-06-02 01:40:00 2.5054133       3
4   2   3  42.06284 154.70602 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8213661       2
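
And here is the rough dplyr sketch. Instead of the pairwise condition above, it starts a new event whenever a row begins 12 hours or more after the latest end_date seen so far within its id1/id2 group, so overlapping ranges (such as rows 6 and 7 in the example) stay together. On the example data it reproduces the four consolidated rows from the question; beyond that it is untested.

library(dplyr)

df |>
  arrange(id1, id2, start_date) |>
  group_by(id1, id2) |>
  # latest end time seen so far within the group, in seconds
  mutate(prev_end = lag(cummax(as.numeric(end_date))),
         # hours between this row's start and that latest end;
         # negative values mean the ranges overlap
         gap_hours = (as.numeric(start_date) - prev_end) / 3600,
         # a new event begins at the first row of a group
         # or after a gap of 12 hours or more
         event = cumsum(is.na(gap_hours) | gap_hours >= 12)) |>
  group_by(id1, id2, event) |>
  summarize(lat = mean(lat), lon = mean(lon),
            start_date = min(start_date), end_date = max(end_date),
            speed = mean(speed), .groups = "drop")
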
Juergen

This can be achieved easily with insurancerating::reduce(); note that min.gapwidth is given in seconds here, hence 12 * 3600 for the 12-hour gap:

df |>
  insurancerating::reduce(begin = start_date, end = end_date, id1, id2, 
                          agg_cols = c(lat, lon, speed), agg = "mean", 
                          min.gapwidth = 12 * 3600)
#>   id1 id2 index            end_date          start_date       lat       lon
#> 1   1   3     0 2016-03-25 19:20:00 2016-03-24 15:30:00 -62.78180 -61.12142
#> 2   1   3     1 2016-04-01 03:30:00 2016-03-29 18:20:00 -62.76476 -59.14270
#> 3   1   3     2 2016-06-02 01:40:00 2016-06-01 02:40:00 -62.47686 -59.19700
#> 4   2   3     0 2016-07-30 07:00:00 2016-07-29 20:20:00  42.06284 154.70602
#>       speed
#> 1 1.4938969
#> 2 1.2824859
#> 3 2.5054133
#> 4 0.8213661

Created on 2022-06-13 by the reprex package (v2.0.1)
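
The insurancerating package is on CRAN, so if it is not installed yet:

install.packages("insurancerating")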

mharinga