I'm working on a dataset consisted of trips (variables: user_id, purpose, start_timeh, end_timeh
. Each observation should be one trip; however due to data error, about half of the trips were broken into trip segments. This resulted in multiple observations for one single trips (for example, trip #3 were broken into 5 segments /observations with distinctive trip IDs).
> head(df2)
trip_id user_id duration_min purpose start_timeh end_timeh fix_needed
1 151203 abc 10.0 7 19:07 19:16 1
2 151204 abc 1.5 7 19:16 19:18 1
3 151206 abc 1.0 7 20:59 21:03 1
4 151207 abc 3.0 7 21:03 21:05 1
5 151208 abc 5.5 7 21:05 21:10 1
6 151210 abc 4.5 2 21:18 21:25 0
>
> dput(head(df2,4))
structure(list(trip_id = c(151203L, 151204L, 151206L, 151207L
), user_id = structure(c(1L, 1L, 1L, 1L), .Label = "abc", class = "factor"),
duration_min = c(10, 1.5, 1, 3), purpose = c(7L, 7L, 7L,
7L), start_timeh = structure(c(27L, 29L, 37L, 39L), .Label = c("0:08",
"15:50", "15:53", "15:55", "16:01", "16:10", "16:35", "17:04",
"17:08", "17:14", "17:25", "17:28", "17:32", "17:34", "17:48",
"17:54", "18:14", "18:17", "18:19", "18:28", "18:41", "18:44",
"18:47", "18:50", "18:54", "18:56", "19:07", "19:08", "19:16",
"19:18", "19:19", "19:23", "19:30", "19:59", "2:12", "2:25",
"20:59", "21:00", "21:03", "21:05", "21:18"), class = "factor"),
end_timeh = structure(c(28L, 29L, 36L, 37L), .Label = c("0:49",
"15:50", "15:53", "15:55", "16:01", "16:04", "16:12", "16:57",
"17:06", "17:08", "17:25", "17:32", "17:34", "17:52", "17:55",
"17:56", "18:14", "18:16", "18:19", "18:30", "18:44", "18:47",
"18:50", "18:54", "18:56", "18:58", "19:10", "19:16", "19:18",
"19:19", "19:27", "19:28", "19:32", "2:17", "20:06", "21:03",
"21:05", "21:10", "21:25", "21:39", "3:05"), class = "factor"),
fix_needed = c(1L, 1L, 1L, 1L)), .Names = c("trip_id", "user_id",
"duration_min", "purpose", "start_timeh", "end_timeh", "fix_needed"
), row.names = c(NA, 4L), class = "data.frame")
I'd like to combine these trip segments based on three criteria:
- Same user_id
- Same trip purpose
- (Start time of segment i) = (end time of segment i-1).
The final results should look like this:
- Trip segments are merged into full trips.
- The duration of the full trips equal to the sum of duration of all segments.
- The segments should be removed from the dataset.
So I ran this:
v = vector('numeric')
for (i in 2:nrow(df)) {
if (as.numeric(df$start_timeh[i]) == as.numeric(df$end_timeh[i-1]) && df$user_id[i] == df$user_id[i-1] && df$purpose[i] == df$purpose[i-1])
{
df$duration_min[i]<-df$duration_min[i]+df$duration_min[i-1]
v <- c(v,i-1)
}
}
df <- df[-v,]
However the results were not as I expected. A lot of segments were not removed.
Updated: There was an error with the time stamp that created this problem. The code was correct.