Summing data across row based on timestamps

Question

I'm working on a dataset consisting of trips (variables: user_id, purpose, start_timeh, end_timeh). Each observation should be one trip; however, due to a data error, about half of the trips were broken into trip segments. This resulted in multiple observations for a single trip (for example, trip #3 was broken into 5 segments/observations with distinct trip IDs).

> head(df2)
  trip_id user_id duration_min purpose start_timeh end_timeh fix_needed
1  151203     abc         10.0       7       19:07     19:16          1
2  151204     abc          1.5       7       19:16     19:18          1
3  151206     abc          1.0       7       20:59     21:03          1
4  151207     abc          3.0       7       21:03     21:05          1
5  151208     abc          5.5       7       21:05     21:10          1
6  151210     abc          4.5       2       21:18     21:25          0
>

> dput(head(df2,4))
structure(list(trip_id = c(151203L, 151204L, 151206L, 151207L
), user_id = structure(c(1L, 1L, 1L, 1L), .Label = "abc", class = "factor"), 
    duration_min = c(10, 1.5, 1, 3), purpose = c(7L, 7L, 7L, 
    7L), start_timeh = structure(c(27L, 29L, 37L, 39L), .Label = c("0:08", 
    "15:50", "15:53", "15:55", "16:01", "16:10", "16:35", "17:04", 
    "17:08", "17:14", "17:25", "17:28", "17:32", "17:34", "17:48", 
    "17:54", "18:14", "18:17", "18:19", "18:28", "18:41", "18:44", 
    "18:47", "18:50", "18:54", "18:56", "19:07", "19:08", "19:16", 
    "19:18", "19:19", "19:23", "19:30", "19:59", "2:12", "2:25", 
    "20:59", "21:00", "21:03", "21:05", "21:18"), class = "factor"), 
    end_timeh = structure(c(28L, 29L, 36L, 37L), .Label = c("0:49", 
    "15:50", "15:53", "15:55", "16:01", "16:04", "16:12", "16:57", 
    "17:06", "17:08", "17:25", "17:32", "17:34", "17:52", "17:55", 
    "17:56", "18:14", "18:16", "18:19", "18:30", "18:44", "18:47", 
    "18:50", "18:54", "18:56", "18:58", "19:10", "19:16", "19:18", 
    "19:19", "19:27", "19:28", "19:32", "2:17", "20:06", "21:03", 
    "21:05", "21:10", "21:25", "21:39", "3:05"), class = "factor"), 
    fix_needed = c(1L, 1L, 1L, 1L)), .Names = c("trip_id", "user_id", 
"duration_min", "purpose", "start_timeh", "end_timeh", "fix_needed"
), row.names = c(NA, 4L), class = "data.frame")

I'd like to combine these trip segments based on three criteria:

Same user_id
Same trip purpose
(Start time of segment i) = (end time of segment i-1).
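The three criteria above can be checked for every row at once. This is a minimal sketch that rebuilds the six rows shown in `head(df2)` with the times stored as character strings (an assumption; the original columns are factors). The name `continues_prev` is hypothetical.

```r
# Sample rows copied from the head(df2) output, with times kept as character
df2 <- data.frame(
  trip_id      = c(151203L, 151204L, 151206L, 151207L, 151208L, 151210L),
  user_id      = "abc",
  duration_min = c(10, 1.5, 1, 3, 5.5, 4.5),
  purpose      = c(7L, 7L, 7L, 7L, 7L, 2L),
  start_timeh  = c("19:07", "19:16", "20:59", "21:03", "21:05", "21:18"),
  end_timeh    = c("19:16", "19:18", "21:03", "21:05", "21:10", "21:25"),
  stringsAsFactors = FALSE
)

n <- nrow(df2)

# TRUE when row i continues row i-1 under all three criteria.
# The first row has no predecessor, so it is FALSE by construction.
continues_prev <- c(
  FALSE,
  df2$user_id[-1]     == df2$user_id[-n] &
  df2$purpose[-1]     == df2$purpose[-n] &
  df2$start_timeh[-1] == df2$end_timeh[-n]
)

print(continues_prev)
# FALSE TRUE FALSE TRUE TRUE FALSE
```

On these rows, segments 2, 4 and 5 are flagged as continuations of the row before them.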

The final results should look like this:

Trip segments are merged into full trips.
The duration of each full trip equals the sum of the durations of all its segments.
The segments should be removed from the dataset.
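One way to produce the result described above is to give each run of consecutive segments a group number and then collapse each group into one row. This is a sketch under assumptions, not the code used in the question: the times are stored as character strings, the merged trip keeps the `trip_id` of its first segment, and the names `new_trip`, `trip_group` and `merged` are hypothetical.

```r
# Sample rows copied from the head(df2) output, with times kept as character
df2 <- data.frame(
  trip_id      = c(151203L, 151204L, 151206L, 151207L, 151208L, 151210L),
  user_id      = "abc",
  duration_min = c(10, 1.5, 1, 3, 5.5, 4.5),
  purpose      = c(7L, 7L, 7L, 7L, 7L, 2L),
  start_timeh  = c("19:07", "19:16", "20:59", "21:03", "21:05", "21:18"),
  end_timeh    = c("19:16", "19:18", "21:03", "21:05", "21:10", "21:25"),
  stringsAsFactors = FALSE
)

n <- nrow(df2)

# A row starts a new trip unless it matches the previous row on all three criteria
new_trip <- c(
  TRUE,
  !(df2$user_id[-1]     == df2$user_id[-n] &
    df2$purpose[-1]     == df2$purpose[-n] &
    df2$start_timeh[-1] == df2$end_timeh[-n])
)

# Consecutive segments of the same trip share one group number
df2$trip_group <- cumsum(new_trip)

# Collapse each group: first start time, last end time, summed duration
merged <- do.call(rbind, lapply(split(df2, df2$trip_group), function(g) {
  data.frame(
    trip_id      = g$trip_id[1],          # assumption: keep the first segment's ID
    user_id      = g$user_id[1],
    duration_min = sum(g$duration_min),
    purpose      = g$purpose[1],
    start_timeh  = g$start_timeh[1],
    end_timeh    = g$end_timeh[nrow(g)],
    stringsAsFactors = FALSE
  )
}))
rownames(merged) <- NULL

print(merged)
```

On the six sample rows this yields three trips with durations 11.5, 9.5 and 4.5 minutes. Unlike a running sum that keeps only the last segment, the merged row also carries the start time of the first segment.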

So I ran this:

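v = vector('numeric')
for (i in 2:nrow(df)) {
  # Segment i continues segment i-1: start time equals previous end time, same user, same purpose
  if (as.numeric(df$start_timeh[i]) == as.numeric(df$end_timeh[i-1]) &&
      df$user_id[i] == df$user_id[i-1] &&
      df$purpose[i] == df$purpose[i-1]) {
    # Carry the running duration forward and mark the earlier segment for removal
    df$duration_min[i] <- df$duration_min[i] + df$duration_min[i-1]
    v <- c(v, i-1)
  }
}
df <- df[-v,]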

However the results were not as I expected. A lot of segments were not removed.

Updated: There was an error with the time stamp that created this problem. The code was correct.
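The update does not say what the timestamp error was. One timestamp-related pitfall that produces the same symptom is worth checking: in the `dput` output above, `start_timeh` and `end_timeh` are factors with different level sets, and `as.numeric()` on a factor returns the integer level codes rather than the times. There, "19:16" is level 29 in `start_timeh` but level 28 in `end_timeh`, so the codes differ even though the labels are equal. The sketch below reproduces the effect on the first four rows; `codes_match` and `labels_match` are hypothetical names.

```r
# First four start and end times from the sample, stored as factors
start_timeh <- factor(c("19:07", "19:16", "20:59", "21:03"))
end_timeh   <- factor(c("19:16", "19:18", "21:03", "21:05"))

# Comparing level codes: each factor numbers its own levels independently,
# so equal labels do not imply equal codes
codes_match <- as.numeric(start_timeh[-1]) == as.numeric(end_timeh[-4])

# Comparing the labels themselves
labels_match <- as.character(start_timeh[-1]) == as.character(end_timeh[-4])

print(codes_match)   # FALSE FALSE FALSE
print(labels_match)  # TRUE FALSE TRUE
```

Converting both columns with `as.character()` before comparing, or reading the data with `stringsAsFactors = FALSE`, avoids this mismatch.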

Please consider producing a reproducible data set that demonstrates what the data actually looks like; along with a similar display for the output. Read more at [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — Cristian E. Nuno, Feb 15 '18 at 22:40
You got my downvote because you did not provide a reproducible example. After you fix this, I will retract my downvote and give you an upvote. — www, Feb 15 '18 at 22:47
Sorry I was unclear about the rule (and didn't know how to insert a sample table either). Now I've updated. — GGT, Feb 15 '18 at 23:14
@GGT: please consider using `dput` instead of `head` as suggested in the link posted above — Tung, Feb 15 '18 at 23:19
@GGT: what is your expected output? Can you add it to the question? — Tung, Feb 16 '18 at 09:02
@Tung Initially some trip legs were not removed. Thankfully I found an error with the time stamps, so the problem is fixed now! The script is correct though. — GGT, Feb 22 '18 at 01:58
