
I'm working on a dataset of trips (variables: user_id, purpose, start_timeh, end_timeh). Each observation should be one trip; however, due to a data error, about half of the trips were broken into trip segments. This resulted in multiple observations for a single trip (for example, trip #3 was broken into 5 segments/observations with distinct trip IDs).

> head(df2)
  trip_id user_id duration_min purpose start_timeh end_timeh fix_needed
1  151203     abc         10.0       7       19:07     19:16          1
2  151204     abc          1.5       7       19:16     19:18          1
3  151206     abc          1.0       7       20:59     21:03          1
4  151207     abc          3.0       7       21:03     21:05          1
5  151208     abc          5.5       7       21:05     21:10          1
6  151210     abc          4.5       2       21:18     21:25          0
> 
> dput(head(df2,4))
structure(list(trip_id = c(151203L, 151204L, 151206L, 151207L
), user_id = structure(c(1L, 1L, 1L, 1L), .Label = "abc", class = "factor"), 
    duration_min = c(10, 1.5, 1, 3), purpose = c(7L, 7L, 7L, 
    7L), start_timeh = structure(c(27L, 29L, 37L, 39L), .Label = c("0:08", 
    "15:50", "15:53", "15:55", "16:01", "16:10", "16:35", "17:04", 
    "17:08", "17:14", "17:25", "17:28", "17:32", "17:34", "17:48", 
    "17:54", "18:14", "18:17", "18:19", "18:28", "18:41", "18:44", 
    "18:47", "18:50", "18:54", "18:56", "19:07", "19:08", "19:16", 
    "19:18", "19:19", "19:23", "19:30", "19:59", "2:12", "2:25", 
    "20:59", "21:00", "21:03", "21:05", "21:18"), class = "factor"), 
    end_timeh = structure(c(28L, 29L, 36L, 37L), .Label = c("0:49", 
    "15:50", "15:53", "15:55", "16:01", "16:04", "16:12", "16:57", 
    "17:06", "17:08", "17:25", "17:32", "17:34", "17:52", "17:55", 
    "17:56", "18:14", "18:16", "18:19", "18:30", "18:44", "18:47", 
    "18:50", "18:54", "18:56", "18:58", "19:10", "19:16", "19:18", 
    "19:19", "19:27", "19:28", "19:32", "2:17", "20:06", "21:03", 
    "21:05", "21:10", "21:25", "21:39", "3:05"), class = "factor"), 
    fix_needed = c(1L, 1L, 1L, 1L)), .Names = c("trip_id", "user_id", 
"duration_min", "purpose", "start_timeh", "end_timeh", "fix_needed"
), row.names = c(NA, 4L), class = "data.frame")

I'd like to combine these trip segments based on three criteria:

  • Same user_id
  • Same trip purpose
  • (Start time of segment i) = (end time of segment i-1).
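The three criteria can be sketched as a single logical vector. This is only an illustration: `seg` and `continues` are hypothetical names, and the sample assumes the times are stored as character strings rather than factors.

```r
# Hypothetical four-segment sample mirroring the dput() output above,
# with times stored as character strings instead of factors.
seg <- data.frame(
  user_id     = c("abc", "abc", "abc", "abc"),
  purpose     = c(7L, 7L, 7L, 7L),
  start_timeh = c("19:07", "19:16", "20:59", "21:03"),
  end_timeh   = c("19:16", "19:18", "21:03", "21:05"),
  stringsAsFactors = FALSE
)

n <- nrow(seg)

# TRUE when row i continues row i-1 under all three criteria;
# the first row has no predecessor, so it is FALSE by construction.
continues <- c(
  FALSE,
  seg$user_id[-1]     == seg$user_id[-n] &
  seg$purpose[-1]     == seg$purpose[-n] &
  seg$start_timeh[-1] == seg$end_timeh[-n]
)

print(continues)
```

Rows flagged TRUE are the ones that would be merged into the row before them.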

The final results should look like this:

  • Trip segments are merged into full trips.
  • The duration of each full trip equals the sum of the durations of all its segments.
  • The segments should be removed from the dataset.
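A minimal sketch of that target output, assuming character time stamps and rows already sorted by user and time. `df_demo`, `new_trip`, `trip_group` and `merged` are hypothetical names, and keeping the first segment's trip_id and start time is one possible choice, not something the criteria dictate.

```r
# Hypothetical copy of the six rows shown by head(df2), times as strings.
df_demo <- data.frame(
  trip_id      = c(151203L, 151204L, 151206L, 151207L, 151208L, 151210L),
  user_id      = "abc",
  duration_min = c(10, 1.5, 1, 3, 5.5, 4.5),
  purpose      = c(7L, 7L, 7L, 7L, 7L, 2L),
  start_timeh  = c("19:07", "19:16", "20:59", "21:03", "21:05", "21:18"),
  end_timeh    = c("19:16", "19:18", "21:03", "21:05", "21:10", "21:25"),
  stringsAsFactors = FALSE
)

n <- nrow(df_demo)

# A row starts a new trip unless it continues the previous row.
new_trip <- c(
  TRUE,
  !(df_demo$user_id[-1]     == df_demo$user_id[-n] &
    df_demo$purpose[-1]     == df_demo$purpose[-n] &
    df_demo$start_timeh[-1] == df_demo$end_timeh[-n])
)

# Running count of trip starts gives each row a trip group number.
trip_group <- cumsum(new_trip)

# Collapse each group to one row: first id/start, last end, summed duration.
merged <- do.call(rbind, lapply(split(df_demo, trip_group), function(g) {
  data.frame(
    trip_id      = g$trip_id[1],
    user_id      = g$user_id[1],
    duration_min = sum(g$duration_min),
    purpose      = g$purpose[1],
    start_timeh  = g$start_timeh[1],
    end_timeh    = g$end_timeh[nrow(g)],
    stringsAsFactors = FALSE
  )
}))

print(merged)
```

On this sample the six segments collapse into three trips, with the segment rows no longer present.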

So I ran this:

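v <- vector('numeric')  # indices of segments absorbed into the following row
for (i in 2:nrow(df)) {
  # Compare times as strings: as.numeric() on a factor returns level codes,
  # which need not match between start_timeh and end_timeh.
  if (as.character(df$start_timeh[i]) == as.character(df$end_timeh[i-1]) &&
      df$user_id[i] == df$user_id[i-1] && df$purpose[i] == df$purpose[i-1]) {
    # Carry the accumulated duration forward into the later segment
    df$duration_min[i] <- df$duration_min[i] + df$duration_min[i-1]
    v <- c(v, i-1)
  }
}
# Drop the absorbed segments; an empty v would make df[-v, ] drop every row
if (length(v) > 0) df <- df[-v, ]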

However, the results were not as I expected: many segments were not removed.

Update: An error in the time stamps caused this problem. The code was correct.
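One way time stamps can fail to match is sketched below. It is offered as an illustration, not as the confirmed cause here. In the `dput()` output above, `start_timeh` and `end_timeh` are factors with different level sets, so "19:16" is level 29 in one and level 28 in the other. The names `start_f`, `end_f`, `codes_match` and `strings_match` are hypothetical.

```r
# Two small factors whose level sets differ, as in the dput() output.
start_f <- factor(c("19:07", "19:16"))  # levels: "19:07", "19:16"
end_f   <- factor(c("19:16", "19:18"))  # levels: "19:16", "19:18"

# as.numeric() on a factor returns the integer level code, not the time.
# "19:16" is code 2 in start_f but code 1 in end_f.
codes_match <- as.numeric(start_f[2]) == as.numeric(end_f[1])

# Comparing the labels themselves gives the intended result.
strings_match <- as.character(start_f[2]) == as.character(end_f[1])

print(c(codes_match = codes_match, strings_match = strings_match))
```

Reading the data with `stringsAsFactors = FALSE`, or converting with `as.character()` before comparing, sidesteps the level-code mismatch.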

GGT
  • Please consider producing a reproducible data set that demonstrates what the data actually looks like, along with a similar display for the output. Read more at [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Cristian E. Nuno Feb 15 '18 at 22:40
  • You got my downvote because you did not provide a reproducible example. After you fix this, I will retract my downvote and give you an upvote. – www Feb 15 '18 at 22:47
  • Sorry I was unclear about the rule (and didn't know how to insert a sample table either). Now I've updated. – GGT Feb 15 '18 at 23:14
  • @GGT: please consider using `dput` instead of `head` as suggested in the link posted above – Tung Feb 15 '18 at 23:19
  • Thanks @Tung. I hope it's correct now. – GGT Feb 16 '18 at 01:07
  • I have retracted my downvote and given you an upvote. – www Feb 16 '18 at 01:08
  • @GGT: what is your expected output? Can you add it to the question? – Tung Feb 16 '18 at 09:02
  • @Tung Initially some trip legs were not removed. Thankfully I found an error with the time stamps, so the problem is fixed now! The script is correct though. – GGT Feb 22 '18 at 01:58

0 Answers