0

I have a data set that has information on where people went over a certain number of days period-- the data has three level of nesting in long format. First is person, second is day, third is location. Each row indicates a location. I have information type of location (home, work, etc), travel mode used to get to the location (walk, bike, public bus, etc.), and arrival and departure time. It is a daily travel journal that starts from home and ends at home.
I need to aggregate the data to create get information on the following types of journeys for each person for each day:

1. Journey from home to work without detour. (H-W)
2. Journey from home to work with detour. (H-dt-W) the number of detour does not matter. 
3. Journey work to home without detour. (W-H) 
4. Journey work to home with detour. (W-dt-H) the number of detour does not matter
5. Journey starting from home and ending at home and does not include work in between. (H-O..-H)
6. Journey starting from work and ending at work and does not include home in between. (W-O..-W)

For all these categories I need information of travel mode, and total travel time. For example: Imagine a Monday; a person wakes up at his home(H) has his breakfast and heads for office(W) on his car; on the way he stops at starbuck for a coffee (C) and then pick-ups a co-worker from their home(D). During the day at work, the person goes to visit a client at different location (E) and comes back to work; this time he takes train. The person then leave for home early that day because he needs to go for grocery. so the person return home, and goes for grocery at location(F), and comes back home, this time walked to the grocery store. This person made different kinds of journeys: 1)H-dt(C-D)-W, 2)W-O(E)-W, 3)W-H, 5)H-O(F)-H. He used different modes for the journey, 1) driving, 2)train, 3)walk. We can add travel time to each location as well using arrival and departure time. Below is the tabular form of the data. (The data below is only for a day for a person, but my data has more days and people).

    ###Data I have
Person   Day ID     Place   Location_Code   Mode    Arrive      Depart
   5        1       0           H           NA      NA          8:00:00 AM
   5        1       1           C           D       8:30:00 AM  9:30:00 AM
   5        1       2           D           D       10:00:00 AM 11:00:00 AM
   5        1       3           W           D       11:30:00 AM 12:00:00 PM
   5        1       4           E           T       1:00:00 PM  1:30:00 PM
   5        1       5           W           T       2:30:00 PM  3:45:00 PM
   5        1       6           H           D       4:00:00 PM  4:30:00 PM
   5        1       7           F           P       5:00:00 PM  6:00:00 PM
   5        1       8           H           P       7:00:00 PM  NA


###Data I want
Person  Day     Journey Type    Mode/s  Travel Time(hr)     
5       1       H-dt-W          DDD         1.5     
5       1       W-O-W           TT          2       
5       1       W-H             D           0.25        
5       1       H-O-H           PP          1.5

I have also enter image description hereattached a picture of the data as I have and the data I want to have.

009
  • 35
  • 3

1 Answers1

1

Here is a solution using functions from tidyverse, data.table, lubridate, and stringr. dt6 is the final output. Notice that dt6 is exactly the same as your desired output except the Journey Type column because I don't know the logic and meening of your coding (like Why H-C-D-W is H-dt(C-D)-W?). I just combined all information. You can change the coding based on your preference.

# Load package
library(tidyverse)
library(data.table)
library(lubridate)
library(stringr)

Data Preparation

# Create example data frame
dt <- read.table(text = "Person   'Day ID'     Place   Location_Code   Mode    Arrive      Depart
5        1       0           H           NA      NA          '8:00:00 AM'
5        1       1           C           D       '8:30:00 AM'  '9:30:00 AM'
5        1       2           D           D       '10:00:00 AM' '11:00:00 AM'
5        1       3           W           D       '11:30:00 AM' '12:00:00 PM'
5        1       4           E           T       '1:00:00 PM'  '1:30:00 PM'
5        1       5           W           T       '2:30:00 PM'  '3:45:00 PM'
5        1       6           H           D       '4:00:00 PM'  '4:30:00 PM'
5        1       7           F           P       '5:00:00 PM'  '6:00:00 PM'
5        1       8           H           P       '7:00:00 PM'  NA",
                 header = TRUE, stringsAsFactors = FALSE)

Step 1: Convert Arrive and Depart to date time class

The year and month, 2000-01, does not matter if all your movement events happend on the same date. I just added them to be easier convert to date time class.

dt2 <- dt %>%
  mutate(Arrive = ymd_hms(paste0("2000-01-", Day.ID, " ", Arrive)),
         Depart = ymd_hms(paste0("2000-01-", Day.ID, " ", Depart))) 

Step 2: Convert data frame from wide format to long format based on Arrive and Depart. Create a MoveID, which is one lag difference to the Place column.

dt3 <- dt2 %>%
  # Convert to long format
  gather(Action, Time, Arrive, Depart) %>%
  arrange(Person, Day.ID, Place, Location_Code, Action) %>%
  group_by(Person, Day.ID, Place, Location_Code) %>%
  # Create a Moving ID
  mutate(MoveID = lag(Place)) %>%
  ungroup() %>%
  fill(MoveID, .direction = "down") 

Step 3: Calculate time difference between departure and arrive per MoveID

dt4 <- dt3 %>%
  # Calculate time difference 
  group_by(Person, Day.ID, MoveID) %>%
  summarise(Travel_Time = difftime(dplyr::last(Time), dplyr::first(Time),
                                   units = "hours")) %>%
  ungroup() %>%
  select(MoveID, Travel_Time) %>%
  right_join(dt3, by = "MoveID") 

Step 4: Shift the Travel_Time by one. Create a Run Length ID based on Mode.

dt5 <- dt4 %>%
  mutate(Travel_Time = lag(Travel_Time)) %>%
  mutate(RunID = rleid(Mode)) %>%
  group_by(Person, Day.ID, Place) %>%
  slice(1) %>%
  select(-Action, -Time) %>%
  ungroup()

Step 5: Create all the desired column

dt6 <- dt5 %>%
  group_by(Person, Day.ID, RunID) %>%
  summarise(Travel_Time_Sum = sum(Travel_Time), 
            Mode_Sum = paste(Mode, collapse = ""),
            Journey = paste(Location_Code, collapse = "-")) %>%
  mutate(Journey = paste(str_sub(lag(Journey), start = -1, end = -1), 
                         Journey, sep = "-")) %>%
  # Remove any columns with NA in Travel_Time_Sum
  drop_na(Travel_Time_Sum) %>%
  select(Person, Day = Day.ID, `Journey Type` = Journey, `Mode/s` = Mode_Sum, 
         `Travel Time(hr)` = Travel_Time_Sum)
www
  • 38,575
  • 12
  • 48
  • 84
  • Thanks for the suggestion @ycw. The code works good! However, when I try it with my larger dataset, it seems to act weird. I tested the code with travel information for two days for same person and it acted weird as well. It works fine when the travel pattern for day 1 and day 2 are exact the same, but does not if the pattern or number of location are different. This might be solved by simply modifying the code you have suggested but I an a bit new to R. Could you suggest how to solve the issue? Thanks again! – 009 Jun 29 '17 at 18:58
  • It may be easier for others to help if you post a new question with a good reproducible example and desired output. I cannot help you because I cannot see your day 2 data. If this solution solves the current example dataset, please accept it. – www Jun 29 '17 at 19:02
  • Thanks @ycw. I will post new question with the different example. I appreciate your help on solving this problem. – 009 Jun 29 '17 at 19:13
  • I am glad to help. I will visit your new post to see if I can provide further assistance. You can accept this answer by marking the green arrows on the top left corner of this answer. – www Jun 29 '17 at 19:15