7

I am new to R, but after taking an intro course and playing with it a bit, I'm hopeful that it can 1) more elegantly solve my modelling objectives (compared to Excel, which is my backup plan) and 2) be a useful skill to take away from this project.

The task/objective:

I am attempting to use driving diary data to simulate and model potential energy and GHG emissions from electric cars. Specifically:

  1. I have driving diary data (start and end time stamps, plus other data of thousands of drivers -- basic sample below) that I want to translate into:
  2. 24-hour time series data, such that for each minute of a 24-hour period, I know exactly who is driving a vehicle, and what 'trip' that it belongs to (for that driver). My problem here focuses on this issue.

The type of output I would like: NOTE: this output is NOT related to the sample data provided below. I used the first ten minutes of a certain day with some theoretical trips just as an example

enter image description here

Not essential to this problem, but may be useful to know: I will use the above output to cross-reference other driver-specific data to calculate minute-by-minute consumption of gasoline (or electricity) based on things associated with that trip, such as parking location or trip distance. I would like to do this in R but must first figure out the above problem before I move onto this step.

The solution I have so far is based on:

The problem:

Example simplified data:

a <- c("A","A","A","B","B","B","C","C","C")
b <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
c <- as.POSIXct(c(0.29167, 0.59375, 0.83333, 0.45833, 0.55347, 0.27083, 0.34375, 0.39236, 0.35417)*24*3600 + as.POSIXct("2013-1-1 00:00") )
d <- as.POSIXct(c(0.334027778, 0.614583333, 0.875, 0.461805556, 0.563888889, 0.295138889, 0.375, 0.503472222, 0.364583333)*24*3600 + as.POSIXct("2013-1-1 00:00"))
e <- c(2, 8, 2, 5, 5, 2, 5, 5, 2)
f <- as.POSIXct(c(0, 0.875, 0, 0.479166666666667, 0.580555555555556, 0.489583333333333, 0.430555555555556, 0.541666666666667, 0.711805555555555)*24*3600 + as.POSIXct("2013-1-1 00:00"))
g <- as.POSIXct(c(0, 0.885, 0, 0.482638888888889, 0.588194444444444, 0.496527777777778, 0.454861111111111, 0.559027777777778, 0.753472222222222)*24*3600 + as.POSIXct("2013-1-1 00:00"))
h <- c(0, 1, 0, 1, 4, 8, 8, 1, 5)
i <- as.POSIXct(c(0, 0, 0, 0.729166666666667, 0.595833333333333, 0.534722222222222, 0.59375, 0.779861111111111, 0.753472222222222)*24*3600 + as.POSIXct("2013-1-1 00:00"))
j <- as.POSIXct(c(0, 0, 0, 0.736111111111111, 0.605555555555556, 0.541666666666667, 0.611111111111111, 0.788194444444445, 0.75625)*24*3600 + as.POSIXct("2013-1-1 00:00"))
k <- c(0, 0, 0, 4, 4, 2, 5, 8,1)
testdata <- data.frame(a,b,c,d,e,f,g,h,i,j,k)
names(testdata) <- c("id", "Day", "trip1_start", "trip1_end", "trip1_purpose", "trip2_start", "trip2_end", "trip2_purpose", "trip3_start", "trip3_end", "trip3_purpose")

In this example data, I have three drivers (id = A, B, C) who each drive on three different days (day = 1, 2, 3). Note that some drivers may have different numbers of trips. The time stamps indicate start and end time of driving activities.

I then create minute intervals for a entire day (January 1, 2013)

start.min <- as.POSIXct("2013-01-01 00:00:00 PST")
end.max <- as.POSIXct("2013-01-01 23:59:59 PST")
tinterval <- seq.POSIXt(start.min, end.max, na.rm=T, by = "mins")

Insert "1" during minutes where a given user is driving:

out1 <- xts(,align.time(tinterval,60))
# loop over each user
for(i in 1:NROW(testdata)) {
  # paste the start / end times into an xts-style range
  timeRange <- paste(format(testdata[i,c("trip1_start","trip1_end")]),collapse="/")
  # add the minute "by parameter" for timeBasedSeq
  timeRange <- paste(timeRange,"M",sep="/")
  # create the by-minute sequence and align to minutes to match "out"
  timeSeq <- align.time(timeBasedSeq(timeRange),60)
  # create xts object with "1" entries for times between start and end
  temp1 <- xts(rep(1,length(timeSeq)),timeSeq)
  # merge temp1 with out and fill non-matching timestamps with "0"
  out1 <- merge(out1, temp1, fill=0)
}
# add column names
colnames(out1) <- paste(testdata[,1], testdata[,2], sep = ".")

The idea is to then repeat this for each trip, e.g. out2, out3, etc. wherein I would fill any driving periods with "2", "3", etc., and then sum/merge all of the resulting outx dataframes, and eventually get the desired result.

Unfortunately when I try to repeat this for out2...

out2 <- xts(,align.time(tinterval,60))
for(i in 1:NROW(testdata)) {
  timeRange2 <- paste(format(testdata[i,c("trip2_start","trip2_end")]),collapse="/")
  timeRange2 <- paste(timeRange2,"M",sep="/")
  timeSeq2 <- align.time(timeBasedSeq(timeRange2),60)
  temp2 <- xts(rep(2,length(timeSeq2)),timeSeq2)
  out2 <- merge(out2, temp2, fill=0)
}
colnames(out2) <- paste(testdata[,1], testdata[,2], sep = ".")
head(out2)

I get the following errors:

  • Error in UseMethod("align.time") : no applicable method for 'align.time' applied to an object of class "Date"
  • Error in colnames<-(*tmp*, value = c("A.1", "A.2", "A.3", "B.1", "B.2", : attempt to set 'colnames' on an object with less than two dimensions

What is wrong with my code for out2?

Are there any other better solutions or packages I can learn about?

I realize this is probably a very roundabout way to get to my desired output.

Any help would be much appreciated.

Community
  • 1
  • 1
George K
  • 73
  • 7

2 Answers2

1

In this solution I read your original data and format it to get the generated data of my previous answer. The data provided is limited to 22 trips by driver, but the reshaping here is not limited by the number of trips. The idea is similar to the one used to generate sample data. I am using data.table since it is handy to manipulate data per group.

So for each(day,driver) I do the following:

  1. create a vector of zeros of length the number of minutes
  2. read start and end position using XXXstrip_start and XXXstrip_end.
  3. create sequence seq(start,end)
  4. use this sequence to change zeros by a sequence of number

Here my code:

start.min <- as.POSIXct("2013-01-01 00:00:00 PST")
hours.min <- format(seq(start.min, 
                        length.out=24*60, by = "mins"),
                    '%H:%M')
library(data.table)
diary <- read.csv("samplediary.csv",
                  stringsAsFactors=FALSE)
DT <- data.table(diary,key=c('id','veh_assigned','day'))

dat <- DT[, as.list({ .SD;nb.trip=sum_trips
           tripv <- vector(mode='integer',length(hours.min))
           if(sum_trips>0){
             starts = mget(paste0('X',seq(nb.trip),'_trip_start'))
             ends = mget(paste0('X',seq(nb.trip),'_trip_end'))
             ids <- mapply(function(x,y){
                                        seq(as.integer(x),as.integer(y))},
                           starts,ends,SIMPLIFY = FALSE)
             for (x in seq_along(ids))tripv[ids[[x]]] <- x
             }
            tripv
           }),
   by=c('id','day')]
setnames(x=dat,old=paste0('V',seq(hours.min)),hours.min)

Here what you get for if you subset the 10 first variables :

dat[1:10,1:10,with=FALSE]


       id day 00:00 00:01 00:02 00:03 00:04 00:05 00:06 00:07
 1: 3847339   1     0     0     0     0     0     0     0     0
 2: 3847384   1     0     0     0     0     0     0     0     0
 3: 3847436   1     0     0     0     0     0     0     0     0
 4: 3847439   1     0     0     0     0     0     0     0     0
 5: 3847510   1     0     0     0     0     0     0     0     0
 6: 3847536   1     0     0     0     0     0     0     0     0
 7: 3847614   1     0     0     0     0     0     0     0     0
 8: 3847683   1     0     0     0     0     0     0     0     0
 9: 3847841   1     0     0     0     0     0     0     0     0
10: 3847850   1     0     0     0     0     0     0     0     0

One idea is to create a heatmap of your data ( at least per day) to get some intutions and see overlapping drivers for example. Here 2 ways to do this using lattice and ggplot2 but first I will reshape the data in the long format using reshape2

library(reshape2)
dat.m <- melt(dat,id.vars=c('id','day'))

Then I plot my heatmap to see which drivers are overlapping with others for example:

library(lattice)
levelplot(value~as.numeric(variable)*factor(id),data=dat.m)

enter image description here

library(ggplot2)
ggplot(dat.m, aes(x=as.numeric(variable),y=factor(id)))+ 
        geom_tile(aes(fill = value)) +
  scale_fill_gradient(low="grey",high="blue")

enter image description here

agstudy
  • 119,832
  • 17
  • 199
  • 261
  • Wow! Thank you @agstudy. This is exactly the output I was looking for. Your explanation is also very clear. Thank you so much for sharing your time and knowledge. I think I will try to tackle the next steps with R as well. – George K Jun 29 '13 at 05:19
  • @GeorgeK you are welcome. I think also you can change the `tripv[ids[[x]]] <- x` (simple sequence) by distance or/and speed to get more insights .... – agstudy Jun 29 '13 at 05:21
0

This is not an answer to your problem. Honestly it is not clear for me the transition between the data you show in the image and the example of your data. It seems like that you can't reproduce this data. So Here a function to generate a reproducible example of your data. I think it can be at least useful, to validate your model.

function to sample data

library(reshape2)
start.min <- as.POSIXct("2013-01-01 00:00:00 PST")
hours.min <- format(seq(start.min, 
                        length.out=24*60, by = "mins"),
                    '%H:%M')

## function to generate a trip sample
## min.dur : minimal duration of a trip
## max.dur : maximal duration of a trip
## min.trip : minimal number of trips that a user can do 

gen.Trip <- function(min.dur=3,max.dur=10,min.trip=100){
  ## gen number of trip
  n.trip <- sample(seq(min.trip,20),1)
  ## for each trip generate the durations
  durations <- rep(seq(1,n.trip),
                   times=sample(seq(min.dur,max.dur),
                                max(min.dur,n.trip),rep=TRUE))
  ## generate a vector of positions
  rr <- rle(durations)
  mm <- cumsum(rr$lengths)
  ## idrty part here
  pos <- sort(sample(seq(1,length(hours.min)-2*max(mm)),
              n.trip,rep=FALSE)) + mm
  ## assign each trip to each posistion  
  val <- vector(mode='integer',length(hours.min))
  for(x in seq_along(pos))
    val[seq(pos[x],length.out=rr$len[x])] <- rr$val[x]
  val
}

generate trips for 100 drivers

set.seed(1234)
nb.drivers <- 100
res <- replicate(nb.drivers,gen.Trip(),simplify=FALSE)
res <- do.call(rbind,res)
colnames(res) <- hours.min
rownames(res) <- paste0('driv',seq(nb.drivers))

wide format

head(res[,10:30])
  ##       00:09 00:10 00:11 00:12 00:13 00:14 00:15 00:16 00:17 00:18 00:19
## driv1     0     0     0     0     0     0     1     1     1     1     1
## driv2     0     1     1     1     1     1     1     2     2     2     1
## driv3     0     0     0     0     0     0     0     0     0     0     0
## driv4     1     1     1     0     0     0     0     0     0     0     0
## driv5     0     0     0     0     0     0     0     0     0     0     1
## driv6     0     0     0     0     0     0     0     0     0     0     0
##       00:20 00:21 00:22 00:23 00:24 00:25 00:26 00:27 00:28 00:29
## driv1     1     1     0     0     2     2     2     2     2     2
## driv2     0     0     0     0     0     0     3     3     3     3
## driv3     0     0     0     0     0     0     0     0     0     0
## driv4     0     0     0     0     0     0     0     0     0     0
## driv5     1     1     1     1     1     1     1     1     0     0
## driv6     0     0     0     0     0     0     0     0     0     0

long format

res.m <- melt(res)
head(res.m)
##    Var1  Var2 value
## 1 driv1 00:00     0
## 2 driv2 00:00     0
## 3 driv3 00:00     0
## 4 driv4 00:00     0
## 5 driv5 00:00     0
## 6 driv6 00:00     0
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • Thanks @agstudy. I will be using data collected from real drivers (i.e. I won't be generating it myself), but this sample could be useful for validating my model on a larger set here. – George K Jun 28 '13 at 18:31
  • @GeorgeK Does this make sense to answer your question using my generated data? – agstudy Jun 28 '13 at 18:53
  • Sorry, I think I may have confused everyone with my sample output. It is not related to or the result of the sample data I provided. I was just trying to show a snapshot of the first 10 minutes of a given day, what it would look like, using some theoretical trips (explained in the right-most column). – George K Jun 28 '13 at 21:26
  • Hi @agstudy. I have tried working with the code you kindly provided. It is not entirely clear to me (due to my limited experience) how your generated data (based on defined specs, etc.) translated into your output. I'm not sure if it allows me to use collected data rather than generated data? – George K Jun 28 '13 at 21:29
  • @GeorgeK What do you mean? The generated data they don't reproduce the data collected( the linked pictures?)? If no why ? If it is ok , my question is simple: What is the relation between the collected data and the data you provide in this question? Does testdata a transformation of the collected data? – agstudy Jun 28 '13 at 21:33
  • Yes, they do produce the type of data I want. However it is not exactly clear to me where (or how) I would insert my own data so I can get the output? Here is a Dropbox link to some data I have: https://dl.dropboxusercontent.com/u/72131660/samplediary.csv The column headers aren't the same as my 'sample' I provided, but it should give you a better sense of the type of data I have to deal with. The times are in minutes (from 0 to 1440). – George K Jun 28 '13 at 22:26
  • To respond directly to your second question: The collected data has a lot more information. I basically provided some very simple data just as an example here. Thanks for your patience. – George K Jun 28 '13 at 22:31
  • @GeorgeK reading you csv file. Did you assume a max(tripx) = 22? – agstudy Jun 28 '13 at 22:59
  • Yes, I only ask for up to 22 trips per day from each driver. – George K Jun 29 '13 at 02:27