(EDIT: one of the issues here is scale: what works for one row will hang or crash R on a 200,000 * 50 dataframe. For example, strptime must be applied column-wise, not row-wise, to avoid hanging. I'm looking for working code that you actually ran on 200,000 * 50, including your measured runtime, not just casual "this is easy" remarks. It's easy to get runtimes > 12 hrs if you pick the wrong function. I also asked you to make my zero-time adjustment code faster; the job's not finished until that's done. No one has attempted that so far.)
I want to vectorize and accelerate the following multistep log-time conversion, with millisecond accuracy: converting the output of strptime() to a single numeric, followed by subtraction and then log(), on a large data frame (200,000 rows * 300 cols; other (non-time) columns omitted). Code below.
As well as making it vectorized and fast, an extra problem is that I'm not sure how best to represent the (higher-dimensional) intermediate values at each step (e.g. as the list from strptime, a matrix, or a vector). I already tried apply, sapply, lapply, vapply, plyr::maply(), ... but the incompatibility of the intermediate format(s) keeps tripping me up.
Each row has 50 columns time1..time50 (chr, format="HH:MM:SS.sss") representing time as string in millisecond resolution. I need millisecond accuracy.
Within each row, columns time1..time50 are in non-decreasing order, and I want to convert them into log of time before time50. The conversion function parse_hhmmsecms() is at the bottom and needs serious vectorization and speeding up; you can see alternative versions commented out. What I've figured out so far: strptime() is faster than (multiple) substr() calls. I then convert somehow to a list of three numerics (hh, mm, sec.ms), then to a vector, assuming the next step should be to vector-multiply with %*% c(3600,60,1) to convert to numeric seconds.
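A minimal sketch of how the commented-out strsplit() / %*% idea could be vectorized over a whole block of time strings at once. The name parse_hhmmsecms_vec is hypothetical, and the sketch assumes every string is a well-formed "HH:MM:SS.sss" with no NA values (a malformed entry would break the three-column reshape). It has not been timed on the full 200,000 * 50 data here.

```r
# Hypothetical helper: parse "HH:MM:SS.sss" strings into numeric seconds.
# Accepts a character vector, a character matrix, or a data frame of strings.
# Assumes every entry is well-formed and non-NA.
parse_hhmmsecms_vec <- function(t) {
  if (is.data.frame(t)) t <- as.matrix(t)   # character matrix, dims kept
  # One strsplit() call over all strings; unlist() yields hh, mm, ss triples
  fields <- unlist(strsplit(as.vector(t), ":", fixed = TRUE))
  parts <- matrix(as.numeric(fields), ncol = 3, byrow = TRUE)
  # A single matrix product turns (hh, mm, ss) rows into seconds
  secs <- as.vector(parts %*% c(3600, 60, 1))
  dim(secs) <- dim(t)                        # NULL for plain vectors
  secs
}
```

Because the intermediate is a plain n x 3 numeric matrix rather than a list per cell, the 2D shape of the input is only restored at the very end by resetting dim().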
Here is pseudocode of what I do for each row, and each time-string; full code is at bottom:
for each row i in dataframe {  # vectorize this, loop_apply(), or whatever...
  # for each time-column ('time1'..'time50') {  # vectorize this...
  hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
  # Main computation
  xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i,Ctime])))
  # Minor task: fix up all the 'zero-time' events to be evenly spaced between -3..0
  # }
}
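The main computation in the pseudocode above can be sketched without any row loop once the times are held as a numeric matrix of seconds. The function name log_time_before_last is hypothetical. Rounding the differences to whole milliseconds is an added assumption, not part of the original code; it is there so that floating-point noise cannot turn an exact tie into a tiny non-zero gap.

```r
# Hypothetical helper: secs is a numeric matrix of seconds, one row per
# record, columns in time order; the last column is the reference time.
log_time_before_last <- function(secs) {
  ref <- secs[, ncol(secs)]
  # ref has one value per row, so it recycles down each column of secs
  # and every row is compared against its own reference time.
  ms <- round(1000 * (ref - secs))   # assumption: snap to whole milliseconds
  -10 * log10(ms)                    # zero gaps give Inf, as in the loop
}
```

The result keeps the dimensions of secs, so it can be assigned to the logtime columns in one step instead of row by row.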
So there are five subproblems involved:
- How to vectorize handling the list returned by strptime()? Since it returns a list of 3 items, when passed a 2D dataframe or a 1D row of time-strings we will get a 3D or 2D intermediate object. (Do we internally use a list of lists? A matrix of lists? An array of lists?)
- How to vectorize the entire function parse_hhmmsecms()?
- Then do the subtraction and log.
- Vectorize the zero-time fixup code as well (this is now the slowest part by far).
- How to accelerate steps 1..4?
Code snippet below, using ten example columns time41..50 (use random_hhmmsecms() if you want a bigger sample). I did my best to follow these recommendations; this is as reproducible as I can get it in six hours' work:
# Each of 200,000 rows has 50 time strings (chr) like this...
xx <- structure(list(time41 = c("08:00:41.465", "08:00:50.573", "08:00:50.684"
), time42 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"),
time43 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"
), time44 = c("08:00:41.465", "08:00:50.664", "08:00:50.759"
), time45 = c("08:00:41.465", "08:00:50.684", "08:00:50.759"
), time46 = c("08:00:42.496", "08:00:50.684", "08:00:50.759"
), time47 = c("08:00:42.564", "08:00:50.759", "08:00:51.373"
), time48 = c("08:00:48.370", "08:00:50.759", "08:00:51.373"
), time49 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
), time50 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
)), .Names = c("time41", "time42", "time43", "time44", "time45",
"time46", "time47", "time48", "time49", "time50"), row.names = 3:5, class = "data.frame")
# Handle millisecond timing and time conversion
options('digits.secs'=3)
# Parse "HH:MM:SS.sss" timestring into (numeric) number of seconds (Very slow)
parse_hhmmsecms <- function(t) {
as.numeric(substr(t,1,2))*3600 + as.numeric(substr(t,4,5))*60 + as.numeric(substr(t,7,12)) # WORKS, V SLOW
#c(3600,60,1) %*% sapply((strsplit(t[1,]$time1, ':')), as.numeric) # SLOW, NOT VECTOR
#as.vector(as.numeric(unlist(strsplit(t,':',fixed=TRUE)))) %*% c(3600,60,1) # WANT TO VECTORIZE THIS
}
random_hhmmsecms <- function(n=1, min=8*3600, max=16*3600) {
# Generate n random hhmmsecms objects between min and max (8am:4pm)
xx <- runif(n,min,max)
ss <- xx %% 60
mm <- (xx %/% 60) %% 60
hh <- xx %/% 3600
sprintf("%02d:%02d:%06.3f", hh,mm,ss) # width 6 so seconds < 10 are zero-padded to SS.sss
}
xx$logtime45 <- xx$logtime44 <- xx$logtime43 <- xx$logtime42 <- xx$logtime41 <- NA
xx$logtime50 <- xx$logtime49 <- xx$logtime48 <- xx$logtime47 <- xx$logtime46 <- NA
# (we pass index vectors as the dataframe column ordering may change)
Ctime <- which(colnames(xx)=='time41') : which(colnames(xx)=='time50')
Clogtime <- which(colnames(xx)=='logtime41') : which(colnames(xx)=='logtime50')
for (i in seq_len(nrow(xx))) { # was 40:nrow(xx), which counts backwards on this 3-row sample
#if (i%%100==0) { print(paste('... row',i)) }
hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i,Ctime]) ))
# Now fix up all the 'zero-time' events to be evenly spaced between -3..0
Czerotime.p <- which(xx[i,Clogtime]==Inf | xx[i,Clogtime]>-1e-9)
# Czerotime.p holds positions within Clogtime, so map them back to columns of xx
xx[i,Clogtime[Czerotime.p]] <- seq(-3,0,length.out=length(Czerotime.p))
}
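A sketch of how the zero-time fixup in the loop above might be done for all rows at once. The function name fix_zero_times and its lo, hi and tol parameters are hypothetical. It assumes the log-times are held in a numeric matrix and reproduces seq(-3, 0, length.out = k) per row by arithmetic. The only loop left runs over the columns (50 here), not over the 200,000 rows. No runtime has been measured for it here.

```r
# Hypothetical helper: logt is a numeric matrix of -10*log10(ms) values.
# Zero-time entries (Inf, or within tol of 0) are respaced evenly
# from lo to hi within each row, in column order.
fix_zero_times <- function(logt, lo = -3, hi = 0, tol = 1e-9) {
  zero <- (logt == Inf) | (logt > -tol)
  zero[is.na(zero)] <- FALSE            # NA entries are left untouched
  k <- rowSums(zero)                    # number of zero-time events per row
  # pos[i, j] = running count of zero-time events in row i up to column j
  pos <- matrix(0, nrow(logt), ncol(logt))
  running <- numeric(nrow(logt))
  for (j in seq_len(ncol(logt))) {
    running <- running + zero[, j]
    pos[, j] <- running
  }
  # Per-row spacing; pmax() avoids 0/0 when a row has a single event,
  # which then gets lo, matching seq(lo, hi, length.out = 1)
  step <- (hi - lo) / pmax(k - 1, 1)
  fixed <- lo + (pos - 1) * step        # step recycles down columns, per row
  logt[zero] <- fixed[zero]
  logt
}
```

For example, a row c(-20, Inf, Inf, Inf) becomes c(-20, -3, -1.5, 0), the same values the per-row seq() call would assign.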