How to vectorize and accelerate strtime() logtime conversion on a data-frame

Question

(EDIT: one of the issues here is scale: what works for one row will blow up or crash R on a 200,000 * 50 dataframe. For example, strptime must be applied column-wise, not row-wise, to avoid hanging. I'm looking for working code solutions that you actually ran on 200,000 * 50, including your measured runtime, not just casual "this is easy" remarks. It's easy to get runtimes > 12 hrs if you pick the wrong function. I also asked you to make my zero-time adjustment code faster; the job's not finished till that's done. No one has attempted that so far.)

I want to vectorize and accelerate the following multistep log-time conversion, with millisecond accuracy, involving converting strtime() to a single numeric, followed by subtraction and then log(), on a large data-frame (200,000 rows * 300 cols; other (non-time) columns omitted). Code below. As well as making it vectorized and fast, an extra problem is that I'm not sure how best to represent the (higher-dimensional) intermediate values at each step (e.g. as a list from strtime, a matrix, or a vector). I already tried apply, sapply, lapply, vapply, plyr::maply(), ... but the incompatibility of the intermediate formats keeps messing me up...

Each row has 50 columns time1..time50 (chr, format="HH:MM:SS.sss") representing time as a string at millisecond resolution. I need millisecond accuracy. Within each row, columns time1..time50 are in non-decreasing order, and I want to convert them into the log of the time remaining before time50. The conversion function parse_hhmmsecms() is at the bottom and needs serious vectorization and speeding up; you can see alternative versions commented out. What I have figured out so far: strtime() is faster than (multiple) substr() calls. I then convert the result to a list of three numerics (hh, mm, sec.ms), then to a vector, assuming the next step should be to vector-multiply with %*% c(3600,60,1) to convert to numeric seconds. Here is pseudocode of what I do for each row and each time-string; full code is at the bottom:

 for each row in dataframe { # vectorize this, loop_apply(), or whatever...
 #for each time-column index i ('time1'..'time50') { # vectorize this...
 hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
 # Main computation
 xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i,Ctime]) ))
 # Minor task: fix up all the 'zero-time' events to be evenly spaced between -3..0
 #}
 }

So there are five subproblems involved:

1. How to vectorize handling the list returned by strtime()? Since it returns a list of 3 items, when passed a 2D dataframe or a 1D row of time-strings we will get a 3D or 2D intermediate object. (Do we internally use a list of lists? A matrix of lists? An array of lists?)
2. How to vectorize the entire function parse_hhmmsecms()?
3. Then do the subtraction and log.
4. Vectorize the zero-time fixup code as well (this is now the slowest part by far).
5. How to accelerate steps 1...4?
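One possible way to approach subproblems 1 to 3 is sketched below, assuming every string has the fixed-width "HH:MM:SS.sss" layout. It sidesteps the list-valued intermediate entirely by parsing with substr() on one long character vector and reshaping the result into a numeric matrix. The helper name parse_hhmmsecms_vec and the small `times` data frame are made up for illustration; this has not been timed on the 200,000 * 50 data.

```r
# Hypothetical helper (not from the original post): parse a character vector
# of fixed-width "HH:MM:SS.sss" strings into numeric seconds since midnight.
parse_hhmmsecms_vec <- function(t) {
  t <- as.character(t)
  # substr() and as.numeric() operate on the whole vector in one call
  as.numeric(substr(t, 1, 2)) * 3600 +
    as.numeric(substr(t, 4, 5)) * 60 +
    as.numeric(substr(t, 7, 12))
}

# Small stand-in for the time1..time50 columns (values taken from the sample rows)
times <- data.frame(
  time1 = c("08:00:41.465", "08:00:50.573"),
  time2 = c("08:00:42.496", "08:00:50.684"),
  time3 = c("08:00:50.573", "08:00:50.759"),
  stringsAsFactors = FALSE
)

# unlist() concatenates the columns in order, which matches the column-major
# fill of matrix(), so one parse call covers the whole table.
secs <- matrix(parse_hhmmsecms_vec(unlist(times, use.names = FALSE)),
               nrow = nrow(times),
               dimnames = list(NULL, names(times)))

# Seconds remaining until the last column. A vector of length nrow recycles
# down each column, so dt[i, j] == last[i] - secs[i, j].
last <- secs[, ncol(secs)]
dt <- last - secs

# Same transform as in the question; entries with dt == 0 become Inf
logtime <- -10 * log10(1000 * dt)
```

Keeping the intermediate as a plain numeric matrix means the subtraction and log10() are single vectorized operations, and the result can be assigned back to the logtime columns in one step.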

Code snippet below using ten example columns time41..50 (use random_hhmmsecms() if you want a bigger sample)

I did my best to follow these recommendations, this is as reproducible as I can get it in six hours' work:

# Each of 200,000 rows has 50 time strings (chr) like this...    
xx <- structure(list(time41 = c("08:00:41.465", "08:00:50.573", "08:00:50.684"
), time42 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"), 
    time43 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"
    ), time44 = c("08:00:41.465", "08:00:50.664", "08:00:50.759"
    ), time45 = c("08:00:41.465", "08:00:50.684", "08:00:50.759"
    ), time46 = c("08:00:42.496", "08:00:50.684", "08:00:50.759"
    ), time47 = c("08:00:42.564", "08:00:50.759", "08:00:51.373"
    ), time48 = c("08:00:48.370", "08:00:50.759", "08:00:51.373"
    ), time49 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
    ), time50 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
    )), .Names = c("time41", "time42", "time43", "time44", "time45", 
"time46", "time47", "time48", "time49", "time50"), row.names = 3:5, class = "data.frame")

# Handle millisecond timing and time conversion
options('digits.secs'=3)

# Parse "HH:MM:SS.sss" timestring into (numeric) number of seconds (Very slow)
parse_hhmmsecms <- function(t) {
  as.numeric(substr(t,1,2))*3600 + as.numeric(substr(t,4,5))*60 + as.numeric(substr(t,7,12)) # WORKS, V SLOW

  #c(3600,60,1) %*% sapply((strsplit(t[1,]$time1, ':')), as.numeric) # SLOW, NOT VECTOR

  #as.vector(as.numeric(unlist(strsplit(t,':',fixed=TRUE)))) %*% c(3600,60,1) # WANT TO VECTORIZE THIS
}

random_hhmmsecms <- function(n=1, min=8*3600, max=16*3600) {
  # Generate n random "HH:MM:SS.sss" strings between min and max seconds (default 8am..4pm)
  xx <- runif(n,min,max)
  ss <- xx %%  60
  mm <- (xx %/% 60) %% 60
  hh <- xx %/% 3600
  # %06.3f zero-pads seconds to the fixed width "SS.sss" (%05.3f would drop the leading zero)
  sprintf("%02d:%02d:%06.3f", hh,mm,ss)
}

xx$logtime45 <- xx$logtime44 <- xx$logtime43 <- xx$logtime42  <- xx$logtime41  <- NA
xx$logtime50 <- xx$logtime49 <- xx$logtime48 <- xx$logtime47  <- xx$logtime46  <- NA

# (we pass index vectors as the dataframe column ordering may change) 
Ctime <- which(colnames(xx)=='time41') : which(colnames(xx)=='time50')
Clogtime <- which(colnames(xx)=='logtime41') : which(colnames(xx)=='logtime50')
for (i in seq_len(nrow(xx))) {   # was 40:nrow(xx), which counts down from 40 on this 3-row sample
  #if (i%%100==0) { print(paste('... row',i)) }

  hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
  xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(unlist(xx[i,Ctime])) ))

  # Now fix up all the 'zero-time' events to be evenly spaced between -3..0
  # (Czerotime.p indexes into Clogtime, so map it back to the logtime columns of xx)
  Czerotime.p <- which(xx[i,Clogtime]==Inf | xx[i,Clogtime]>-1e-9)
  xx[i,Clogtime[Czerotime.p]] <- seq(-3,0,length.out=length(Czerotime.p))
}
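The zero-time fixup can probably be vectorized as well. The sketch below assumes the logtime values are already in a numeric matrix (for example via as.matrix() on the logtime columns) and loops over the columns, not the rows, to number the zero-time events within each row. The function name fix_zero_times and its arguments are hypothetical, and no runtime on the full data is claimed.

```r
# Hypothetical helper (not from the original post): spread the 'zero-time'
# entries of each row evenly over [lo, hi], like seq(lo, hi, length.out = n).
fix_zero_times <- function(lt, lo = -3, hi = 0, tol = 1e-9) {
  # Same condition as the loop version: Inf, or within tol of zero or above
  z <- !is.na(lt) & lt > -tol

  # k[i, j] = running count of zero-time entries in row i up to column j.
  # The loop runs once per column, and each step handles all rows at once.
  k <- matrix(0L, nrow(lt), ncol(lt))
  running <- integer(nrow(lt))
  for (j in seq_len(ncol(lt))) {
    running <- running + z[, j]
    k[, j] <- running
  }

  # n[i, j] = total number of zero-time entries in row i (same in every column)
  n <- matrix(running, nrow(lt), ncol(lt))

  # Position k of n maps to lo + (hi - lo) * (k - 1) / (n - 1);
  # pmax() guards n == 1, which then yields lo, as seq(lo, hi, length.out = 1) does
  val <- lo + (hi - lo) * (k - 1) / pmax(n - 1, 1)

  lt[z] <- val[z]
  lt
}

# Toy logtime matrix with 2, 3 and 1 zero-time events per row
lt <- rbind(
  c(-39.6, -33.4,  Inf, Inf),
  c(-18.8,   Inf,  Inf, Inf),
  c(-20.0, -10.0, -5.0, Inf)
)
fixed <- fix_zero_times(lt)
```

With 50 time columns the loop body runs 50 times regardless of the number of rows, which is the main reason to prefer it over a per-row seq() call.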

Look at the vectorized built-in strptime function and the related POSIXct and POSIXlt classes. You can actually apply strptime over a vector and convert it to numeric. There are many built-in time handling functions. – John, Dec 27 '11 at 21:29
John, no, I tried that weeks ago, *`as.POSIXlt()`* conversion discards milliseconds (chron and zoo appear similarly unusable). – smci, Dec 27 '11 at 22:34
I found [this](http://stackoverflow.com/questions/2150138/how-to-parse-milliseconds-in-r) in my first Google search attempt. (It was the very first result, in fact.) – joran, Dec 27 '11 at 22:40
In fact as.POSIXlt does store ms but does not display them unless you use options('digits.secs'=3) (which I already had in my code, but commented out) – smci, Dec 27 '11 at 23:08
The zero-time fixup code (not the strptime part) is now the slowest part; no one has attempted that yet. Please post your actual runtime on a 200,000 * 50 dataframe or similar. I currently get > 12 hrs. – smci, Dec 28 '11 at 18:55

Dirk Eddelbuettel · Answer 1 · 2011-12-27T22:57:05.557

2

You may be overcomplicating things.

Start with base classes which do milliseconds very well (and on appropriate operating systems even microseconds) but note that

you need to set options("digits.secs"=6) (that's the max that can be displayed) to see them displayed
you need an additional parsing character for strptime et al

all of which is in the docs, and countless examples here on SO.

Quick examples:

R> someTime <- ISOdatetime(2011, 12, 27, 2, 3, 4.567)
R> someTime
[1] "2011-12-27 02:03:04.567 CST"
R> now <- Sys.time()
R> now
[1] "2011-12-27 16:48:20.247298 CST"      # microsecond display on Linux
R> 
R> txt <- "2001-02-03 04:05:06.789123"
R> strptime(txt, "%Y-%m-%d %H:%M:%OS")    # note the %OS (letter O) for sub-seconds
[1] "2001-02-03 04:05:06.789123"
R>

And key functions such as strptime or as.POSIXct are all vectorised and you can throw entire columns at them.
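Applied to a column of "HH:MM:SS.sss" strings like those in the question, that could look roughly like the sketch below. The base date "2001-01-01" and the UTC time zone are arbitrary assumptions, chosen only so that strptime() has a full date-time to parse; the variable names are illustrative.

```r
# Show three fractional digits when printing; restored at the end
op <- options(digits.secs = 3)

# One column of time strings (values taken from the question's sample)
tcol <- c("08:00:41.465", "08:00:50.573", "08:00:54.452")

# Assumption: any fixed nominal date works, it cancels out in the subtraction
base <- "2001-01-01"

# strptime() is vectorized: the whole column is parsed in one call,
# and %OS reads the fractional seconds
pt <- as.POSIXct(strptime(paste(base, tcol), "%Y-%m-%d %H:%M:%OS", tz = "UTC"))

# Numeric seconds since midnight of the nominal date
midnight <- as.POSIXct(base, tz = "UTC")
secs <- as.numeric(difftime(pt, midnight, units = "secs"))

options(op)
```

The resulting `secs` vector is plain numeric, so the subtraction and log steps from the question can be applied to it directly.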

edited Dec 27 '11 at 22:57

answered Dec 27 '11 at 22:51

Dirk Eddelbuettel


Thanks, but the dataset comes as a csv; if you read in the dates and do strptime in colClasses, I seem to recall it blows up memory hugely. I'll retry it in slices. A second reason it's undesirable to store times with a fake date (e.g. "2001-02-03") is that I print them, compare them (< > ==), manipulate them, aggregate by them, use them as graph or histogram labels, etc. But yeah, I can use *`as.numeric(strptime(...) - origin_date)`* – smci Dec 27 '11 at 23:12
You *can* and maybe *should* do arithmetic on `Date` and `POSIXct` classes. If you only have hours:mins:secs, maybe append them onto a nominal base date of 2001-01-01 or something. The memory issue is known, that sadly seems to be the price we have to pay for the general usability of `strptime`. – Dirk Eddelbuettel Dec 27 '11 at 23:16
The zero-time fixup code is now the slow part; would you please address that too? Doing the zero-time fixup row-wise takes > 12 hrs. Please post your actual runtime on a 200,000 * 50 dataframe or similar. – smci Dec 28 '11 at 18:54
