Setup

I have a column of durations stored as strings in a data frame. I want to convert them to an appropriate time object, probably POSIXlt. Most of the strings are easy to parse using this method:

> data <- data.frame(time.string = c(
+   "1 d 2 h 3 m 4 s",
+   "10 d 20 h 30 m 40 s",
+   "--"))
> data$time.span <- strptime(data$time.string, "%j d %H h %M m %S s")
> data$time.span
[1] "2012-01-01 02:03:04" "2012-01-10 20:30:40" NA

Missing durations are coded "--" and need to be converted to NA - this already happens but should be preserved.

The challenge is that the string drops zero-valued elements. Thus the desired value 2012-01-01 02:00:14 would be the string "1 d 2 h 14 s". However, this string parses to NA with the simple parser:

> data2 <- data.frame(time.string = c(
+  "1 d 2 h 14 s",
+  "10 d 20 h 30 m 40 s",
+  "--"))
> data2$time.span <- strptime(data2$time.string, "%j d %H h %M m %S s")
> data2$time.span
[1] NA "2012-01-10 20:30:40" NA

Questions

  1. What is the "R Way" to handle all the possible string formats? Perhaps test for and extract each element individually, then recombine?
  2. Is POSIXlt the right target class? I need a duration free from any specific start time, so the addition of false year and month data (2012-01-) is troubling.
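
A minimal base-R sketch of the "extract each element individually, then recombine" idea from question 1. The helper names get.part and parse.duration are hypothetical (not from any package), and the sketch assumes "1 d" means a full 24 hours; if the day field is really a day-of-year where day 1 is the baseline, subtract one from the day count. It avoids strptime entirely, so no year or month is attached to the result.

```r
# Hypothetical helper: number in front of a unit letter, 0 when the unit is absent.
get.part <- function(x, unit) {
  pattern <- paste("^.*?([0-9]+) *", unit, "( .*)?$", sep = "")
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(0, length(x))
  out[hit] <- as.numeric(sub(pattern, "\\1", x[hit], perl = TRUE))
  out
}

# Hypothetical parser: recombine the parts into a difftime measured in hours.
# Assumption: "1 d" counts as 24 hours. Strings without digits ("--") become NA.
parse.duration <- function(x) {
  x <- as.character(x)
  secs <- get.part(x, "d") * 86400 + get.part(x, "h") * 3600 +
          get.part(x, "m") * 60 + get.part(x, "s")
  secs[!grepl("[0-9]", x)] <- NA
  as.difftime(secs / 3600, units = "hours")
}

spans <- parse.duration(c("1 d 2 h 14 s", "10 d 20 h 30 m 40 s", "--"))
```

Because the function is vectorised, it can be applied to a whole column at once, e.g. `data$time.span <- parse.duration(data$time.string)`.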

Solution

@mplourde definitely had the right idea with dynamic creation of a formatting string based on testing various conditions in the date format. The addition of cut(Sys.Date(), breaks='years') as the baseline for difftime() was also good, but failed to account for a critical quirk in as.POSIXct(). Note: I'm using base R 2.11; this may have been fixed in later versions.

The output of as.POSIXct() changes dramatically depending on whether or not a date component is included:

> x <- "1 d 1 h 14 m 1 s"
> y <-     "1 h 14 m 1 s"  # Same string, no date component
> format (x)  # as specified below
[1] "%j d %H h %M m %S s"
> format (y)
[1] "%H h %M m %S s"
> as.POSIXct(x,format=format)  # Including the date baselines at year start
[1] "2012-01-01 01:14:01 EST"
> as.POSIXct(y,format=format)  # Excluding the date baselines at today start
[1] "2012-06-26 01:14:01 EDT"
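
The transcript above abbreviates the format strings. A self-contained sketch of the same behaviour, with the formats written out and the time zone pinned to UTC (an assumption made here so the result does not depend on the local zone), might look like:

```r
fmt.day    <- "%j d %H h %M m %S s"   # format with a day-of-year component
fmt.no.day <- "%H h %M m %S s"        # same format without it

with.day <- as.POSIXct("1 d 1 h 14 m 1 s", format = fmt.day,    tz = "UTC")
no.day   <- as.POSIXct("1 h 14 m 1 s",     format = fmt.no.day, tz = "UTC")

format(with.day, "%m-%d %H:%M:%S")  # falls on day 1 of the current year
format(no.day,   "%m-%d %H:%M:%S")  # falls on the current day, same clock time
```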

Thus the second argument for the difftime function should be:

  • The start of the first day of the current year if the input string has a day component
  • The start of the current day if the input string does not have a day component

This can be accomplished by changing the unit parameter on the cut function:

parse.time <- function (x) {
  x <- as.character (x)
  break.unit <- ifelse(grepl("d",x),"years","days")  # chooses cut() unit
  format <- paste(c(if (grepl("d", x)) "%j d",
                    if (grepl("h", x)) "%H h",
                    if (grepl("m", x)) "%M m",
                    if (grepl("s", x)) "%S s"), collapse=" ")

  if (nchar(format) > 0) {
    difftime(as.POSIXct(x, format=format), 
             cut(Sys.Date(), breaks=break.unit),
             units="hours")
  } else {NA}

}
SuperAce99
  • This might give some direction: http://stackoverflow.com/questions/1828206/why-is-parsing-y-m-using-strptime-in-r-giving-an-na-result-but-y-m-d-w – David J. Jun 21 '12 at 21:39
  • This doesn't solve your problem, but you are having problems with [strptime](http://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html) because it is not designed to parse durations; it is intended to parse timestamps. (Some durations look like timestamps, some don't.) – David J. Jun 21 '12 at 21:45
  • @DavidJames Ok, that makes sense. Would you agree with @mplourde that it's better to format first, then cast using `as.difftime()`? – SuperAce99 Jun 22 '12 at 13:32
  • If you start with a string, you have to parse it first, by definition. :) Then it only makes sense to pick the type to cast it to -- and `difftime` makes sense (lubridate uses it as well). – David J. Jun 22 '12 at 13:46

2 Answers

difftime objects are time duration objects that can be added to either POSIXct or POSIXlt objects. Maybe you want to use this instead of POSIXlt?
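
A small illustration of that point, using only base R. The start time below is an arbitrary value chosen for the example:

```r
span  <- as.difftime(26, units = "hours")               # a duration with no start date
start <- as.POSIXct("2012-01-01 00:00:00", tz = "UTC")  # arbitrary example start time

finish <- start + span                        # a difftime can be added to a POSIXct
span.mins <- as.numeric(span, units = "mins") # the same span expressed in minutes
```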

Regarding the conversion from strings to time objects, you could do something like this:

data <- data.frame(time.string = c(
    "1 d 1 h",
    "30 m 10 s",
    "1 d 2 h 3 m 4 s",
    "2 h 3 m 4 s",
    "10 d 20 h 30 m 40 s",
    "--"))

f <- function(x) {
    x <- as.character(x)
    format <- paste(c(if (grepl('d', x)) '%j d',
                      if (grepl('h', x)) '%H h',
                      if (grepl('m', x)) '%M m',
                      if (grepl('s', x)) '%S s'), collapse=' ')

    if (nchar(format) > 0) {
        if (grepl('%j d', format)) {
            # '%j 1' is day 0. We add a day so that x = '1 d' means 24hrs.
            difftime(as.POSIXct(x, format=format) + as.difftime(1, units='days'), 
                    cut(Sys.Date(), breaks='years'),
                    units='hours')
        } else {
            as.difftime(x, format, units='hours')
        }
    } else { NA }
}

data$time.span <- sapply(data$time.string, FUN=f)
Matthew Plourde
  • Here is another way to calculate `format` in the above: `library(gsubfn); format <- paste(strapply(x, "[dhms]", list(d = "%j d", h = "%H h", m = "%M m", s = "%S s"))[[1]], collapse = " ")` – G. Grothendieck Jun 22 '12 at 00:25
  • @mplourde Thank you for your detailed reply, I'm working to implement & test. I'm still getting a feel for how `paste()` and `sapply()` are used in R, so I need to dig into exactly how this works. – SuperAce99 Jun 22 '12 at 13:42
  • A solid answer. I tested and it worked for me. Yes, `difftime` is the best data type for handling durations. – David J. Jun 22 '12 at 13:58
  • I've got this solution working with a few tweaks, mainly adding a units argument for consistency: `as.difftime(x, format=format, units="hours")`. Weirdly, it produces some *negative* `difftime` values, which is illegal for a duration. I'm investigating which cases cause this behavior. – SuperAce99 Jun 22 '12 at 14:50
  • I've updated the solution to handle your Julian days correctly. – Matthew Plourde Jun 22 '12 at 22:26
  • The revised code eliminates the negative values, but seems to simply take their absolute value. My `DiffTime` variable jumps from ~1600 to ~4200, which doesn't make sense given the data. Looking into this. @mplourde sorry to leave your great response unselected so long, I'm waiting until I confirm the solution. – SuperAce99 Jun 26 '12 at 19:34
  • hmm...I get the right time spans for the samples provided. Which of your `time.string`s lead to the unexpected values? – Matthew Plourde Jun 26 '12 at 19:41
  • @mplourde see updated main post for description of the problem. It took me a while to find the problem records. Your updated solution worked for ~96% of my 10k records. I'm pretty new to R, so building the diagnostics was slow. – SuperAce99 Jun 26 '12 at 22:16
  • Unless '1 d 1 h' and '1 h' are supposed to be the same time spans, '%j' is not the correct formatting parameter for this. Does `any(grepl('366 d', data$time.string))` return `TRUE`? If not, '1 d 1 h' and '1 h' are probably meant to have different meanings. For the sake of completeness, I've edited my response to show a way to treat them differently. – Matthew Plourde Jun 27 '12 at 05:05

I think you will have better luck with lubridate:

From Dates and Times Made Easy with lubridate:

5.3. Durations

...

The length of a duration is invariant to leap years, leap seconds, and daylight savings time because durations are measured in seconds. Hence, durations have consistent lengths and can be easily compared to other durations. Durations are the appropriate object to use when comparing time based attributes, such as speeds, rates, and lifetimes. lubridate uses the difftime class from base R for durations. Additional difftime methods have been created to facilitate this.

...

Duration objects can be easily created with the helper functions dyears(), dweeks(), ddays(), dhours(), dminutes(), and dseconds(). The d in the title stands for duration and distinguishes these objects from period objects, which are discussed in Section 5.4. Each object creates a duration in seconds using the estimated relationships given above.

That said, I haven't (yet) found a function to parse a string into a duration.


You might also take a look at Ruby's Chronic to see how elegant time parsing can be. I haven't found a library like this for R.

David J.
  • These links are useful, thanks. For the moment I'm limited to using base R 2.11. Frustrating, but a constraint I have to live with. Fortunately I don't have a natural language requirement at the moment. I'm interested in trying a project like that in the future though, Chronic may be a useful way to go. [Recorded Future](https://www.recordedfuture.com/this-is-recorded-future/how-recorded-future-works/) is one firm working in this area; interesting to see where you can take it when it works. – SuperAce99 Jun 22 '12 at 13:47