4

I am an R beginner and have got stuck with this seemingly simple problem. I have a large data frame with 4 columns; id, date of observation, a value (alb) and an end date. A single id may have between 1 and 15 or so observations at different dates. The end date is the time of event or censoring with one per id.

    id        date  alb         end
  1143  2010-03-23   41  2010-12-15
  1143  2010-06-29   39  2010-12-15
  1144  2008-01-01   34  2009-08-06 
  1145  2010-03-23   42  2012-10-25 
  1145  2011-01-12   45  2012-10-25

For survival analysis using alb as a time varying covariate I am trying to create an episode for each observation with a start and stop time column. I am trying to create a column where the stop time is the start time for the next alb observation or the end time if there is no further alb observation for that id. Like so:

    id        date  alb         end       start        stop
  1143  2010-03-23   41  2010-12-15  2010-03-23  2010-06-29
  1143  2010-06-29   39  2010-12-15  2010-06-29  2010-12-15
  1144  2008-01-01   34  2009-08-06  2008-01-01  2009-08-06
  1145  2010-03-23   42  2012-10-25  2010-03-23  2011-01-12
  1145  2011-01-12   45  2012-10-25  2011-01-12  2012-10-25

I am getting stuck with creating a column of stop times. I got in a mess trying to make a function with nested if else statements. Does anyone have a simple approach? Thanks in advance!

in reply to r2evans, this is a large portion of the data.frame where some of the values from the dplyr action return 1970-01-01. (the full data frame is about 130,000 rows). Thanks

  id       date  alb        end
1143 2010-03-23 41.0 1996-08-10
1143 2010-06-29 39.0 1996-08-10
1143 2011-01-12 42.0 1996-08-10
1143 2010-09-28 47.0 1996-08-10
1143 2011-07-19 40.0 1996-08-10
1143 2012-06-12 41.0 1996-08-10
1143 2013-06-25 40.0 1996-08-10
1143 2013-12-26 40.0 1996-08-10
1143 2014-06-15 40.0 1996-08-10
1143 2014-12-26 39.9 1996-08-10
1144 2008-01-01 34.0 2015-04-28
1145 2010-03-23 42.0 2015-04-28
1145 2012-01-13 44.0 2015-04-28
1145 2012-06-15 41.0 2015-04-28
  • Do these answers suffice? If so, please accept one (it's customary on SO). – r2evans Jun 01 '15 at 19:43
  • 1
    For future reference: it's better to use `dput` function to share example of data (see http://stackoverflow.com/q/5963269/168747). – Marek Jun 08 '15 at 21:36

2 Answers2

3

Your data:

data.frame(
    id=c(        1143,         1143,         1144,         1145,         1145 ),
  date=c("2010-03-23", "2010-06-29", "2008-01-01", "2010-03-23", "2011-01-12" ),
   alb=c(          41,           39,           34,           42,           45 ),
   end=c("2010-12-15", "2010-12-15", "2009-08-06", "2012-10-25", "2012-10-25" )
)

One technique is to use dplyr:

library(dplyr)
df %>%
    group_by(id) %>%
    mutate(start=date, stop=lead(start, default=end[1]))
## Source: local data frame [5 x 6]
## Groups: id
## 
##     id       date alb        end      start       stop
## 1 1143 2010-03-23  41 2010-12-15 2010-03-23 2010-06-29
## 2 1143 2010-06-29  39 2010-12-15 2010-06-29 2010-12-15
## 3 1144 2008-01-01  34 2009-08-06 2008-01-01 2009-08-06
## 4 1145 2010-03-23  42 2012-10-25 2010-03-23 2011-01-12
## 5 1145 2011-01-12  45 2012-10-25 2011-01-12 2012-10-25

If you'd rather do it with base functions only:

do.call('rbind', by(df, df$id, function(x) {
    cbind(x, start=x$date, stop=lead(x$date, default=x$end[1]))
}))
##          id       date alb        end      start       stop
## 1143.1 1143 2010-03-23  41 2010-12-15 2010-03-23 2010-06-29
## 1143.2 1143 2010-06-29  39 2010-12-15 2010-06-29 2010-12-15
## 1144   1144 2008-01-01  34 2009-08-06 2008-01-01 2009-08-06
## 1145.4 1145 2010-03-23  42 2012-10-25 2010-03-23 2011-01-12
## 1145.5 1145 2011-01-12  45 2012-10-25 2011-01-12 2012-10-25
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • A similar approach with `data.table` `setDT(dat)[, c('start', 'stop') := list(date,shift(date, type='lead',fill=end[1L])), by=id]` – akrun Jun 01 '15 at 05:39
  • thanks @r2evans, the dplyr solution is easy to read and works with the exception that the last stop date in the series of observations for each ID appears as 1970-01-01 04:35:35 (something to do with POSIXct ?). In anycase it is very easy to replace with the right end date now. – Martin Wolley Jun 02 '15 at 02:21
  • Your data doesn't show that anomaly. Can you provide a small dataset that shows that problem? – r2evans Jun 02 '15 at 02:28
  • @MartinWolley, I run the code with your newer data and see no "1970" in my results. I recognize that you cannot paste all of your data, but I also cannot help if I cannot see the problem. If you can deal with it manually, that's great, no need to drag this out unless you want to fix it programmatically. – r2evans Jun 02 '15 at 05:16
  • Thanks. Problem seemed to disappear when I made all the dates as.Date first rather than the POSIXct they were coded as before. – Martin Wolley Jun 03 '15 at 00:14
1

Sample data:

dat<-read.table(text="
    id        date  alb         end
  1143  2010-03-23   41  2010-12-15
  1143  2010-06-29   39  2010-12-15
  1144  2008-01-01   34  2009-08-06 
  1145  2010-03-23   42  2012-10-25 
  1145  2011-01-12   45  2012-10-25", header=TRUE, stringsAsFactors=FALSE)

Solution:

dat$start <- dat$date
dat$stop[!duplicated(dat$id, fromLast = TRUE)] <- dat$end[!duplicated(dat$id, fromLast = TRUE)]
dat$stop[duplicated(dat$id, fromLast = TRUE)] <- dat[duplicated(dat$id), "date"]

dat
#    id       date alb        end      start       stop
#1 1143 2010-03-23  41 2010-12-15 2010-03-23 2010-06-29
#2 1143 2010-06-29  39 2010-12-15 2010-06-29 2010-12-15
#3 1144 2008-01-01  34 2009-08-06 2008-01-01 2009-08-06
#4 1145 2010-03-23  42 2012-10-25 2010-03-23 2011-01-12
#5 1145 2011-01-12  45 2012-10-25 2011-01-12 2012-10-25
Jota
  • 17,281
  • 7
  • 63
  • 93