-3

I have a large secondary data frame with survival observation data (multiple entries for each subject ID). I'm trying to figure out which subjects had their last observation recorded before the end of the study observation period (e.g. before week 100 in the case of this study). Essentially, I'm trying to find out who was lost to follow-up. Is there a function that does this? I'm sorry if a similar question has already been answered, but I couldn't think of terms technically specific enough to find anything in a web search. I have basic literacy in R but I don't have a really strong technical background. Thank you for your time and help!

In the excerpt from the data frame below, there is one subject whose last observation falls before week 105 (at week 104).

    structure(list(ID = c(140L, 140L, 141L, 142L, 142L, 143L, 143L, 
144L, 144L, 144L, 144L), WEEK = c(40L, 105L, 105L, 11L, 105L, 
103L, 104L, 37L, 48L, 65L, 105L), OBSDATE = structure(c(40L, 
107L, 107L, 11L, 107L, 105L, 106L, 37L, 48L, 65L, 107L), .Label = c("2002-12-29", 
"2003-01-05", "2003-01-12", "2003-01-19", "2003-01-26", "2003-02-02", 
"2003-02-09", "2003-02-16", "2003-02-23", "2003-03-02", "2003-03-09", 
"2003-03-16", "2003-03-23", "2003-03-30", "2003-04-06", "2003-04-13", 
"2003-04-20", "2003-04-27", "2003-05-04", "2003-05-11", "2003-05-18", 
"2003-05-25", "2003-06-01", "2003-06-08", "2003-06-15", "2003-06-22", 
"2003-06-29", "2003-07-06", "2003-07-13", "2003-07-20", "2003-07-27", 
"2003-08-03", "2003-08-10", "2003-08-17", "2003-08-24", "2003-08-31", 
"2003-09-07", "2003-09-14", "2003-09-21", "2003-09-28", "2003-10-05", 
"2003-10-12", "2003-10-19", "2003-10-26", "2003-11-02", "2003-11-09", 
"2003-11-16", "2003-11-23", "2003-11-30", "2003-12-07", "2003-12-14", 
"2003-12-21", "2003-12-28", "2004-01-04", "2004-01-11", "2004-01-18", 
"2004-01-25", "2004-02-01", "2004-02-08", "2004-02-15", "2004-02-22", 
"2004-02-29", "2004-03-07", "2004-03-14", "2004-03-21", "2004-03-27", 
"2004-03-28", "2004-04-04", "2004-04-11", "2004-04-18", "2004-04-25", 
"2004-05-02", "2004-05-09", "2004-05-16", "2004-05-23", "2004-05-30", 
"2004-06-06", "2004-06-10", "2004-06-13", "2004-06-20", "2004-06-27", 
"2004-07-04", "2004-07-11", "2004-07-18", "2004-07-25", "2004-08-01", 
"2004-08-08", "2004-08-15", "2004-08-22", "2004-08-29", "2004-09-05", 
"2004-09-12", "2004-09-19", "2004-09-26", "2004-10-03", "2004-10-10", 
"2004-10-17", "2004-10-24", "2004-10-31", "2004-11-07", "2004-11-14", 
"2004-11-21", "2004-11-28", "2004-12-05", "2004-12-12", "2004-12-19", 
"2004-12-26", "2005-11-24", "2006-11-02", "2007-02-26", "2009-05-18", 
"2010-08-11", "2011-01-29", "2013-09-06", "2017-04-23", "2017-05-13", 
"2019-05-01", "2022-11-22", "2026-03-20", "2026-08-15", "2028-09-26", 
"2030-02-08", "2034-08-30", "2035-01-22", "2035-10-14", "2037-09-20", 
"2038-05-09", "2043-01-31", "2043-08-19", "2045-03-29", "2046-05-15", 
"2050-03-06", "2053-10-15", "2054-05-22", "2056-06-09", "2060-03-13", 
"2061-04-15", "2061-08-30", "2062-07-10"), class = "factor")), .Names = c("ID", 
"WEEK", "OBSDATE"), row.names = 231:241, class = "data.frame")
  • Make a reproducible example. This won't be hard, but it will depend on how your data is structured. [See here for tips](http://stackoverflow.com/q/5963269/903061), like using `dput()`. – Gregor Thomas Jul 19 '15 at 23:19
  • Thank you @Gregor I have added a reproducible data frame excerpt with one instance of a last observation before the end of study (<105). – Michael Ruderman Jul 21 '15 at 17:49

2 Answers

0

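One way to approach this is with an old function I used to use for analysing controlled trial studies. It numbers each subject's observations in time order (1 for the first visit, 2 for the second, and so on).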

followup <- function(id, time) {
  if (length(id) != length(time)) stop("The length of these two variables must be equal")
  if (any(duplicated(paste(id, time)))) stop("The combination of id and time must be unique")
  # Compute the (id, time) ordering once so both vectors use the same permutation
  ord <- order(id, time)
  # Run lengths of the sorted ids = number of observations per subject
  runs <- rle(as.vector(id[ord]))
  # Number the visits 1..n within each subject, in sorted order
  visit <- sequence(runs$lengths)
  # Map the visit numbers back to the original row order
  visit[order(ord)]
}
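As a cross-check on what the numbering should look like, here is a minimal base R sketch that uses `ave()` and `rank()` rather than the function above. The toy data frame and its column names are made up for illustration.

```r
# Hypothetical toy data: two subjects, rows deliberately out of time order
toy <- data.frame(
  id   = c("A", "B", "A", "B", "A"),
  time = as.Date(c("2014-06-28", "2011-09-07", "2013-01-15",
                   "2014-12-24", "2015-05-15"))
)

# Rank each observation by time within its subject;
# ave() returns the result in the original row order
toy$visit <- ave(as.numeric(toy$time), toy$id, FUN = rank)

toy
```

Subject A's earliest date sits in row 3, so that row gets visit 1 even though it is not the first A row in the data.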

As you didn't give any clue about your data, here I'm simulating some:

# Simulate 50 observations for subjects labelled A-Z
data <- data.frame(ID = sample(LETTERS, 50, replace = TRUE),
                   variable = rnorm(50, mean = 50, sd = 10))

# Draw one random date per row of `data` between start.day and end.day
rand.date <- function(start.day, end.day, data) {
  days <- seq.Date(as.Date(start.day), as.Date(end.day), by = "day")
  sample(days, nrow(data), replace = TRUE)
}
data$date <- rand.date("2010-01-01", "2015-07-18", data)

> data
   ID variable       date
1   L 52.75080 2010-12-28
2   W 51.36106 2011-11-24
3   S 46.52550 2011-06-19
4   S 64.37270 2013-06-18
5   X 68.47047 2015-03-17
6   Y 44.52643 2010-11-18
7   O 51.61603 2015-04-13
.....          ......

# Executing the function:
data$follow<- followup(data$ID, data$date)
> data
   ID variable       date follow
1   L 52.75080 2010-12-28      1
2   W 51.36106 2011-11-24      2
3   S 46.52550 2011-06-19      1
4   S 64.37270 2013-06-18      2
5   X 68.47047 2015-03-17      3
6   Y 44.52643 2010-11-18      1
7   O 51.61603 2015-04-13      2
8   C 60.06102 2014-06-22      3

So all you have to do is sort the data frame by ID and then by the follow column; the row with the highest follow value for each subject is the last time that subject was seen in the study.

library(dplyr)
> data %>% group_by(ID) %>% arrange(ID, follow)
Source: local data frame [50 x 4]
Groups: ID

    ID variable       date follow
1   A 61.75308 2014-06-28      1
2   A 32.19119 2015-05-15      2
3   B 45.40385 2011-09-07      1
4   B 52.31812 2014-12-24      2
5   C 50.75906 2014-06-09      1
6   C 34.27607 2012-10-29      2
7   C 60.06102 2014-06-22      3
8   D 61.69071 2014-06-17      1
9   D 51.49701 2014-05-22      2
.. ..      ...        ...    ...
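To go from there to the subjects who were lost to follow-up, one option is to take each subject's latest date and compare it with the end of the study. This is only a sketch: the toy data, the `study_end` cutoff and the `last_date` and `lost` column names are assumptions for illustration, not part of the simulated data above.

```r
library(dplyr)

# Hypothetical toy data in the same shape as the simulated data (ID + date)
obs <- data.frame(
  ID   = c("A", "A", "B", "B", "C"),
  date = as.Date(c("2014-06-28", "2015-05-15", "2011-09-07",
                   "2014-12-24", "2012-10-29"))
)

# Assumed end of the observation period
study_end <- as.Date("2015-01-01")

# One row per subject: the last observation date and whether it
# falls before the end of the study
last_seen <- obs %>%
  group_by(ID) %>%
  summarize(last_date = max(date)) %>%
  mutate(lost = last_date < study_end)

last_seen
```

Here subjects B and C are flagged, because their last dates fall before the assumed cutoff.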
daniel
  • This is a really large data frame (over 13,000 columns); I was hoping for a solution that might return the index positions of all subjects IDs that meet the criteria for last observation date <105 weeks (or before 2004-12-26). I added a selection of the data in the original post. Thank you so much for your help. – Michael Ruderman Jul 21 '15 at 18:12
  • @MichaelRuderman "I was hoping for a solution that might return the index positions of all subjects IDs"... that's the kind of thing (desired output) that belongs in your question, not buried in a comment. – Gregor Thomas Jul 21 '15 at 18:13
0

Using your provided data (and calling it dat):

library(dplyr)
group_by(dat, ID) %>%
    summarize(censored = max(WEEK) < 105)

# Source: local data frame [5 x 2]
# 
#    ID censored
# 1 140    FALSE
# 2 141    FALSE
# 3 142    FALSE
# 4 143     TRUE
# 5 144    FALSE
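If you'd rather not depend on dplyr, the same per-subject check can be sketched in base R with `tapply()`. The data frame below just retypes the ID and WEEK columns from the question's excerpt, and the 105-week cutoff is the one used above.

```r
# ID and WEEK columns retyped from the excerpt in the question
dat <- data.frame(
  ID   = c(140L, 140L, 141L, 142L, 142L, 143L, 143L, 144L, 144L, 144L, 144L),
  WEEK = c(40L, 105L, 105L, 11L, 105L, 103L, 104L, 37L, 48L, 65L, 105L)
)

# Last observed week for each subject (named by ID)
last_week <- tapply(dat$WEEK, dat$ID, max)

# TRUE for subjects whose last observation is before week 105
censored <- as.vector(last_week < 105)

# IDs of the censored subjects
names(last_week)[censored]
```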

If you want the index in the original data of the subject IDs that are censored:

cens_id = group_by(dat, ID) %>%
        summarize(censored = max(WEEK) < 105) %>%
        filter(censored)

which(dat$ID %in% cens_id$ID)
# [1] 6 7
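One thing to watch: `which()` returns row positions, not row names, and the excerpt in the question has row names running from 231 to 241. A base R sketch of the difference, again retyping the ID and WEEK columns from the question (the name `lost` is just for illustration):

```r
# ID and WEEK retyped from the question, keeping its row names 231:241
dat <- data.frame(
  ID   = c(140L, 140L, 141L, 142L, 142L, 143L, 143L, 144L, 144L, 144L, 144L),
  WEEK = c(40L, 105L, 105L, 11L, 105L, 103L, 104L, 37L, 48L, 65L, 105L),
  row.names = 231:241
)

# For every row, the last week observed for that row's subject
lost <- ave(dat$WEEK, dat$ID, FUN = max) < 105

which(lost)           # positions within this excerpt
rownames(dat)[lost]   # the corresponding row names
```

If the excerpt's row names match the row positions in the full data frame, `rownames(dat)[lost]` is the one that points back to the original rows.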
Gregor Thomas