0

I have a set of dates and times for several individuals (ID) that correspond to our primary outcome measure (Y) and a covariate (X1).

My objective is to replace missing X1 values for each of the Y rows if the X1 measurement was recorded within a +/- 24 hour period from the date/time that the Y variable was measured. To make this easier to visualize (and load into R), here is how the data are currently arranged:

structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L), TIME = structure(1:15, .Label = c("01/01/2013 12:01", 
"01/03/2013 08:49", "01/03/2013 20:52", "02/01/2013 05:00", "02/03/2013 05:30", 
"02/03/2013 21:14", "02/05/2013 05:15", "02/12/2013 05:03", "02/15/2013 04:16", 
"02/16/2013 04:12", "02/16/2013 21:02", "03/01/2010 17:58", "03/02/2010 00:10", 
"03/03/2010 10:45", "03/04/2010 09:00"), class = "factor"), Y = structure(c(1L, 
5L, 7L, 1L, 1L, 2L, 1L, 1L, 1L, 4L, 3L, 1L, 8L, 1L, 6L), .Label = c(".", 
"22", "35", "4", "5", "6", "8", "9"), class = "factor"), X1 = structure(c(2L, 
1L, 1L, 7L, 7L, 1L, 4L, 4L, 3L, 1L, 1L, 6L, 1L, 5L, 1L), .Label = c(".", 
"0.1", "0.2", "0.4", "0.6", "0.9", "1.0"), class = "factor")), .Names = c("ID", 
"TIME", "Y", "X1"), class = "data.frame", row.names = c(NA, -15L))

To simplify the desired output, I would like to only display the rows with non-missing Y values, such that the end product would look like this:

  ID             TIME  Y  X1
1  1 01/03/2013 08:49  5   .
2  1 01/03/2013 20:52  8   .
3  2 02/03/2013 21:14 22   .
4  2 02/16/2013 04:12  4 0.2
5  2 02/16/2013 21:02 35   .
6  3 03/02/2010 00:10  9 0.9
7  3 03/04/2010 09:00  6 0.6

Is it possible to (1) iterate across multiple rows and evaluate the absolute value of 24 hours to get the difference between the X1 and Y measurements and (2) to replace the missing values of X1 with those that are within the +/- 24 hour window?

Any thoughts on how to go about this would be greatly appreciated!

Entropy
  • 378
  • 6
  • 16
  • What if there is more than one `X1` value within 24 hours of a missing `X1`? – Blue Magister Oct 25 '13 at 01:41
  • It is good manners to add a [reproducible example of your code](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your post, using `dput`. – Blue Magister Oct 25 '13 at 01:54
  • Where is the 0.2 in the third row of the output coming from? – TheComeOnMan Oct 25 '13 at 05:19
  • My apologies -- you are absolutely right. The '0.2' in the third row is wrong and should be changed to missing '.'. This has now been fixed. I apologise for the confusion induced by my manual review of these. Clearly, I need an algorithmic solution... – Entropy Oct 25 '13 at 15:17
  • If there is more than 1 `X1` values within 24 hours, I would favour taking the value that was closest in time to the `Y` measurement. – Entropy Oct 25 '13 at 15:18
  • Thanks for the `dput` link. I have never used that before but will look into it and see if I can edit this post to include a reproducible example. Thanks! – Entropy Oct 25 '13 at 15:19

1 Answers1

0

if you convert your data into xts then you can use xts's easy subsetting feature to get what you want.

PS: following code will work if you have exactly 1 value of X1 within 24 hour period of Y measurement.

require(xts)
xx <- xts(DF[, c(1, 4, 5)], as.POSIXct(paste0(DF$Date, " ", DF$TIME), format = "%m/%d/%Y %H:%M"))


sapply(index(xx[!is.na(xx$Y)]), FUN = function(tt) {
    startTime <- tt - 24 * 60 * 60
    endTime <- tt + 24 * 60 * 60
    y <- xx[paste(startTime, endTime, sep = "/")]
    if (nrow(y[!is.na(y$X1), "X1"]) != 0) {
        return(as.vector(y[!is.na(y$X1), "X1"]))
    } else {
        return(NA)
    }

})
## [1] 0.9 0.6  NA  NA 1.0 0.2  NA


xx[!is.na(xx$Y), "X1"] <- sapply(index(xx[!is.na(xx$Y)]), FUN = function(tt) {
    startTime <- tt - 24 * 60 * 60
    endTime <- tt + 24 * 60 * 60
    y <- xx[paste(startTime, endTime, sep = "/")]
    if (nrow(y[!is.na(y$X1), "X1"]) != 0) {
        return(as.vector(y[!is.na(y$X1), "X1"]))
    } else {
        return(NA)
    }

})

xx[!is.na(xx$Y), "X1"]
##                      X1
## 2010-03-02 00:10:00 0.9
## 2010-03-04 09:00:00 0.6
## 2013-01-03 08:49:00  NA
## 2013-01-03 20:52:00  NA
## 2013-02-03 21:14:00 1.0
## 2013-02-16 04:12:00 0.2
## 2013-02-16 21:02:00  NA
CHP
  • 16,981
  • 4
  • 38
  • 57
  • This looks really good -- thank you! On the last step, where we try to remove the missing Y values I get the following error: `Error in xx[!is.na(xx$Y), "X1"] : incorrect number of dimensions`. There is a second warning that also reads: `In is.na(xx$Y) : is.na() applied to non-(list or vector) of type 'NULL'`. Do you have any idea as to how I might be able to replicate your results? Thanks! – Entropy Oct 25 '13 at 16:27