I think examples are easier to understand, so here is how to generate a small fake data set:

library(tidyr)
day_event<- as.Date("2017-03-01") + 0:6
a<-rep(1,7)
b<-as.numeric(c("", rep(1,6)))
c<-as.numeric(c("","",rep(1,5)))
df_1<-data.frame(day_event,a,b,c)
names(df_1)[2]<-"2017-03-08"
names(df_1)[3]<-"2017-03-09"
names(df_1)[4]<-"2017-03-10"

> df_1
  day_event 2017-03-08 2017-03-09 2017-03-10
1  2017-03-01          1         NA         NA
2  2017-03-02          1          1         NA
3  2017-03-03          1          1          1
4  2017-03-04          1          1          1
5  2017-03-05          1          1          1
6  2017-03-06          1          1          1
7  2017-03-07          1          1          1

I receive the data in the df_2 (long) format, but with tidyr I can convert from one format to the other:

df_2<-gather(df_1, day_measure, measure, -day_event)
> df_2
 day_event  day_measure measure
1   2017-03-01 2017-03-08       1
2   2017-03-02 2017-03-08       1
3   2017-03-03 2017-03-08       1
4   2017-03-04 2017-03-08       1
5   2017-03-05 2017-03-08       1
6   2017-03-06 2017-03-08       1
7   2017-03-07 2017-03-08       1
8   2017-03-01 2017-03-09      NA
9   2017-03-02 2017-03-09       1
10  2017-03-03 2017-03-09       1
11  2017-03-04 2017-03-09       1
12  2017-03-05 2017-03-09       1
13  2017-03-06 2017-03-09       1
14  2017-03-07 2017-03-09       1
15  2017-03-01 2017-03-10      NA
16  2017-03-02 2017-03-10      NA
17  2017-03-03 2017-03-10       1
18  2017-03-04 2017-03-10       1
19  2017-03-05 2017-03-10       1
20  2017-03-06 2017-03-10       1
21  2017-03-07 2017-03-10       1

For context, each row is a measure of an event that happened on day_event. Depending on the day the measure is taken (day_measure), the measure of the same event can be different.

My problem is that I only measure events up to seven days back: that's why the measure on day_measure = '2017-03-09' for day_event = '2017-03-01' is NA.

I would like to replace this NA with the last measure performed (7 days after day_event), in this case the measure made on '2017-03-08'.

I tried:

for (i in 1:length(df_2$measure)){
    row<- df_2[i,]
    if (row$day_event +7 < row$day_measure & length(df_2[df_2$day_event == row$day_event & df_2$day_measure == row$day_event + 7,]$measure)>0){
      row$measure<-df_2[df_2$day_event == row$day_event & df_2$day_measure == row$day_event + 7,]$measure
      df_2[i,]<-row
    }
}

It worked :) But on my real data set, which is larger, it takes forever :(

I think R doesn't like such loops! Can you think of another method?

Thanks for your help!

Nicolas N
  • Maybe the fill function also from the tidyr package will do the trick – Dave2e Apr 13 '17 at 16:40
  • It is known that indexing R data.frames is slow, and that for-loops are very slow in R, so the combination of the two would make for a slow process. It is better practice to use the `apply` family of functions. See: http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r – Jon Apr 18 '17 at 15:28
  • Would _last observation carried forward_ do it for you? For instance, `na.locf()` from the `zoo` package? – Uwe Apr 19 '17 at 16:41
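The `fill` suggestion from the comments can be sketched as follows. This is only a sketch under assumptions: it rebuilds the example data directly in long format with `day_measure` as a `Date`, and it carries the last non-missing measure forward within each `day_event`, which matches the "measure from day_event + 7" rule only when that measure is the last one available for the event.

```r
library(dplyr)
library(tidyr)

# Example data in long format (same values as df_2 in the question)
day_event <- as.Date("2017-03-01") + 0:6
df_2 <- data.frame(
  day_event   = rep(day_event, times = 3),
  day_measure = rep(as.Date("2017-03-08") + 0:2, each = 7),
  measure     = c(rep(1, 7), NA, rep(1, 6), NA, NA, rep(1, 5))
)

# Sort so that "down" means "later measure date", then fill within each event
filled <- df_2 %>%
  arrange(day_event, day_measure) %>%
  group_by(day_event) %>%
  fill(measure, .direction = "down") %>%
  ungroup()
```

Grouping by `day_event` keeps a value from one event from leaking into the next one.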

3 Answers

I'm sharing what someone from my organization answered. Yes, the solution was to use apply. Here is how:

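library(dplyr)

# In the example data gather() returns day_measure as character;
# the date arithmetic below needs a Date
df_2$day_measure <- as.Date(df_2$day_measure)

# Rows measured more than 7 days after the event (these hold the NAs)
df_temp <- df_2 %>% 
  dplyr::filter(day_event < day_measure - 7)

# For each such row, look up the measure taken exactly 7 days after the event
df_temp$measure <- apply(X = df_temp
                         , MARGIN = 1
                         , FUN = function(x) {
                            (df_2 %>% dplyr::filter(
                              day_event == x[[1]] & day_measure == (as.Date(x[[1]], format = "%Y-%m-%d") + 7)
                              ))$measure
                    })

# Put the untouched rows and the repaired rows back together
df_2 <- rbind(df_2 %>% dplyr::filter(day_event >= day_measure - 7)
              , df_temp
)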

My sample was only 42k rows, but the for loop takes several hours. This solution takes about 30s.
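The per-row lookup above can also be written as a single vectorised step in base R. The following is a minimal sketch, not the method from the answer: it assumes each (day_event, day_measure) pair occurs at most once, rebuilds the example data with `day_measure` as a `Date`, and uses the helper names `key`, `want`, `idx` and `late`, which are made up for illustration.

```r
# Example data in long format (same values as df_2 in the question)
day_event <- as.Date("2017-03-01") + 0:6
df_2 <- data.frame(
  day_event   = rep(day_event, times = 3),
  day_measure = rep(as.Date("2017-03-08") + 0:2, each = 7),
  measure     = c(rep(1, 7), NA, rep(1, 6), NA, NA, rep(1, 5))
)

# One text key per row, and the key of the row we want to copy from
key  <- paste(df_2$day_event, df_2$day_measure)
want <- paste(df_2$day_event, df_2$day_event + 7)

# Position of the "day_event + 7" row for every row (NA if it does not exist)
idx <- match(want, key)

# Only overwrite rows measured later than day_event + 7 that have a source row
late <- df_2$day_measure > df_2$day_event + 7 & !is.na(idx)
df_2$measure[late] <- df_2$measure[idx[late]]
```

Because `match()` is called once on whole vectors, there is no per-row filtering of the data frame.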

Uwe
Nicolas N

This task is known as last observation carried forward, and there are functions built for exactly this purpose. One of them is na.locf() from the zoo package.

With that, the complete issue becomes a one-liner (I'm using data.table here because I'm more fluent in it and it's usually faster with larger data sets):

library(data.table)
setDT(df_2)[order(day_event, day_measure), measure := zoo::na.locf(measure), by = day_event]

Here, the rows are ordered by event date and then by measure date. Then, missing elements are filled by last observation carried forward. In addition, the whole operation is grouped by event date so that no value from a previous event is carried forward if the first measurement in a group is already NA. Note that na.locf() drops leading NAs by default (na.rm = TRUE), so if a group can start with NA, pass na.rm = FALSE to keep the result the same length as the group.
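The effect of a leading NA on `na.locf()` can be seen on a small vector. This is just an illustrative sketch with made-up values, not data from the question:

```r
library(zoo)

x <- c(NA, 1, NA, 2, NA)

# Default (na.rm = TRUE): the leading NA is dropped, so the result is shorter
short <- na.locf(x)

# na.rm = FALSE: the leading NA is kept, so the length matches the input
full <- na.locf(x, na.rm = FALSE)

length(short)
length(full)
```

In a grouped assignment such as the one-liner above, a result shorter than the group would not line up with the rows, which is why `na.rm = FALSE` is the safer choice when a group may start with NA.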

This is even faster than the OP's own answer, which can be demonstrated by a benchmark (using the microbenchmark package).

Benchmark results

#Unit: milliseconds
#    expr       min        lq      mean    median        uq       max neval cld
#    loop 20.867890 22.037188 23.052667 22.665122 23.510681 27.535109   100   c
#   apply  9.011630  9.498314  9.834324  9.752323  9.994688 12.862594   100  b 
# na.locf  1.971389  2.132780  2.211467  2.226080  2.290762  2.656973   100 a  

Benchmark code

As all 3 methods change the data in place we need to keep a copy of the original data.

library(data.table)
df_0 <- copy(df_2)
library(tidyr)

microbenchmark::microbenchmark(
  loop = {
    df_2 <- copy(df_0)
    for (i in 1:length(df_2$measure)){
      row <- df_2[i,]
      if (row$day_event +7 < row$day_measure & length(df_2[df_2$day_event == row$day_event & df_2$day_measure == row$day_event + 7,]$measure)>0){
        row$measure<-df_2[df_2$day_event == row$day_event & df_2$day_measure == row$day_event + 7,]$measure
        df_2[i,]<-row
      }
    }
  },
  apply = {
    df_2 <- copy(df_0)
    df_temp <- df_2 %>% 
      dplyr::filter(day_event < day_measure - 7)

    df_temp$measure <- apply(X = df_temp
                             , MARGIN = 1
                             , FUN = function(x) {
                               (df_2 %>% dplyr::filter(
                                 day_event == x[[1]] & day_measure == (as.Date(x[[1]], format = "%Y-%m-%d") + 7)
                               ))$measure

                             })

    df_2 <- rbind(df_2 %>% dplyr::filter(day_event >= day_measure - 7)
                  , df_temp
    )

  },
  na.locf = {
    df_2 <- copy(df_0)
    df_2[order(day_event, day_measure), measure := zoo::na.locf(measure), by = day_event]
  })
Uwe

I'm adding another solution proposed by someone else in my organisation. This solution is based on dplyr and seems to be faster than the apply solution I gave last week.

library(tidyr)
day_event<- as.Date("2017-03-01") + 0:6
a<-rep(1,7)
b<-as.numeric(c("", rep(1,6)))
c<-as.numeric(c("","",rep(1,5)))
df_1<-data.frame(day_event,a,b,c)
names(df_1)[2]<-"2017-03-08"
names(df_1)[3]<-"2017-03-09"
names(df_1)[4]<-"2017-03-10"

df_1

df_2<-gather(df_1, day_measure, measure, -day_event)



fill_measure <- function(day_event, day_measure, measure){
  # return a modified measure vector
  # day_event should have only a single value here

  # test if correct day_measure exist
  if (any(day_measure == day_event + 7)){
    rst <- measure
    rst[day_measure > day_event + 7] <- measure[day_measure == day_event + 7]
  }else{
    rst <- measure
  }

  return(rst)
}

test <- df_2 %>% 
  dplyr::group_by(day_event) %>% 
  dplyr::mutate(measure_new = fill_measure(day_event, day_measure, measure)) %>% 
  dplyr::ungroup()
Uwe
Nicolas N