2

My goal is to obtain the cum mean (and cumsd) of a dataframe while ignoring NA and filling those with the previous cum means:

df:

var1   var2  var3   
x1     y1    z1    
x2     y2    z2
NA     NA    NA
x3     y3    z3    

cummean:

var1           var2         var3   
 x1/1          y1/1          z1/1    
(x1+x2)/2     (y1+y2)/2     (z1+z2)/2
(x1+x2)/2     (y1+y2)/2     (z1+z2)/2
(x1+x2+x3)/3  (y1+y2+y3)/3  (z1+z2+z3)/3 

So for row 3 where df has NA, I want the new matrix to contain the cum mean from the line above (numerator should not increase).

So far, I am using this to compute the cum mean (I am aware that somewhere a baby seal gets killed because I used a for loop and not something from the apply family)

for(i in names(df){
  df[i][!is.na(df[i])] <- GMCM:::cummean(df[i][!is.na(df[i])])
}

I have also tried this:

setDT(posRegimeReturns)    
cols<-colnames((posRegimeReturns))    
posRegimeReturns[, (cols) := lapply(.SD,  cummean) , .SD = cols]

But both of those leave the NAs empty.

Note: this question is similar to this post Calculate cumsum() while ignoring NA values but unlike the solution there, I don't want to leave the NAs but rather fill those with the same values as the last row above that was not NA.

Niccola Tartaglia
  • 1,537
  • 2
  • 26
  • 40

2 Answers2

2

You might want to use the definition of variance to calculate this

library(data.table)
dt <- data.table(V1=c(1,2,NA,3), V2=c(1,2,NA,3), V3=c(1,2,NA,3))

cols <- copy(names(dt))

#means
dt[ , paste0("mean_",cols) := lapply(.SD, function(x) {
    #get the num of non-NA observations
    lens <- cumsum(!is.na(x))

    #set NA to 0 before doing cumulative sum
    x[is.na(x)] <- 0
    cumsum(x) / lens
}), .SDcols=cols]

#sd
dt[ , paste0("sd_",cols) := lapply(.SD, function(x) {
    lens <- cumsum(!is.na(x))
    x[is.na(x)] <- 0

    #use defn of variance mean of sum of squares minus square of means and also n-1 in denominator
    sqrt(lens/(lens-1) * (cumsum(x^2)/lens - (cumsum(x) / lens)^2))
}), .SDcols=cols]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
2

Using data table. In particular:

 library(data.table)
 DT <- data.table(z = sample(N),idx=1:N,key="idx")

     z  idx
 1:  4   1
 2: 10   2
 3:  9   3
 4:  6   4
 5:  1   5
 6:  8   6
 7:  3   7
 8:  7   8
 9:  5   9  
10:  2  10

We now make use of the use of -apply function and data.table.

DT[,cummean:=sapply(seq(from=1,to=nrow(DT)) ,function(iii) mean(DT$z[1:iii],na.rm = TRUE))]
DT[,cumsd:=sapply(seq(from=1,to=nrow(DT)) ,function(iii) sd(DT$z[1:iii],na.rm = TRUE))]

resulting in:

             z idx  cummean    cumsd
         1:  4   1 4.000000       NA
         2: 10   2 7.000000 4.242641
         3:  9   3 7.666667 3.214550
         4:  6   4 7.250000 2.753785
         5:  1   5 6.000000 3.674235
         6:  8   6 6.333333 3.386247
         7:  3   7 5.857143 3.338092
         8:  7   8 6.000000 3.116775
         9:  5   9 5.888889 2.934469
        10:  2  10 5.500000 3.027650
amonk
  • 1,769
  • 2
  • 18
  • 27