Rolling mean (moving average) by group/id with dplyr

Question

I have a longitudinal follow-up of blood pressure recordings.

The value at a certain point is less predictive than is the moving average (rolling mean), which is why I'd like to calculate it. The data looks like

test <- read.table(header=TRUE, text = "
  ID  AGE   YEAR_VISIT  BLOOD_PRESSURE  TREATMENT
  1 20  2000    NA 3
  1 21  2001    129 2
  1 22  2002    145 3
  1 22  2002    130 2
  2 23  2003    NA  NA
  2 30  2010    150 2
  2 31  2011    110 3
  4 50  2005    140 3
  4 50  2005    130 3
  4 50  2005    NA  3
  4 51  2006    312 2
  5 27  2010    140 4
  5 28  2011    170 4
  5 29  2012    160 NA
  7 40  2007    120 NA
                   ")

I'd like to calculate a new variable, called BLOOD_PRESSURE_UPDATED. This variable should be the moving average for BLOOD_PRESSURE and have the following characteristics:

A moving average is the current value plus the previous value divided by two.
For the first observation, the BLOOD_PRESSURE_UPDATED is just the current BLOOD_PRESSURE. If that is missing, BLOOD_PRESSURE_UPDATED should be the overall mean.
Missing values should be filled in with nearest previous value.

I've tried the following:

test2 <- test %>%
  group_by(ID) %>%
  arrange(ID, YEAR_VISIT) %>%
  mutate(BLOOD_PRESSURE_UPDATED = rollmean(x=BLOOD_PRESSURE, 2)) %>%
ungroup()

I have also tried rollaply and rollmeanr without succeeding.

When calculating moving average, number of elements returned is less than number of rows of the data, i.e. only "n-1" elements are returned. Thus may be causing the problem here. Or would you consider adding the moving average column separately, like: test2$BLOOD_PRESSURE_UPDATED <- with(test2, c(mean(BLOOD_PRESSURE, na.rm = T), rollapply(BLOOD_PRESSURE, 2, mean, na.rm = T))) — KFB, Oct 05 '14 at 03:40
Thanks for the effort KFB. Unfortunately it did not work. I tried a few edited versions as well. Perhaps the zoo-functions are not suitable for this? I have coded the following that does work: test5 <- test test5$UM <- rep(NA, nrow(test5)) test5$first <- !duplicated(test5$ID) for(i in 1:nrow(test5)){ if(test5$first[i]){ test5$UM[i] <- test5$BLOOD_PRESSURE[i] }else{ test5$UM[i] <- mean(c(test5$BLOOD_PRESSURE[i] , test5$UM[i-1]), na.rm=TRUE) } } test5 But it's unbelievably slow. — Adam Robinsson, Oct 05 '14 at 07:09
Just curious, why use two kinds of missing value imputation? It seems like the imputation for the missing value of the first observation (impute using current one or overall group mean) and other observations (impute use the nearest previous one) are different. — Jia Gao, Jul 21 '22 at 23:56

hyunwoo jeong · Answer 1 · 2016-10-05T01:48:58.903

35

How about this?

    library(dplyr)   
    test2<-arrange(test,ID,YEAR_VISIT) %>% 
           mutate(lag1=lag(BLOOD_PRESSURE),
                  lag2=lag(BLOOD_PRESSURE,2),
                  movave=(lag1+lag2)/2)

Another solution using 'rollapply' function in zoo package (I like more)

library(dplyr)
library(zoo)
test2<-arrange(test,ID,YEAR_VISIT) %>%
       mutate(ma2=rollapply(BLOOD_PRESSURE,2,mean,align='right',fill=NA))

edited Oct 05 '16 at 01:48

answered Mar 07 '16 at 06:46

hyunwoo jeong

1,534
1
15
14

1

Note that the align argument could be dropped if rollapplyr were used. – G. Grothendieck Sep 21 '17 at 18:42
What about if I wanted to do a mean of the values 15th through 18th place from the current one? – SqueakyBeak Apr 05 '22 at 17:32
1

Always use `dplyr::mutate` to avoid unexpected results. – Jia Gao Jul 21 '22 at 23:43

score 27 · Answer 2 · answered May 24 '20 at 19:45

slider is a 'new-er' alternative that plays nicely with the tidyverse.

Something like this would do the trick

test2 <- test %>%
  group_by(ID) %>%
  arrange(ID, YEAR_VISIT) %>%
  mutate(BLOOD_PRESSURE_UPDATED = slider::slide_dbl(BLOOD_PRESSURE, mean, .before = 1, .after = 0)) %>%
ungroup()

jlhoward · Accepted Answer · 2014-10-05T18:57:19.227

If you are not committed to to dplyr this should work:

get.mav <- function(bp,n=2){
  require(zoo)
  if(is.na(bp[1])) bp[1] <- mean(bp,na.rm=TRUE)
  bp <- na.locf(bp,na.rm=FALSE)
  if(length(bp)<n) return(bp)
  c(bp[1:(n-1)],rollapply(bp,width=n,mean,align="right"))  
}
test <- with(test,test[order(ID,YEAR_VISIT),])

test$BLOOD_PRESSURE_UPDATED <- 
  unlist(aggregate(BLOOD_PRESSURE~ID,test,get.mav,na.action=NULL,n=2)$BLOOD_PRESSURE)
test
#    ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT BLOOD_PRESSURE_UPDATED
# 1   1  20       2000             NA         3               134.6667
# 2   1  21       2001            129         2               131.8333
# 3   1  22       2002            145         3               137.0000
# 4   1  22       2002            130         2               137.5000
# 5   2  23       2003             NA        NA               130.0000
# 6   2  30       2010            150         2               140.0000
# 7   2  31       2011            110         3               130.0000
# ...

This works for moving averages > 2 as well.

And here's a data.table solution, which is likely to be much faster if your dataset is large.

library(data.table)
setDT(test)     # converts test to a data.table in place
setkey(test,ID,YEAR_VISIT)
test[,BLOOD_PRESSURE_UPDATED:=as.numeric(get.mav(BLOOD_PRESSURE,2)),by=ID]
test
#    ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT BLOOD_PRESSURE_UPDATED
#  1:  1  20       2000             NA         3               134.6667
#  2:  1  21       2001            129         2               131.8333
#  3:  1  22       2002            145         3               137.0000
#  4:  1  22       2002            130         2               137.5000
#  5:  2  23       2003             NA        NA               130.0000
#  6:  2  30       2010            150         2               140.0000
#  7:  2  31       2011            110         3               130.0000
# ...

Thanks @jlhoward! - it solved the problem but the data.table method (which was the faster of the two) was very slow (3 million rows, 15 minutes on a new MBP). But nevertheless, problem solved. — Adam Robinsson, Oct 05 '14 at 20:34
Thanks @jlhoward. This saved me long computation time...I was using ddply earlier and the time was real bad! — EsBee, Oct 22 '15 at 17:44
@jhoward new `frollmean` function should work as a drop-in replacement of `get.mav` here. more info in [`?froll`](https://rdatatable.gitlab.io/data.table/library/data.table/html/froll.html). — jangorecki, Dec 22 '18 at 15:11
or maybe not, not sure about this 1st element NA manipulation — jangorecki, Dec 22 '18 at 15:12

score 7 · Answer 4 · edited Jul 04 '19 at 14:49

7

Try this:

library(dplyr)
library(zoo)
test2<-arrange(test,ID,YEAR_VISIT) %>% group_by(subject)%>%
       mutate(ma2=rollapply(BLOOD_PRESSURE,2,mean,align='right',fill=NA))

edited Jul 04 '19 at 14:49

Brett Rigby

6,101
10
46
76

answered Jul 04 '19 at 14:02

pier

81
1
3

2

You an also use the `rollmean` function in the last line instead: `rollmean(BLOOD_PRESSURE,2,align='right',fill=NA)` – Angie Aug 07 '19 at 13:21

Rolling mean (moving average) by group/id with dplyr

4 Answers4

Linked

Related