
Given some data like the following:

library(tibble)
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))

# A tibble: 12 x 2
#   class  value
#   <chr>  <dbl>
# 1 a     -1.21 
# 2 a      0.277
# 3 a      1.08 
# 4 a     -2.35 
# 5 a      0.429
# 6 a      0.506
# 7 b      0.943
# 8 b      0.945
# 9 b      0.944
#10 b      0.911
#11 b      0.952
#12 b      0.900

I'm trying to generate a new column (context) that contains the average of "value" over the X preceding and X following rows, when possible. Ideally this would be computed separately for each level of a factor held in a different column (here, class). For example, for X=2, I would expect something like the following:

# A tibble: 12 x 3
#   class  value  context
#   <chr>  <dbl>  <dbl>
# 1 a     -1.21     NA
# 2 a      0.277    NA
# 3 a      1.08     -0.7135
# 4 a     -2.35     0.573
# 5 a      0.429    NA
# 6 a      0.506    NA
# 7 b      0.943    NA
# 8 b      0.945    NA
# 9 b      0.944    0.9377
#10 b      0.911    0.9353
#11 b      0.952    NA
#12 b      0.900    NA

Note that for the first two rows it is not possible to generate the context value in this case, because they do not have X=2 preceding rows. The value -0.7135 at row 3 is the average of rows 1, 2, 4 and 5.

Similarly, rows 5 and 6 do not have a value of context, because these do not have two values afterwards belonging to the same level of the factor "class" (because row 7 is class="b" while 5 and 6 are class="a").
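To make the rule concrete, here is a minimal sketch of the calculation for a single row. It assumes the rounded class "a" values printed above, and `context_at()` is a hypothetical helper name used only for illustration.

```r
# Class "a" values, rounded as printed in the tibble above.
value_a <- c(-1.21, 0.277, 1.08, -2.35, 0.429, 0.506)
X <- 2

# context_at() is a hypothetical helper used only for this illustration.
context_at <- function(v, i, X) {
  # No value when fewer than X rows exist on either side within the group.
  if (i - X < 1 || i + X > length(v)) return(NA_real_)
  # Indices of the X rows before and the X rows after row i.
  neighbours <- c((i - X):(i - 1), (i + 1):(i + X))
  mean(v[neighbours])
}

ctx3 <- context_at(value_a, 3, X)  # averages rows 1, 2, 4 and 5
ctx5 <- context_at(value_a, 5, X)  # would need a 7th row in class "a", so NA
print(c(ctx3, ctx5))
```

Row 3 gives -0.7135 with these rounded inputs, and row 5 gives NA because only one row of class "a" follows it.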

I do not know if this is even possible in R. I haven't found any similar questions, and I can only come up with solutions like the following one, which I don't think is idiomatic for this language.

My solution:

library(dplyr)
X <- 2
df_list <- df %>% dplyr::group_split(class)
result <- tibble()
for (i in 1:length(df_list)) {
  tmp <- df_list[[i]]
  context <- vector()
  for (j in 1:nrow(tmp)) {
    if (j<=X | j>nrow(tmp)-X) context <- c(context, NA)
    else {
      values <- vector()
      for (k in 1:X) {
        values <- c(values, tmp$value[j-k], tmp$value[j+k])
      }
      context <- c(context, mean(values))
    }
  }
  tmp <- tmp %>% dplyr::mutate(context=context)
  result <- result %>% dplyr::bind_rows(tmp)
}

This gives an approximate solution to the one above (differences are due to rounding). But again, this approach lacks flexibility, e.g. if we want to create several columns at once for different values of X. Are there R functions designed to solve tasks like this one (e.g. vectorized functions)?

elcortegano

3 Answers

# pipes ('%>%'), tibble() and grouping come from the tidyverse
library(tidyverse)
# rolling sum function from the zoo package
library(zoo)

# this is your data frame
set.seed(1234)
df <- tibble(class = rep(c("a","b"), each=6), value = c(rnorm(n=6, mean=0, sd=1), rnorm(n=6, mean=1, sd=0.1)))

df %>% # take df
    group_by(class) %>% # group it by class
    # centered rolling sum over 5 rows, minus the row's own value, divided by the 4 neighbours
    mutate(context = (rollsum(value, 5, fill = NA) - value) / 4)

Basically, you calculate a rolling sum with a window width of 5 that is centered (the default alignment) and fill the positions without a full window with NAs. Since the row's own value must not be included in the average, it is subtracted from the sum, and the result is divided by the 4 remaining neighbours.
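The same idea extends to any X: use a window of width 2 * X + 1 and divide by 2 * X. Below is a minimal sketch of that generalisation, which also builds one column per window size as the question asks. It swaps `zoo::rollsum()` for base R's `stats::filter()` so that it runs without extra packages, uses the rounded values printed in the question, and `neighbour_mean()` and the `context_<X>` column names are hypothetical.

```r
# Rounded values copied from the question's printed tibble.
df <- data.frame(
  class = rep(c("a", "b"), each = 6),
  value = c(-1.21, 0.277, 1.08, -2.35, 0.429, 0.506,
            0.943, 0.945, 0.944, 0.911, 0.952, 0.900)
)

# neighbour_mean() is a hypothetical helper name, not part of any package.
# It averages the X values before and the X values after each element.
neighbour_mean <- function(v, X) {
  width <- 2 * X + 1
  # Groups shorter than the window cannot produce any value.
  if (length(v) < width) return(rep(NA_real_, length(v)))
  # Centered rolling sum; positions without a full window become NA.
  roll_sum <- as.numeric(stats::filter(v, rep(1, width), sides = 2))
  # Drop the center value from the sum, then average the 2 * X neighbours.
  (roll_sum - v) / (2 * X)
}

# One context column per window size, computed within each class.
for (X in 1:2) {
  df[[paste0("context_", X)]] <- ave(
    df$value, df$class,
    FUN = function(v) neighbour_mean(v, X)
  )
}

print(df)
```

Because `ave()` applies the helper within each class, the window never crosses from class "a" into class "b".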

Georgery

One way using dplyr (with map_dbl() from purrr):

n <- 2
library(dplyr)
library(purrr)

df %>%
  group_by(class) %>%
  mutate(context = map_dbl(row_number(), ~ if(.x <= n | .x > (n() - n)) 
         NA else mean(value[c((.x - n):(.x - 1), (.x + 1) : (.x + n))])))

#  class  value context
#  <chr>  <dbl>   <dbl>
# 1 a     -1.21   NA    
# 2 a      0.277  NA    
# 3 a      1.08   -0.712
# 4 a     -2.35    0.574
# 5 a      0.429  NA    
# 6 a      0.506  NA    
# 7 b      0.943  NA    
# 8 b      0.945  NA    
# 9 b      0.944   0.938
#10 b      0.911   0.935
#11 b      0.952  NA    
#12 b      0.900  NA    
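
A fully vectorized variant of the same calculation is also possible with `dplyr::lag()` and `dplyr::lead()`, which shift a column within each group and pad with NA. The following is only a sketch under assumptions: it uses the rounded values printed in the question, and the `result` name is illustrative.

```r
library(dplyr)

# Rounded values copied from the question's printed tibble.
df <- tibble(
  class = rep(c("a", "b"), each = 6),
  value = c(-1.21, 0.277, 1.08, -2.35, 0.429, 0.506,
            0.943, 0.945, 0.944, 0.911, 0.952, 0.900)
)

X <- 2

result <- df %>%
  group_by(class) %>%
  mutate(
    # For each offset k in 1..X, add the value k rows before and k rows after.
    # Missing neighbours are NA, so the sum is NA near the group edges.
    context = Reduce(
      `+`,
      lapply(seq_len(X), function(k) dplyr::lag(value, k) + dplyr::lead(value, k))
    ) / (2 * X)
  ) %>%
  ungroup()

print(result)
```

Since NA propagates through the sum, rows without X neighbours on both sides come out as NA without an explicit boundary check.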
Ronak Shah

Here is a base R solution using ave(), i.e.,

df <- within(df,
       context <- ave(value,
                      class,
                      FUN = function(v, X = 2) sapply(seq_along(v), function(k) {
                        # NA unless X rows exist on both sides within the group
                        if (k - X < 1 || k + X > length(v)) NA_real_
                        else mean(v[c(k - (X:1), k + (1:X))])
                      })))

such that

> df 
# A tibble: 12 x 3
   class  value context
   <chr>  <dbl>   <dbl>
 1 a     -1.21   NA    
 2 a      0.277  NA    
 3 a      1.08   -0.712
 4 a     -2.35    0.574
 5 a      0.429  NA    
 6 a      0.506  NA    
 7 b      0.943  NA    
 8 b      0.945  NA    
 9 b      0.944   0.938
10 b      0.911   0.935
11 b      0.952  NA    
12 b      0.900  NA    
ThomasIsCoding