How to compute a custom mean for each row over multiple columns, based on a row-specific criterion?

Question

I want to compute the mean exposure to se ozone from a dataset with the example below. The mean value should be the ozone value from the year of birth to age 5. Is there a simple way to do this in R.

final = data.frame(ID = c(1, 2, 3, 4, 5, 6), 
                   Zone = c("A", "B", "C", "D", "A", "B"), 
                   dob = c(1993, 1997, 1994, 2001, 1999, 1993), 
                   Ozone_1993 = c(0.12, 0.01, 0.36, 0.78, 0.12, 0.01), 
                   Ozone_1994 = c(0.75, 0.23, 0.14, 0.98, 0.75, 0.23), 
                   Ozone_1995 = c(1.38, 0.45, -0.08, 1.18, 1.38, 0.45), 
                   Ozone_1996 = c(2.01, 0.67, -0.3, 1.38, 2.01, 0.67), 
                   Ozone_1997 = c(2.64, 0.89, -0.52, 1.58, 2.64, 0.89), 
                   Ozone_1998 = c(3.27, 1.11, -0.74, 1.78, 3.27, 1.11), 
                   Ozone_1999 = c(3.9, 1.33, -0.96, 1.98, 3.9, 1.33), 
                   Ozone_2000 = c(4.53, 1.55, -1.18, 2.18, 4.53, 1.55), 
                   Ozone_2001 = c(5.16, 1.77, -1.4, 2.38, 5.16, 1.77), 
                   Ozone_2002 = c(5.79, 1.99, -1.62, 2.58, 5.79, 1.99), 
                   Ozone_2003 = c(6.42, 2.21, -1.84, 2.78, 6.42, 2.21), 
                   Ozone_2004 = c(7.05, 2.43, -2.06, 2.98, 7.05, 2.43), 
                   mean_under5_ozone = c(0.85, 1.33, -0.3, 2.68, 5.16, 0.45))

where column (variable) mean_under5_ozone is the mean score of Ozone exposure from birthyear to age 5 or less. e.g mean_under5_ozone for ID 1 is the rowmean from Ozone_1993 to Ozone_1997

From a novice,

Please do not post an image of code/data/errors: it cannot be copied or searched (SEO), it breaks screen-readers, and it may not fit well on some mobile devices. Ref: https://meta.stackoverflow.com/a/285557 (and https://xkcd.com/2116/). Please just include the code, console output, or data (e.g., `dput(head(x))` or `data.frame(...)`) directly. — r2evans, Oct 02 '20 at 18:40
Have you tried anything? It often helps to tailor suggestions to your current coding style (including preferred "dialects" of R, such as base R, `dplyr`, or `data.table`). — r2evans, Oct 02 '20 at 18:42
@r2evans because I am new to R I do must of my coding basic an examples i find on what I want to do. I have used dplyr in my coding but not data.table. Although, someone recently suggested datable may be good given the size of the original data I am working with (over 10 million observations , unique observations would be rather because the data is made up of individuals with multiple record data) — Ekow_ababio, Oct 02 '20 at 19:09
While base R and `dplyr` can both deal with 10M rows, if your work is extensive then `data.table` might offer some advantages, assuming you're comfortable with its nuances. However, being able to do something "slower" (relatively) in `dplyr` is still better than not knowing how well to do something "faster" in `data.table`. — r2evans, Oct 02 '20 at 19:13
@r2evans thanks for the advice. I just edited the question. Is it better now? — Ekow_ababio, Oct 02 '20 at 19:15
@r2evans I've added the intended output as the final column. The figures does not correspond to the correct mean Ozone score — Ekow_ababio, Oct 02 '20 at 19:34
The column (mean_under5_ozone) of the new code has the correct mean ozone score based within 5 year period from the year of birth — Ekow_ababio, Oct 02 '20 at 19:52

r2evans · Answer 1 · 2020-10-02T20:16:35.747

(Complete rewrite.)

I don't think I understand what mean_under5_ozone means, since I can't reproduce your numbers. For instance, for ID==1, born in 1993, that means we want data from 1993 through 1998 (to include age 5) or 1997 (up to but not including), but neither of those averages is 0.85:

mean(unlist(final[1, 4:9]))
# [1] 1.695
mean(unlist(final[1, 4:8]))
# [1] 1.38

Ignoring this, I'll give you what I think are the correct answers with your final data.

tidyverse

library(dplyr)
library(tidyr) # pivot_longer
final <- select(final, -mean_under5_ozone)
final %>%
  pivot_longer(starts_with("Ozone"), names_pattern = "(.*)_(.*)", names_to = c("type", "year")) %>%
  mutate(year = as.integer(year)) %>%
  group_by(ID) %>%
  summarize(mean_under5_ozone = mean(value[ between(year, dob, dob + 5) ]), .groups = "drop")
# # A tibble: 6 x 2
#      ID mean_under5_ozone
#   <dbl>             <dbl>
# 1     1              1.70
# 2     2              1.44
# 3     3             -0.41
# 4     4              2.68
# 5     5              5.48
# 6     6              0.56

data.table

library(data.table)
library(magrittr) # %>%, not required but used for improved readability
finalDT[, mean_under5_ozone := NULL]
melt(finalDT, 1:3) %>%
  .[, year := as.integer(gsub("[^0-9]", "", variable))] %>%
  .[ year >= dob, ] %>%
  .[, .(mean_under5_ozone = mean(value[ between(year, dob, dob + 5) ])), by = .(ID)] %>%
  .[order(ID),]
#    ID mean_under5_ozone
# 1:  1             1.695
# 2:  2             1.440
# 3:  3            -0.410
# 4:  4             2.680
# 5:  5             5.475
# 6:  6             0.560

A few thoughts, using random data.

set.seed(42)
dat <- data.frame(dob = sample(1990:2020, size=1000, replace=TRUE), Ozone_1993=runif(1000), Ozone_1994=runif(1000), Ozone_1995=runif(1000))
head(dat)
#    dob Ozone_1993 Ozone_1994 Ozone_1995
# 1 2006 0.37383448 0.68624969  0.1681480
# 2 1994 0.46496563 0.29309851  0.8198724
# 3 1990 0.04660819 0.41994895  0.7501070
# 4 2014 0.98751620 0.73526105  0.2899959
# 5 1999 0.90845233 0.84982125  0.1798130
# 6 1993 0.97939015 0.07746459  0.6172919

tidyverse

library(dplyr)
dat %>%
  filter(dob >= 2015) %>%
  summarize_at(vars(starts_with("Ozone")), mean)
#   Ozone_1993 Ozone_1994 Ozone_1995
# 1  0.5242029  0.4852803  0.4864364

That is the average per year. If you instead need a single statistic, then

# library(tidyr) # pivot_longer
dat %>%
  filter(dob >= 2015) %>%
  tidyr::pivot_longer(starts_with("Ozone")) %>%
  summarize(value = mean(value))
# # A tibble: 1 x 1
#   value
#   <dbl>
# 1 0.499

data.table

library(data.table)
datDT <- as.data.table(dat)
datDT[ dob >= 2015, ][, lapply(.SD, mean), .SDcols = patterns("^Ozone")]
#    Ozone_1993 Ozone_1994 Ozone_1995
# 1:  0.5242029  0.4852803  0.4864364
melt(datDT[ dob >= 2015, ], "dob")[, .(value = mean(value))]
#        value
# 1: 0.4986398

Base R

apply(subset(dat, dob >= 2015, select = Ozone_1993:Ozone_1995), 2, mean)
# Ozone_1993 Ozone_1994 Ozone_1995 
#  0.5242029  0.4852803  0.4864364 
mean(unlist(subset(dat, dob >= 2015, select = Ozone_1993:Ozone_1995)))
# [1] 0.4986398

This is really helpful. Is there a way to capture the single statistics for each individual as a new variable in the dataframe. E.g. in your example dataframe the person with id 1 was born in 2006 hence the summary statistics should be ozone from 2006 to 2010. — Ekow_ababio, Oct 02 '20 at 19:05
I think you need to [edit] your question and make this question a little more *reproducible*, sample *unambiguous* data (e.g., `dput(head(x))` or `data.frame(x=...,y=...)`) and intended output given that input. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. — r2evans, Oct 02 '20 at 19:08
mean_under5_ozone is supposed to be the rowmean of ozone score for a five year period from the year the person was born. You are right values there do not correspond to the actual or valid mean. I inserted fictitious values as an example. Thanks r2evans for the helpful code. I will try both your code and KM_83's code tomorrow on the real data. I am grateful for your help, much appreciated. — Ekow_ababio, Oct 02 '20 at 22:08
when I tried both the tidyverse and data.table versions I got this error message error: expecting a single value: [extent=93] — Ekow_ababio, Oct 04 '20 at 17:35
I have no idea, it doesn't do that for me given this sample data. — r2evans, Oct 04 '20 at 18:34

KM_83 · Accepted Answer · 2020-10-02T20:05:43.910

Here is one way to do it with for loops. (It's not very elegant, but it avoids getting into too much details of dplyr and rlang syntax.)

loop over birth years (dob_yr below) to define a column containing variable names to use for the custom mean (use_vars below).
loop over rows and for each row, extract relevant variables using this new column (use_vars) and calculate the custom mean.

library(dplyr)

df <- tibble(id=1:5)
df$zone <- c(rep('A', 5))
df$dob_yr <- c(1991:1995)
for (yr in 1991:1995) {
  df[[paste('x_',yr,sep='')]] <- c(abs(rnorm(5)))
}
df # check mock data 

add_use_vars <- function(df, dob_yr_varname='dob_yr', prefix='x_', yr_within=3) {
  vars <- names(df %>% select(starts_with(prefix)))
  vars_yr <- as.integer(sub(prefix, '', vars))
  df$use_vars <- NA
  for (i in seq_along(df[[dob_yr_varname]])) {
      yr  <- df[[dob_yr_varname]][i]
      idx <- (vars_yr <= yr + yr_within) & (vars_yr >= yr)
      df$use_vars[i] <- list(vars[idx]) # list with one element 
  }
  return(df)
}

df <- add_use_vars(df)

df$use_vars[1][[1]] # see the first row in use_vars

custom_mean <- function(df, varname_varlist='use_vars') {
  df$custom_mean <- NA
  for (i in seq_along(df[[varname_varlist]])) {
    vars = df[[varname_varlist]][i][[1]] # extract first element in list
    df$custom_mean[i] <- mean(as.numeric(df[i, vars]))
  }
  return(df)
}
df <- custom_mean(df)

df # see results

Note that for this mock data, for each row, I am averaging over the columns containing value of 0 to 3 years from the birth year.

awesome! the function result in the example data gives the exact output I need. I will try it on the real dataset tomorrow. A million thanks — Ekow_ababio, Oct 02 '20 at 20:54

How to compute a custom mean for each row over multiple columns, based on a row-specific criterion?

2 Answers2

tidyverse

data.table

tidyverse

data.table

Base R

Linked