0

I want to compute the mean exposure to se ozone from a dataset with the example below. The mean value should be the ozone value from the year of birth to age 5. Is there a simple way to do this in R.

final = data.frame(ID = c(1, 2, 3, 4, 5, 6), 
                   Zone = c("A", "B", "C", "D", "A", "B"), 
                   dob = c(1993, 1997, 1994, 2001, 1999, 1993), 
                   Ozone_1993 = c(0.12, 0.01, 0.36, 0.78, 0.12, 0.01), 
                   Ozone_1994 = c(0.75, 0.23, 0.14, 0.98, 0.75, 0.23), 
                   Ozone_1995 = c(1.38, 0.45, -0.08, 1.18, 1.38, 0.45), 
                   Ozone_1996 = c(2.01, 0.67, -0.3, 1.38, 2.01, 0.67), 
                   Ozone_1997 = c(2.64, 0.89, -0.52, 1.58, 2.64, 0.89), 
                   Ozone_1998 = c(3.27, 1.11, -0.74, 1.78, 3.27, 1.11), 
                   Ozone_1999 = c(3.9, 1.33, -0.96, 1.98, 3.9, 1.33), 
                   Ozone_2000 = c(4.53, 1.55, -1.18, 2.18, 4.53, 1.55), 
                   Ozone_2001 = c(5.16, 1.77, -1.4, 2.38, 5.16, 1.77), 
                   Ozone_2002 = c(5.79, 1.99, -1.62, 2.58, 5.79, 1.99), 
                   Ozone_2003 = c(6.42, 2.21, -1.84, 2.78, 6.42, 2.21), 
                   Ozone_2004 = c(7.05, 2.43, -2.06, 2.98, 7.05, 2.43), 
                   mean_under5_ozone = c(0.85, 1.33, -0.3, 2.68, 5.16, 0.45))

where column (variable) mean_under5_ozone is the mean score of Ozone exposure from birthyear to age 5 or less. e.g mean_under5_ozone for ID 1 is the rowmean from Ozone_1993 to Ozone_1997

From a novice,

KM_83
  • 697
  • 3
  • 9
Ekow_ababio
  • 163
  • 9
  • 3
    Please do not post an image of code/data/errors: it cannot be copied or searched (SEO), it breaks screen-readers, and it may not fit well on some mobile devices. Ref: https://meta.stackoverflow.com/a/285557 (and https://xkcd.com/2116/). Please just include the code, console output, or data (e.g., `dput(head(x))` or `data.frame(...)`) directly. – r2evans Oct 02 '20 at 18:40
  • 1
    Have you tried anything? It often helps to tailor suggestions to your current coding style (including preferred "dialects" of R, such as base R, `dplyr`, or `data.table`). – r2evans Oct 02 '20 at 18:42
  • 1
    thank @r2evans for suggestions and tips on posting here. – Ekow_ababio Oct 02 '20 at 19:02
  • @r2evans because I am new to R I do must of my coding basic an examples i find on what I want to do. I have used dplyr in my coding but not data.table. Although, someone recently suggested datable may be good given the size of the original data I am working with (over 10 million observations , unique observations would be rather because the data is made up of individuals with multiple record data) – Ekow_ababio Oct 02 '20 at 19:09
  • While base R and `dplyr` can both deal with 10M rows, if your work is extensive then `data.table` might offer some advantages, assuming you're comfortable with its nuances. However, being able to do something "slower" (relatively) in `dplyr` is still better than not knowing how well to do something "faster" in `data.table`. – r2evans Oct 02 '20 at 19:13
  • @r2evans thanks for the advice. I just edited the question. Is it better now? – Ekow_ababio Oct 02 '20 at 19:15
  • What's your expected output? – r2evans Oct 02 '20 at 19:17
  • @r2evans I've added the intended output as the final column. The figures does not correspond to the correct mean Ozone score – Ekow_ababio Oct 02 '20 at 19:34
  • The column (mean_under5_ozone) of the new code has the correct mean ozone score based within 5 year period from the year of birth – Ekow_ababio Oct 02 '20 at 19:52

2 Answers2

0

(Complete rewrite.)

I don't think I understand what mean_under5_ozone means, since I can't reproduce your numbers. For instance, for ID==1, born in 1993, that means we want data from 1993 through 1998 (to include age 5) or 1997 (up to but not including), but neither of those averages is 0.85:

mean(unlist(final[1, 4:9]))
# [1] 1.695
mean(unlist(final[1, 4:8]))
# [1] 1.38

Ignoring this, I'll give you what I think are the correct answers with your final data.

tidyverse

library(dplyr)
library(tidyr) # pivot_longer
final <- select(final, -mean_under5_ozone)
final %>%
  pivot_longer(starts_with("Ozone"), names_pattern = "(.*)_(.*)", names_to = c("type", "year")) %>%
  mutate(year = as.integer(year)) %>%
  group_by(ID) %>%
  summarize(mean_under5_ozone = mean(value[ between(year, dob, dob + 5) ]), .groups = "drop")
# # A tibble: 6 x 2
#      ID mean_under5_ozone
#   <dbl>             <dbl>
# 1     1              1.70
# 2     2              1.44
# 3     3             -0.41
# 4     4              2.68
# 5     5              5.48
# 6     6              0.56

data.table

library(data.table)
library(magrittr) # %>%, not required but used for improved readability
finalDT[, mean_under5_ozone := NULL]
melt(finalDT, 1:3) %>%
  .[, year := as.integer(gsub("[^0-9]", "", variable))] %>%
  .[ year >= dob, ] %>%
  .[, .(mean_under5_ozone = mean(value[ between(year, dob, dob + 5) ])), by = .(ID)] %>%
  .[order(ID),]
#    ID mean_under5_ozone
# 1:  1             1.695
# 2:  2             1.440
# 3:  3            -0.410
# 4:  4             2.680
# 5:  5             5.475
# 6:  6             0.560

A few thoughts, using random data.

set.seed(42)
dat <- data.frame(dob = sample(1990:2020, size=1000, replace=TRUE), Ozone_1993=runif(1000), Ozone_1994=runif(1000), Ozone_1995=runif(1000))
head(dat)
#    dob Ozone_1993 Ozone_1994 Ozone_1995
# 1 2006 0.37383448 0.68624969  0.1681480
# 2 1994 0.46496563 0.29309851  0.8198724
# 3 1990 0.04660819 0.41994895  0.7501070
# 4 2014 0.98751620 0.73526105  0.2899959
# 5 1999 0.90845233 0.84982125  0.1798130
# 6 1993 0.97939015 0.07746459  0.6172919

tidyverse

library(dplyr)
dat %>%
  filter(dob >= 2015) %>%
  summarize_at(vars(starts_with("Ozone")), mean)
#   Ozone_1993 Ozone_1994 Ozone_1995
# 1  0.5242029  0.4852803  0.4864364

That is the average per year. If you instead need a single statistic, then

# library(tidyr) # pivot_longer
dat %>%
  filter(dob >= 2015) %>%
  tidyr::pivot_longer(starts_with("Ozone")) %>%
  summarize(value = mean(value))
# # A tibble: 1 x 1
#   value
#   <dbl>
# 1 0.499

data.table

library(data.table)
datDT <- as.data.table(dat)
datDT[ dob >= 2015, ][, lapply(.SD, mean), .SDcols = patterns("^Ozone")]
#    Ozone_1993 Ozone_1994 Ozone_1995
# 1:  0.5242029  0.4852803  0.4864364
melt(datDT[ dob >= 2015, ], "dob")[, .(value = mean(value))]
#        value
# 1: 0.4986398

Base R

apply(subset(dat, dob >= 2015, select = Ozone_1993:Ozone_1995), 2, mean)
# Ozone_1993 Ozone_1994 Ozone_1995 
#  0.5242029  0.4852803  0.4864364 
mean(unlist(subset(dat, dob >= 2015, select = Ozone_1993:Ozone_1995)))
# [1] 0.4986398
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • This is really helpful. Is there a way to capture the single statistics for each individual as a new variable in the dataframe. E.g. in your example dataframe the person with id 1 was born in 2006 hence the summary statistics should be ozone from 2006 to 2010. – Ekow_ababio Oct 02 '20 at 19:05
  • I think you need to [edit] your question and make this question a little more *reproducible*, sample *unambiguous* data (e.g., `dput(head(x))` or `data.frame(x=...,y=...)`) and intended output given that input. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Oct 02 '20 at 19:08
  • 1
    mean_under5_ozone is supposed to be the rowmean of ozone score for a five year period from the year the person was born. You are right values there do not correspond to the actual or valid mean. I inserted fictitious values as an example. Thanks r2evans for the helpful code. I will try both your code and KM_83's code tomorrow on the real data. I am grateful for your help, much appreciated. – Ekow_ababio Oct 02 '20 at 22:08
  • when I tried both the tidyverse and data.table versions I got this error message error: expecting a single value: [extent=93] – Ekow_ababio Oct 04 '20 at 17:35
  • I have no idea, it doesn't do that for me given this sample data. – r2evans Oct 04 '20 at 18:34
0

Here is one way to do it with for loops. (It's not very elegant, but it avoids getting into too much details of dplyr and rlang syntax.)

  • loop over birth years (dob_yr below) to define a column containing variable names to use for the custom mean (use_vars below).
  • loop over rows and for each row, extract relevant variables using this new column (use_vars) and calculate the custom mean.
library(dplyr)

df <- tibble(id=1:5)
df$zone <- c(rep('A', 5))
df$dob_yr <- c(1991:1995)
for (yr in 1991:1995) {
  df[[paste('x_',yr,sep='')]] <- c(abs(rnorm(5)))
}
df # check mock data 

add_use_vars <- function(df, dob_yr_varname='dob_yr', prefix='x_', yr_within=3) {
  vars <- names(df %>% select(starts_with(prefix)))
  vars_yr <- as.integer(sub(prefix, '', vars))
  df$use_vars <- NA
  for (i in seq_along(df[[dob_yr_varname]])) {
      yr  <- df[[dob_yr_varname]][i]
      idx <- (vars_yr <= yr + yr_within) & (vars_yr >= yr)
      df$use_vars[i] <- list(vars[idx]) # list with one element 
  }
  return(df)
}

df <- add_use_vars(df)

df$use_vars[1][[1]] # see the first row in use_vars

custom_mean <- function(df, varname_varlist='use_vars') {
  df$custom_mean <- NA
  for (i in seq_along(df[[varname_varlist]])) {
    vars = df[[varname_varlist]][i][[1]] # extract first element in list
    df$custom_mean[i] <- mean(as.numeric(df[i, vars]))
  }
  return(df)
}
df <- custom_mean(df)

df # see results

Note that for this mock data, for each row, I am averaging over the columns containing value of 0 to 3 years from the birth year.

KM_83
  • 697
  • 3
  • 9
  • awesome! the function result in the example data gives the exact output I need. I will try it on the real dataset tomorrow. A million thanks – Ekow_ababio Oct 02 '20 at 20:54