1

I have a complex problem and I will be grateful if someone can help me out. I have a dataframe made up of appended survey data for different countries in different years. In the said dataframe, I also have air quality measures for the neighbourhoods where respondents were selected. The air quality data is from 1998 to 2016.

My problem is I want to compute the row mean (or cumulative mean exposures) for each person base on the respondents' age and the air quality data years. My data frame looks like this

dat <- data.frame(ID=c(1:2000), dob = sample(1990:2020, size=2000, replace=TRUE),
              survey_year=rep(c(1998, 2006, 2008, 2014, 2019), times=80, each=5),
              CNT = rep(c('AO', 'GH', 'NG', 'SL', 'UG'), times=80, each=5),
              Ozone_1998=runif(2000), Ozone_1999=runif(2000), Ozone_2000=runif(2000),
              Ozone_2001=runif(2000), Ozone_2002=runif(2000), Ozone_2003=runif(2000),
              Ozone_2004=runif(2000), Ozone_2005=runif(2000), Ozone_2006=runif(2000),
              Ozone_2007=runif(2000), Ozone_2008=runif(2000), Ozone_2009=runif(2000),
              Ozone_2010=runif(2000), Ozone_2011=runif(2000), Ozone_2012=runif(2000),
              Ozone_2013=runif(2000), Ozone_2014=runif(2000), Ozone_2015=runif(2000),
              Ozone_2016=runif(2000))

In the example data frame above, all respondents in country Ao will have their cumulative mean air quality exposures restricted to the Ozone_1998 while respondents in country SL will have their mean calculated based on Ozone_1998 to Ozone_2014.

The next thing is for a person in country SL aged 15 years I want to their cumulative exposure to be from Ozone_2000 to Ozone_2014 (the 15 year period of their life include their birth year). A person aged 16 will have their mean from Ozone_1999 to Ozone_2014 etc.

Is their a way to do this complex task in R?

NB: Although my question is similar to another I posted (see link below), this task is much complex. I tried adapting the solution for my previous question but my attempts did not work. For instance, I tried

dat$mean_exposure = dat %>% pivot_longer(starts_with("Ozone"), names_pattern = "(.*)_(.*)", names_to = c("type", "year")) %>% 
  mutate(year = as.integer(year)) %>% group_by(ID) %>% 
  summarize(mean_under5_ozone = mean(value[ between(year, survey_year,survey_year + 0) ]), .groups = "drop")


but got an error 

*Error: Problem with `summarise()` input `mean_under5_ozone`.
x `left` must be length 1
i Input `mean_under5_ozone` is `mean(value[between(year, survey_year, survey_year + 0)])`.
i The error occurred in group 1: ID = 1.*

Link to the previous question

How to compute a custom mean for each row over multiple columns, based on a row-specific criterion?

Thank you

Ekow_ababio
  • 163
  • 9
  • What about filtering the data first and then just computing a straight mean? I know sometimes I fall in the trap of doing everything in one pipeline. It is sometimes easier to just break it up into a step or two. –  Feb 11 '21 at 17:07
  • @Adam Thanks, I am new to coding in R hence the challenge – Ekow_ababio Feb 11 '21 at 18:04

1 Answers1

1

The tidying step from your last question works well:

tidy_data = dat %>%
  pivot_longer(
    starts_with("Ozone"),
    names_pattern = "(.*)_(.*)",
    names_to = c(NA, "year"),
    values_to = "ozone"
  ) %>% 
  mutate(year = as.integer(year))

Now you can filter out the years you want to get mean exposure by country / age:

mean_lifetime_exposure = tidy_data %>%
  group_by(CNT, dob) %>%
  filter(year >= dob) %>%
  summarise(mean(ozone))

PS I'm sorry I don't quite understand your first question about country AO.

Edit:

Does this do what you wanted? The logic is a bit convoluted but the code is straightforward.

tidy_data_filtered = tidy_data %>%
  filter(
    !(CNT == "AO" & year != 1998),
    !(CNT == "SL" & !year %in% 1998:2014)
  )
Tarquinnn
  • 501
  • 3
  • 8
  • thanks. I will try the solution you provided. Regarding AO, the survey was done in 1998 hence air quality exposure for respondents from that country need to be confined to just 1998. There is no air quality data for years prior to 1998. Also, my outcome of interest (not in the sample data) occurred in the year of the survey. Meaning, air quality information for subsequent years is not relevant for modelling the association between the outcome and air quality exposure – Ekow_ababio Feb 11 '21 at 18:11
  • So in other words you need to filter out rows where year is not 1998 for AO and delete observations from after 2014 for SL? Won't these simply be missing from your input data? – Tarquinnn Feb 11 '21 at 19:05
  • Ideally, it such be missing from the input data. However, the pooled survey data and the air quality data comes from two different sources. I linked the two based on the x and y coordinates of the sampled sites (for the survey) and air quality data. The air quality data was original a geotiff or raster data for 19 year period (from 1998 to 2016). both the air quality data and survey datasets are from the same source as used in this study https://www.nature.com/articles/s41586-018-0263-3#:~:text=We%20find%20that%20a%2010,higher%20levels%20of%20household%20wealth. – Ekow_ababio Feb 11 '21 at 19:11
  • I meant "Ideally, it should be missing from ... ". Sorry for the typo – Ekow_ababio Feb 11 '21 at 19:26