I have a dataframe df
which contains data on the date of vaccination of individuals with unique identification numbers. An individual is considered vaccinated if he has received three vaccinations by the age of two years old. My aim is to calculate the cumulative sum of fully vaccinated individuals with the end goal of plotting the proportion of the under population under three that is fully vaccinated and any given time x
. In my mind I've come up with the perfect code but my intuition is failing for some reason and I get a weird increase during the end of the time period. See below.
After considerable data wrangling we start the example data with dataframe df
where every row is a single vaccination event and a single column dataframe date
that contains every single date in the time period of interest.
glimpse(df)
Observations: 50,469
Variables: 6
$ id <chr> "1000038", "1000038", "1000038", "1000128", "1000380",...
$ n_max <int> 3, 1, 1, 3, 3, 3, 3,... ###total num times before 2 years old
$ age_y <int> 0, 0, 0, 0, 1, 0, 0,... ###current age for this observation
$ age_m <int> 3, 5, 11, 3, 4,... ###current age in months for this obs
$ date_vacc <date> 2013-05-08, 2013-07-03, 2014-01-13,... ###current date obs
$ year <dbl> 2013, 2013, 2014, 2013,... ###current year of obs
glimpse(date)
Observations: 4,017
Variables: 1
$ date_vacc <date> 2005-01-01, 2005-01-02, 2005-01-03, 2005-01-04, 2005-01-05, 2005-01-06, 2005-01-07, 2005-01-08, 2005-01-09, 20...
Now I exploit the structure of df
as to change what each row 'means'. At this point each row represents an observation of a single vaccine event, and the following code, first i)
makes each row represent the date at which the individual received his last vaccine dose, regardless if he received 1, 2 or three in total. Then ii)
it changes the meaning of the row to represent the number of individuals who received their last dose on a given date.
df <-
df[!duplicated(dfid, fromLast = TRUE),] %>% ###i)
droplevels() %>%
right_join(date) %>%
group_by(date_vacc) %>%
summarise(nsum = n_distinct(id, na.rm = TRUE)) ###ii)
df$nsum <-
ifelse(is.na(df$nsum),
0,
df$nsum)
Finally, this code is supposed to add together the number of people that have received their final dose over the last two years, and as a rolling sum, approximate the number of fully vaccinated two year olds in the population at any given date x
. Because it sums over a fixed interval of time I figure it should enter a steady state where the number of people 'exiting' is
lag_vacc <- 2 * 365.25
df$lagsum <- rep(NA, nrow(df))
for (i in (dim(df)[1] - (dim(df)[1] - lag_vacc)):dim(df)[1]) {
df$lagsum[i] <-
sum(df$nsum[(i - lag_vacc):i])
}
However, if I then plot this I get a very strange result that I can't for the life of me explain or correct.
ggplot(df,
aes(x = date_vacc, y = lagsum)) +
geom_point()
It hits the steady state, as predicted. Put then starts increasing again and ends up as 1.3 of the population, i.e. more people vaccinated than exist. This is no longer of any practical importance and even a silly way to represent this data. But I can't figure out where my reasoning is incorrect. Why doesn't this work? Is there a better way to do this?
EDIT: After several days of several hours each I think I finally figured this out. As a recap, the above code calculates a rolling cumulative sum of vaccinated individuals over time based on the date of their 'last' dose of a three dose vaccine. Summing the 'last' dose (representing the second or third dose depending on the situation) is desirable because two doses confer good protection for the first 4-5 years of life even without the third and last dose. Because there is a cut-off point at the end of the x-axis (31-12-2015) the individuals that would otherwise have received their third and 'last' dose after that point, instead are entering the cumulative sum prematurely because their second dose is identified as their 'last'.
EDIT2: The following code generates the population denominator to produce a very similar image as the one above - but transforming the y-axis into a proportion instead of a count.
df_pop <-
pop %>%
mutate(year = as.integer(year)) %>%
filter(grepl("all", pop$gender), age_y >= 1, age_y <= 2, year >= 2005) %>%
select(age_y, year, at_risk) %>%
group_by(year) %>%
summarise(n_atrisk = sum(at_risk))
df <-
df %>%
mutate(year = year(date_vacc)) %>%
left_join(df_pop) %>%
mutate(prop = lagsum / n_atrisk)
ggplot(df,
aes(x = date_vacc, y = prop)) +
geom_line() +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
scale_y_continuous(breaks = pretty(df$prop, n = 10)) +
theme_bw()
dput(head(df, n = 20))
structure(list(date_vacc = structure(c(12784, 12785, 12786, 12787,
12788, 12789, 12790, 12791, 12792, 12793, 12794, 12795, 12796,
12797, 12798, 12799, 12800, 12801, 12802, 12803), class = "Date"),
nsum = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), lagsum = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), year = c(2005,
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005,
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005), n_atrisk = c(8422L,
8422L, 8422L, 8422L, 8422L, 8422L, 8422L, 8422L, 8422L, 8422L,
8422L, 8422L, 8422L, 8422L, 8422L, 8422L, 8422L, 8422L, 8422L,
8422L), prop = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("date_vacc",
"nsum", "lagsum", "year", "n_atrisk", "prop"), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))