Sorry for not providing data. Here is some sample data:
PERCENT <- rnorm(100, sd = 3)
YEAR <- sample(c(1950, 1958, 1963, 1974, 1982, 1994), 100, replace = TRUE)
AGE <- sample(18:90, 100, replace = TRUE)
COUNTRY <- rep(c("Country A", "Country B"), 50)
df <- data.frame(PERCENT, YEAR, AGE, COUNTRY)
I am trying to track age cohorts over time. To that end, I would like to give each case an ID for its age cohort. I know how to do this manually, as shown here:
library(dplyr)
df %>%
  # one clause per survey wave: the 18-27 bracket from 1950, shifted by the years elapsed
  filter((AGE >= 18 & AGE <= 27 & YEAR == 1950) |
         (AGE >= 26 & AGE <= 35 & YEAR == 1958) |
         (AGE >= 31 & AGE <= 40 & YEAR == 1963) |
         (AGE >= 42 & AGE <= 51 & YEAR == 1974) |
         (AGE >= 50 & AGE <= 59 & YEAR == 1982) |
         (AGE >= 62 & AGE <= 71 & YEAR == 1994)) %>%
  mutate(COHORT_ID = "18-27 in 1950")
But doing this for several age cohorts takes a lot of typing. I am trying to write a loop or function that assigns a cohort label to everyone aged x to y in year t, and to everyone aged x+u to y+u in year t+u.
I have tried to write a function that takes a vector of minimum ages, a vector of maximum ages, and a vector of survey-wave years as arguments and adds a label to a new column in the data frame.
Here is what I came up with so far:
function(xmin, xmax, year) {
  df$cohort <- 0 # to initialize the column
  ### here the magic happens
}
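For what it is worth, here is a rough sketch of what I imagine the finished function could look like, using base R only. The name `label_cohort` and the extra `df` and `label` arguments are made up for illustration; `xmin`, `xmax` and `year` are parallel vectors giving the age bracket in each survey wave. Rows that match none of the brackets stay `NA`. The toy data frame is separate from the sample data above.

```r
# Hypothetical helper: label rows whose AGE lies in [xmin[i], xmax[i]] in year[i].
label_cohort <- function(df, xmin, xmax, year, label) {
  stopifnot(length(xmin) == length(xmax), length(xmin) == length(year))
  # initialize the column once, so repeated calls can add further cohorts
  if (!"cohort" %in% names(df)) df$cohort <- NA_character_
  for (i in seq_along(year)) {
    hit <- df$YEAR == year[i] & df$AGE >= xmin[i] & df$AGE <= xmax[i]
    df$cohort[hit] <- label
  }
  df
}

# Toy data for illustration only
toy <- data.frame(AGE = c(20, 30, 40, 70), YEAR = c(1950, 1958, 1958, 1994))

waves <- c(1950, 1958, 1963, 1974, 1982, 1994)
shift <- waves - 1950  # years elapsed since the first wave (u in the text)

toy <- label_cohort(toy, xmin = 18 + shift, xmax = 27 + shift,
                    year = waves, label = "18-27 in 1950")
toy$cohort
```

Computing the brackets as `18 + shift` and `27 + shift` avoids typing each wave's ages by hand. A later call with a different label would overwrite earlier labels on rows that match both, which is the overlap problem I mention in the edit below.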
I checked out this page but they seem to be talking about something else.
If there is an efficient way to do this without using a function, I would be equally appreciative! Thanks in advance!
EDIT: I just realized that each observation could fall into several cohort categories since the age brackets (10 years) and the survey waves (irregular intervals) do not line up. Will a dummy variable for each cohort ID solve this?
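To show what I mean by a dummy variable per cohort, here is a minimal sketch in base R. The two overlapping brackets, the `cohorts` table, and the column names are assumptions chosen for illustration, and the toy data frame is not the survey data. The idea is to shift everyone back to the age they had (or would have had) in 1950 and test that against each bracket.

```r
# Toy data for illustration only
toy <- data.frame(AGE = c(20, 30, 45, 70), YEAR = c(1950, 1958, 1963, 1994))

# Hypothetical cohort definitions: age brackets as of the 1950 wave.
# They overlap on purpose (ages 23-27 belong to both).
cohorts <- data.frame(xmin = c(18, 23), xmax = c(27, 32))

# Age each respondent had, or would have had, in 1950
age_1950 <- toy$AGE - (toy$YEAR - 1950)

# One logical dummy column per cohort; a row can be TRUE in several columns
for (i in seq_len(nrow(cohorts))) {
  col <- paste0("cohort_", cohorts$xmin[i], "_", cohorts$xmax[i])
  toy[[col]] <- age_1950 >= cohorts$xmin[i] & age_1950 <= cohorts$xmax[i]
}

toy
```

In this toy example the last row (age 70 in 1994, so 26 in 1950) is TRUE in both columns, which a single label column could not represent.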