
Sorry for not providing data. Here is some sample data:

PERCENT <- rnorm(100, sd = 3)
YEAR <- sample(c(1950, 1958, 1963, 1974, 1982, 1994), 100, replace = TRUE)
AGE <- sample(c(18:90), 100, replace = TRUE)
COUNTRY <- rep(c("Country A", "Country B"), 50)
df <- data.frame(PERCENT, YEAR, AGE, COUNTRY)

I am trying to track age cohorts over time. To that end, I would like to give each case a unique ID for their age cohort. I know how to do this manually as shown here:

library(dplyr)

# Keep one cohort (aged 18-27 in 1950) across all survey waves
df %>% 
  filter((AGE >= 18 & AGE <= 27 & YEAR == 1950) | 
         (AGE >= 26 & AGE <= 35 & YEAR == 1958) |
         (AGE >= 31 & AGE <= 40 & YEAR == 1963) |
         (AGE >= 42 & AGE <= 51 & YEAR == 1974) | 
         (AGE >= 50 & AGE <= 59 & YEAR == 1982) |
         (AGE >= 62 & AGE <= 71 & YEAR == 1994)) %>%   
  mutate(COHORT_ID = "18-27 in 1950")

But doing this for several age cohorts takes a lot of typing. I am trying to write a loop or function that assigns a cohort label to all people between ages x and y in year t, and to people between ages x+u and y+u in year t+u.

I have tried to write a function that takes a vector of minimum ages, a vector of maximum ages, and a vector of survey-wave years as arguments and adds a label to a new column in the data frame.

Here is what I came up with so far:

function(xmin, xmax, year) {
  df$cohort <- 0 # to initialize the column
  ### here the magic happens
}

I checked out this page but they seem to be talking about something else.

If there is an efficient way to do this without using a function, I would be equally appreciative! Thanks in advance!

EDIT: I just realized that each observation could fall into several cohort categories since the age brackets (10 years) and the survey waves (irregular intervals) do not line up. Will a dummy variable for each cohort ID solve this?

Tea Tree
    What does your data look like? Why is the year a string and not numeric? You could subtract the age from the year to get the birth year, which falls in a certain range, and use it to assign the cohort. – Linus Dec 17 '17 at 07:48

1 Answer


I'm not entirely sure I understand your problem, so the following is based on my interpretation of what you're trying to achieve.

We first set a reference year and express every AGE value, whatever its YEAR, as the age that person has in the reference year. Here I choose max(df$YEAR) as the reference year.

maxYEAR <- max(df$YEAR);
maxYEAR;
#[1] 1994

# Calculate age at reference year maxYEAR
df$normAGE <- maxYEAR - df$YEAR + df$AGE;
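
To see why this normalisation tracks a cohort, here is a small sanity check, assuming a hypothetical respondent who is 18 in 1950 and is observed again in every later wave. The names `waves`, `age_at_wave` and `norm_age` are illustrative and not part of the data above.

```r
# Survey waves from the question
waves <- c(1950, 1958, 1963, 1974, 1982, 1994)

# Hypothetical cohort member: aged 18 in 1950, so aged 18 + u after u years
age_at_wave <- 18 + (waves - 1950)

# Same formula as above: age expressed at the reference year max(waves)
norm_age <- max(waves) - waves + age_at_wave
norm_age
#[1] 62 62 62 62 62 62
```

The normalised age is the same in every wave, so rows that belong to the same cohort end up with the same value regardless of when they were surveyed.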

We can then bin the normalised age values (at reference year 1994) using cut.

# Bin normalised years in 10 year bins
df$ageBin <- cut(df$normAGE, breaks = seq(0, max(df$normAGE) + 10, by = 10));
head(df);
#     PERCENT YEAR AGE   COUNTRY normAGE    ageBin
#1  4.3026044 1974  41 Country A      61   (60,70]
#2 -0.2318759 1982  44 Country B      56   (50,60]
#3  2.2174117 1994  47 Country A      47   (40,50]
#4 -5.2758142 1994  43 Country B      43   (40,50]
#5 -0.2094757 1963  71 Country A     102 (100,110]
#6  1.3557166 1982  48 Country B      60   (50,60]

If necessary, we can get the bin number with as.numeric(df$ageBin).
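
If labels aligned with the brackets in the question are preferred, a similar result can be sketched with approximate birth years (YEAR - AGE) instead of cut. This is only a sketch under stated assumptions: `label_cohort`, `anchor` and `width` are hypothetical names, the brackets are assumed to be 10 years wide and aligned so that birth years 1923-1932 (ages 18-27 in 1950) form one bracket, and the birth year can be off by one because AGE is in whole years.

```r
# Hypothetical helper (not from the original answer): label each row by
# the birth-year bracket it falls in.
# Assumption: brackets are `width` years wide and one of them starts at
# `anchor` (1923, i.e. the people aged 18-27 in 1950).
label_cohort <- function(year, age, anchor = 1923, width = 10) {
  birth <- year - age  # approximate birth year
  # Lower edge of the bracket containing `birth`
  lower <- anchor + width * floor((birth - anchor) / width)
  sprintf("born %d-%d", as.integer(lower), as.integer(lower + width - 1))
}

label_cohort(c(1950, 1958, 1994), c(18, 36, 62))
#[1] "born 1923-1932" "born 1913-1922" "born 1923-1932"
```

Applied to the data frame, this would be df$COHORT_ID <- label_cohort(df$YEAR, df$AGE). Because the brackets partition the birth years, every row receives exactly one label.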


Sample data

# Sample data
set.seed(2017);
PERCENT <- rnorm(100, sd = 3)
YEAR <- sample(c(1950, 1958, 1963, 1974, 1982, 1994), 100, replace = TRUE)
AGE <- sample(c(18:90), 100, replace = TRUE)
COUNTRY <- rep(c("Country A", "Country B"), 50)
df <- data.frame(PERCENT, YEAR, AGE, COUNTRY)
Maurits Evers