How can I tidy student enrollment data on a per semester basis?

Question

I have a dataset that currently lists student information on a term basis (i.e., 201610, 201620, 201630, 201640, 201710, etc.) with suffix 10 = fall, 20 = winter, 30 = spring, and 40 = summer. Not all terms are necessarily listed for every student.

What I would like to do is identify the first term in which a student was enrolled, presumably the fall, as T1, and subsequent terms as T2, T3, etc. Since some students may take a winter summer term, I would like to identify those as T1_Winter, T2_Summer, etc.

I've been able to isolate the individual terms for which a student has enrolled, and have been able to identify the first, intermediate, and last terms as 1, 2, 3, etc. However, I can't manage to wrap my head around how to identify fall and spring as 1, 2, 3, 4, and the intermediary terms, winter and summer, and 1.5, 2.5, 3.5, 4.5, etc.

# Create the sample dataset
data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2),
  RegTerm = c(201810, 201820, 201830, 201910, 201930, 201940, 202010))
)

# Isolate student IDs and terms
stdTerm <- subset(data, select = c("ID","RegTerm"))

# Sort according to ID and RegTerm
stdTerm <- stdTerm[
  with(stdTerm, order(ID, RegTerm)),
  ]

# Remove duplicate combinations of ID and term
y <- stdTerm[!duplicated(stdTerm[c(1,2)]),]

# Create an index to identify the term number
# for which a student enrolled
library(dplyr)
z <- y %>% 
  arrange(ID, RegTerm) %>% 
  group_by(ID) %>% 
  mutate(StdTermIndex = seq(n()))

Right now, it's identifying the progression of all terms for a student as 1, 2, 3, etc., but not winter and summer as intermediary terms. That is, if a student enrolled in fall and winter, winter will appear as 2 and spring will appear as 3.

In the sample data provided, I would like Student ID 1 to reflect 201810 as 1, 201820 as 1.5, and 201830 as 2, etc. Any suggestions or previous code I could reference to wrap my head around how I can code the intermediary semesters?

give us a sample data so we can better understand your problem — Felipe Alvarenga, Mar 27 '19 at 20:29
Also, check this out https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Felipe Alvarenga, Mar 27 '19 at 20:30
Thanks, @FelipeAlvarenga! My apologies as it's my first time posting here. I've included a sample dataset in my question and hope it clarifies the problem. — Anna K, Mar 27 '19 at 20:38

score 0 · Answer 1 · answered Mar 27 '19 at 23:08

I think a good way to do this would be to separate your RegTerm column into year and suffix and then apply some condition formula once you have the values split up.

The below code does that, we just have to then apply it to the whole column and do some rejigging.

paste(strsplit(as.character(201810), "")[[1]][1:4], collapse = ""))
# "2018"
paste(strsplit(as.character(201810), "")[[1]][5:6], collapse = ""))
# "10"

So to do it on the data frame you want to use something like lapply and then unlist the result and add a new column. After that you can change the values to numeric and then use some conditional statements in a mutate function to set the intermediary values etc.

z$year <- unlist(lapply(z$RegTerm, function(x) paste(strsplit(as.character(x), "")[[1]][1:4], collapse = "")))
z$suf <- unlist(lapply(z$RegTerm, function(x) paste(strsplit(as.character(x), "")[[1]][5:6], collapse = "")))

It looks a bit ugly but all it is doing is separating RegTerm then selecting the first 4 or last 2 characters for year and suf respectively then collapsing (using collapse = "" in paste) them into a single string. We lapply this to the whole column then unlist it to make vector.

I would recommend understanding the first two lines of code in this answer and then it will be made obvious.

Felipe Alvarenga · Accepted Answer · 2019-03-28T12:26:38.140

So, to do it in your sample, I created a handle variable that tells me whether the RegTerm is even or odd.

The reason is simple, odd RegTerm means it is a regular term, whereas even ones will be either winter or summer terms.

library(dplyr)
data <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2),
  RegTerm = c(201810, 201820, 201830, 201910, 201930, 201940, 202010)
  )

dat <- data %>%
  mutate(term = str_extract(RegTerm, '(?<=\\d{4})\\d{1}(?=0)'),
         term = as.numeric(term) %% 2) %>%
  group_by(ID) %>%
  mutate(numTerm = cumsum(term),
         numTerm = ifelse(term == 0, numTerm + 0.5, numTerm))

The first mutate extracts the 5th digit in the RegTerm column and get the rest of its division by 2. If it equals 1, it means it is a regular term, otherwise it will be either summer or winter.

Next I take the cumulative sum of this variable, which will give you in which RegTerm the student is. Then, for every term == 0 I add to numTerm 0.5, to account for the winter and summer terms.

# A tibble: 7 x 4
# Groups:   ID [2]
     ID RegTerm  term numTerm
  <dbl>   <dbl> <dbl>   <dbl>
1     1  201810     1     1  
2     1  201820     0     1.5
3     1  201830     1     2  
4     2  201910     1     1  
5     2  201930     1     2  
6     2  201940     0     2.5
7     2  202010     1     3

This way, if there is a student starting in a winter term, numTerm will be assigned a 0.5 value, having numTerm = 1 only when he reaches a regular term (term == 1)

How can I tidy student enrollment data on a per semester basis?

2 Answers2