How can I summarize categorical variables in R by conditions?

Question

I have student dataset similar to this:

ID GENDER AGE ACADEMIC_STANDING
1    M     20   Good
2    F     21   Probation 1
4    M     19   Probation 1
2    F     21   Probation 2
2    F     21   Dismissal
4    M     19   Probation 2
1    M     20   Good

As you can see, we have a student repeating (often more than 30 times) in other rows due to different semesters and courses. I want to ultimately have 1 single row for each student with additional columns based on the ACADEMIC_STANDING column. There will be 3 new columns: If a student has good standing, the good column will be true, others false. If a student has Prob 1 and Prob 2, Prob12 column will be true. Finally, if a student has all Prob 1, Prob 2 and Dismissal, then the Dismissed column will be true.

Expected Output:

ID GENDER AGE   Good    Prob12    Dismissed
1    M     20   True     False      False
2    F     21   False    False      True
4    M     19   False    True       False

This is similar to what I want, however with conditions on the column. I'm not that familiar with R (I use python), only using it for a course project — Sakib Shahriar, Dec 02 '19 at 12:10

score 2 · Accepted Answer · answered Dec 02 '19 at 12:14

For each ID we can check for the conditions to satisfy and assign values accordingly to ACADEMIC_STANDING and then convert the data to wide format using pivot_wider.

library(dplyr)

df %>%
  group_by(ID, GENDER, AGE) %>%
  summarise(ACADEMIC_STANDING = case_when(
     any(ACADEMIC_STANDING == "Good") ~ "Good", 
     all(c("Probation1", "Probation2", "Dismissal") %in% ACADEMIC_STANDING) ~"Dismissal",
     all(c("Probation1", "Probation2") %in% ACADEMIC_STANDING) ~ "Prob12"))  %>%
   mutate(val = TRUE) %>%
   tidyr::pivot_wider(names_from = ACADEMIC_STANDING, values_from = val, 
                      values_fill = list(val = FALSE))

#     ID GENDER   AGE Good  Dismissal Prob12
#   <int> <fct>  <int> <lgl> <lgl>     <lgl> 
#1     1 M         20 TRUE  FALSE     FALSE 
#2     2 F         21 FALSE TRUE      FALSE 
#3     4 M         19 FALSE FALSE     TRUE

data

df <- structure(list(ID = c(1L, 2L, 4L, 2L, 2L, 4L, 1L), GENDER = structure(c(2L, 
1L, 2L, 1L, 1L, 2L, 2L), .Label = c("F", "M"), class = "factor"), 
AGE = c(20L, 21L, 19L, 21L, 21L, 19L, 20L), ACADEMIC_STANDING = structure(c(2L, 
3L, 3L, 4L, 1L, 4L, 2L), .Label = c("Dismissal", "Good", 
"Probation1", "Probation2"), class = "factor")),
class = "data.frame", row.names = c(NA, -7L))

score 1 · Answer 2 · answered Dec 02 '19 at 12:16

Here is my approach:

tibble::tibble(ID = c(1, 2, 4, 2, 2, 4, 1),
               GENDER = c("M", "F", "M", "F", "F", "M", "M"),
               AGE = c(20, 21, 19, 21,21,19,20),
               ACADEMIC_STANDING = c("Good", "Probation 1", 
                                     "Probation 1", "Probation 2", 
                                     "Dismissal", "Probation 2", 
                                     "Good")) %>% 
  tidyr::nest(-c(ID, AGE)) %>% 
  dplyr::mutate(Good = purrr::map_lgl(data, ~ any(.$ACADEMIC_STANDING == "Good")),
                Prob12 = purrr::map_lgl(data, ~ any(.$ACADEMIC_STANDING == "Probation 1") &
                                      any(.$ACADEMIC_STANDING == "Probation 2")),
                Dismissed = purrr::map_lgl(data, ~ all(c("Probation 1", 
                                                     "Probation 2",
                                                     "Dismissal") %in% .$ACADEMIC_STANDING)))

# A tibble: 3 x 6
     ID   AGE           data Good  Prob12 Dismissed
  <dbl> <dbl> <list<df[,2]>> <lgl> <lgl>  <lgl>    
1     1    20        [2 x 2] TRUE  FALSE  FALSE    
2     2    21        [3 x 2] FALSE TRUE   TRUE     
3     4    19        [2 x 2] FALSE TRUE   FALSE

How can I summarize categorical variables in R by conditions?

2 Answers2