Test for set inclusion and processing data simultaneously in tidyverse

Question

I almost have what I need. I need some help with the last detail! The data set is produced by the following:

stu_vec <- c("A","B","C","D","E","F","G","H","I","J")
college_vec <- c("ATC","CCTC","DTC","FDTC","GTC","NETC", "USC", "Clemson", "Winthrop", "Allen")
sctcs <- c("ATC","CCTC","DTC","FDTC","GTC","NETC")
Student <- sample (stu_vec, size=100,replace=T, prob=c(.08,0.09,0.06,.07,.12,.10,.07,.05,.11,.05))
College <- sample(college_vec, size=100, replace=T,prob=c(.08,.07,.13,.12,.11,.06,.05,.08,.02,.08))

test.dat1 <- as.data.frame(cbind(Student, College))

I am using the following code to create what I need:

library(dplyr)

set.seed(29)
test.dat2 <- test.dat1 %>% 
  group_by(Student, .drop=F) %>% #group by student
  mutate(semester= sequence(n())) %>% #set semester sequence
  summarise(home_school= College[min(which(College %in% sctcs))], # Find first college in sctcs
            seq_home=min(which(College %in% sctcs)), # add column of sequence values
            new_school= if_else(n_distinct(College) > 1, 
            first(College[!(College %in% sctcs) & semester > seq_home]), last(College))) #new_school should be the first non-sctcs school after the sctcs school is found or the last school for that student.

It produces the following table.

I want the NA's to be filled in with the last college for that student. I don't know how to get rid of the NA's. If you know an easier way to produce the same thing please share the knowledge.

Don't use `as.data.frame(cbind(Student, College))`; use `data.frame(Student, College)` instead. It's less typing, and it avoids the problem that `cbind` creates a matrix, converting any numbers you have to `character` because a matrix can hold only one type. – Gregor Thomas, Jan 15 '21 at 19:03
As for filling in `NA`s: `... %>% group_by(Student) %>% tidyr::fill(new_school)`. – Gregor Thomas, Jan 15 '21 at 19:04
Thank you for your feedback, but using `tidyr::fill(new_school)` would throw off the next step, which is to count the number of students who transferred out of sctcs colleges. – asokol, Jan 15 '21 at 19:19
I guess I don't understand then. Maybe you want to replace `last(College)` with `last(College[!is.na(College)])` to get the last non-missing College in your `if_else`? – Gregor Thomas, Jan 15 '21 at 19:27
The last part of the `if_else` statement isn't working and I don't know why. – asokol, Jan 15 '21 at 19:30
A `set.seed()` would be useful to make your random data reproducible. But when I ran it I got a student who has several colleges, all of them in `sctcs`. So, for your `if_else`, `n_distinct(College) > 1` is TRUE, but there are no non-sctcs colleges, so `first(College[!(College %in% sctcs) & semester > seq_home])` is `NA` because `College[!(College %in% sctcs)]` is empty. What do you want to do in this case? – Gregor Thomas, Jan 15 '21 at 19:55
I would like to have the last college in sctcs to be listed. – asokol, Jan 15 '21 at 19:58
Okay, when I fix that issue I then get a student with college order `Allen, FDTC, NETC, DTC`. So, their first school is not `sctcs`, but the rest of the schools are `sctcs`. No non-sctcs schools exist with `semester > seq_home`. What do you want done in this case? – Gregor Thomas, Jan 15 '21 at 20:20
`home_school` is the first school in sctcs to appear; in your example, FDTC would be the `home_school`. `new_school` should be DTC. – asokol, Jan 15 '21 at 20:30
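
Two of the suggestions in the comments above can be sketched together: building the data with `data.frame()` rather than `as.data.frame(cbind())`, and calling `set.seed()` before `sample()` so the draw is reproducible. The `Semester` column is a hypothetical addition used only to show the type coercion, and the shortened vectors and omitted probabilities are simplifications, not the question's data.

```r
stu_vec <- c("A", "B", "C")
college_vec <- c("ATC", "USC", "Clemson")

# Seed first: set.seed() only affects draws made after it is called.
set.seed(29)
Student <- sample(stu_vec, size = 10, replace = TRUE)
College <- sample(college_vec, size = 10, replace = TRUE)
Semester <- seq_along(Student)  # hypothetical numeric column, for illustration

# cbind() builds a character matrix first, so the numeric column stops being numeric.
via_cbind <- as.data.frame(cbind(Student, College, Semester))
# data.frame() keeps each column's own type.
direct <- data.frame(Student, College, Semester)

is.numeric(via_cbind$Semester)  # FALSE
is.numeric(direct$Semester)     # TRUE

# Re-seeding and repeating the first draw reproduces it exactly.
set.seed(29)
identical(Student, sample(stu_vec, size = 10, replace = TRUE))  # TRUE
```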

Captain Hat · Answer 1 · 2021-01-19T17:27:36.257

It's not clear what you're trying to do. But when `!(College %in% sctcs) & semester > seq_home` is FALSE for every row of a student, `College[!(College %in% sctcs) & semester > seq_home]` returns a zero-length character vector, so `first(College[!(College %in% sctcs) & semester > seq_home])` returns `NA`.

When there are no TRUE values in `!(College %in% sctcs) & semester > seq_home`, it's because there are no non-sctcs colleges in any of the semesters after `semester[seq_home]`. If a student transfers from `home_school` to one or more sctcs schools, but never to any non-sctcs schools, you'll get an `NA` value.
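
A minimal base-R sketch of that mechanism, using a made-up enrolment history for one student (the vectors below are illustrative, not taken from the question's data). Here `[1]` stands in for dplyr's `first()`; both give `NA` when asked for the first element of an empty vector.

```r
sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")

# Hypothetical student who starts outside sctcs and then stays inside it
College  <- c("Allen", "FDTC", "NETC", "DTC")
semester <- seq_along(College)
seq_home <- min(which(College %in% sctcs))  # 2: FDTC is the first sctcs college

# Non-sctcs colleges attended after the home school: none, so every entry is FALSE
keep <- !(College %in% sctcs) & semester > seq_home
keep

# Subsetting with an all-FALSE index gives a zero-length vector ...
later_non_sctcs <- College[keep]
length(later_non_sctcs)

# ... and the first element of a zero-length vector is NA
later_non_sctcs[1]
```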

You're effectively asking the wrong question. I'm not sure what question you're trying to ask, but what you're currently asking is:

What's the first non-sctcs school this student attended after they attended their first sctcs school?

Some students, however, never attend a non-sctcs school after attending their first sctcs school. For this reason, you get an NA response, which is the correct answer to the question.

edited Jan 19 '21 at 17:27

answered Jan 19 '21 at 14:58

I understand that it returns a zero-length vector. What I want to know is how to fix it. In sum, what I am trying to do is count the number of students who transferred out of the sctcs schools. I started by selecting the first college in sctcs, but sometimes the first sctcs college is not the first college a student attends. Then I found the first college outside sctcs, which has to come after they entered an sctcs school. For example, if a student goes Clemson -> NETC -> USC, `home_school` = NETC and `new_school` = USC; otherwise `new_school` is the last school they attended. – asokol Jan 19 '21 at 17:06
Hmm, I think I might understand now? I've added some more detail to the answer above. The gist of it is that you're asking the 'wrong question' with your sub-setting expression. – Captain Hat Jan 19 '21 at 17:28
I understand how I was asking the wrong question. Any ideas on how to fix the issue? – asokol Jan 19 '21 at 17:56
I can't answer that, because I'm still not sure what question you're trying to ask. If you can formulate that question precisely, you'll be over halfway to answering your own question. – Captain Hat Jan 19 '21 at 18:01
Incidentally is this an assignment? – Captain Hat Jan 19 '21 at 18:02
@Captain Hat no, this is something of a pet project for work. The question I am asking is "What's the first non-sctcs school this student attended after they attended their first sctcs school?" and, if they did not go to a non-sctcs school, what is the last school they attended? – asokol Jan 19 '21 at 18:12

Captain Hat · Accepted Answer · 2022-05-17T09:06:09.407

This ought to do it:

test.dat2 <- test.dat1 |>
  group_by(Student, .drop = FALSE) |>
  # Number semesters within each student so they are comparable with seq_home
  mutate(semester = row_number()) |>
  # Additional mutate step for clarity & simplicity;
  # first() gives NA rather than Inf when a student has no sctcs college
  mutate(seq_home = first(which(College %in% sctcs))) |>
  summarise(home_school = College[first(seq_home)],
            new_school =
              College[
                coalesce(
                  first(which(!(College %in% sctcs) & semester > seq_home)),
                  first(seq_home),
                  length(College))
              ]
            )

We're indexing `College` with `coalesce()`, which returns the first non-missing value from its arguments. Initially, we look for the first non-sctcs college they attended after attending `home_school`. If that returns `NA` (i.e. there is no such college), we fall back to `seq_home`, which points at `home_school` again, the first sctcs college they attended. If `seq_home` is also `NA` (the intended case being a student who never attended any sctcs college), we fall back to `length(College)`, which subsets `College` to give us the last college they attended.
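
To see the fallback order in isolation, here is a small sketch with made-up index values (the variable names and the enrolment history are illustrative). It assumes dplyr is attached, as in the question.

```r
library(dplyr)

College <- c("Allen", "FDTC", "NETC", "DTC")

first_later_non_sctcs <- NA_integer_  # no non-sctcs college after home_school
seq_home <- 2L                        # position of the first sctcs college

# coalesce() walks its arguments left to right and keeps the first non-missing one
idx <- coalesce(first_later_non_sctcs, seq_home, length(College))
College[idx]

# If seq_home were missing too, the last position would be used instead
College[coalesce(NA_integer_, NA_integer_, length(College))]
```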

I'm still not 100% clear on whether this does exactly what you want - I don't know if you'd considered the case where there were no sctcs colleges. There are none on this seed, but it could easily have happened.
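
For the case where a student never attends an sctcs college, one possible way to make every branch explicit is a small base-R helper. This is only a sketch under an assumed rule taken from the asker's comment: use the first non-sctcs college after the first sctcs college, otherwise the last college attended. The function name `pick_schools` is hypothetical, and the example histories are made up.

```r
sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")

# Hypothetical helper: takes one student's colleges in semester order
pick_schools <- function(college, sctcs) {
  in_sctcs <- college %in% sctcs
  seq_home <- which(in_sctcs)[1]  # NA (not Inf) if there is no sctcs college
  if (is.na(seq_home)) {
    # Never attended an sctcs college: no home school, report the last college
    return(c(home_school = NA_character_,
             new_school  = college[length(college)]))
  }
  # Positions of non-sctcs colleges attended after the home school
  later <- which(!in_sctcs & seq_along(college) > seq_home)
  new_idx <- if (length(later) > 0) later[1] else length(college)
  c(home_school = college[seq_home], new_school = college[new_idx])
}

pick_schools(c("Clemson", "NETC", "USC"), sctcs)        # home NETC, new USC
pick_schools(c("Allen", "FDTC", "NETC", "DTC"), sctcs)  # home FDTC, new DTC
pick_schools(c("USC", "Clemson"), sctcs)                # home NA, new Clemson
```

Assuming the rows of `test.dat1` are already in semester order, something like `t(sapply(split(test.dat1$College, test.dat1$Student), pick_schools, sctcs = sctcs))` would give one row per student.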

I think your original question betrays a poor understanding of what NAs are: you've asked about how to 'get rid of NAs', but the NAs are the correct answer. Even the revisions to your question do not account for every possibility, which makes the return of NAs possible in unforeseen cases, such as `!any(College %in% sctcs)`. – Captain Hat, Jan 19 '21 at 20:01
