I have a dataset on college enrollment, and I'm trying to find the proportion of students enrolled in biology at each institution. First I find the total enrollment (`EFTOTLT`) for each school using:

    # find sum of students by school
    total_enrollment <- school_data_unit_cip %>%
      group_by(UNITID) %>%
      summarise(Freq = sum(EFTOTLT))

This yields a tibble that's 2,207 x 2. Then I find the biology enrollment for each school using:

    # find total biology enrollment by school
    total_biol_enrollment <- school_data_unit_cip %>%
      group_by(UNITID) %>%
      filter(CIPCODE == "26") %>%
      summarise(Freq = sum(EFTOTLT))

Then I realize this yields a tibble that's 1,560 x 2. So there are obviously schools that don't offer biology or don't have biology students.

Is there a way to deselect schools from the first tibble that don't have the CIPCODE 26? Or I guess is there a way to remove schools from the first list that don't exist in the second list?

  • Questions on SO (especially in R) do much better if they are reproducible and self-contained. By that I mean including sample representative data (perhaps via `dput(head(x))` or building data programmatically (e.g., `data.frame(...)`), possibly stochastically), perhaps actual output (with verbatim errors/warnings) versus intended output. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Mar 08 '22 at 19:41

2 Answers

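Updated after the remarks in the other answer.

I think you can filter those schools out if you group first, but I don't know for sure without the data. This keeps every row for schools that have at least one row with `CIPCODE == "26"` (the original `!any(...)` would have kept only the schools without biology):

    total_biol_enrollment <- school_data_unit_cip %>%
      group_by(UNITID) %>%
      filter(any(CIPCODE == "26"))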
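
Since the real data isn't available, here is a minimal sketch of the group-then-filter idea on made-up data. The `toy` tibble and its values are invented for illustration; only the column names `UNITID`, `CIPCODE`, and `EFTOTLT` come from the question. It assumes each school has several rows per program (for example one per demographic), and it keeps all rows of any school that has at least one `"26"` row, so no demographic rows are lost.

```r
library(dplyr)

# Made-up data: several rows per school, as if split by demographic
toy <- tibble(
  UNITID  = c(1, 1, 1, 2, 2, 3, 3),
  CIPCODE = c("26", "26", "52", "52", "52", "26", "11"),
  EFTOTLT = c(10, 5, 20, 30, 15, 8, 12)
)

# Keep every row of any school with at least one CIPCODE "26" row
with_biol <- toy %>%
  group_by(UNITID) %>%
  filter(any(CIPCODE == "26")) %>%
  ungroup()

# Total enrollment, now restricted to schools that have biology rows
total_enrollment <- with_biol %>%
  group_by(UNITID) %>%
  summarise(Freq = sum(EFTOTLT))
```

In this toy example school 2 has no `"26"` rows, so it is dropped entirely, while schools 1 and 3 keep all of their rows, including the non-biology ones.
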
ReneSch78

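Without sample data it's a guess, but ... assuming that each school may have more than one CIPCODE, and you want only schools that have at least one row with `CIPCODE == "26"`, then perhaps group by school first so the test is applied per school rather than to the whole column:

    school_data_unit_cip %>%
      group_by(UNITID) %>%
      filter("26" %in% CIPCODE)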
r2evans
  • Yeah, unfortunately I did a tragic job of explaining. Essentially, there are tens of thousands of rows because each observation is subcategorized by various demographics. So by filtering out an observation without a 26 CIPCODE I could just be eliminating a demographic at a school from the list. – Ryan O'Toole Mar 08 '22 at 19:51
  • If I can make a list of the UNITIDs from the second tibble, I guess I'm wondering if I can use that to my advantage by keeping only those schools from the original list? – Ryan O'Toole Mar 08 '22 at 19:52
  • I really don't know, and would prefer to not speculate without knowing your data. – r2evans Mar 08 '22 at 19:56
  • No worries, appreciate your help. It's a pretty low-stakes analysis; I was supposed to do it in Excel for a class, but I'm self-teaching R so I thought I'd try. – Ryan O'Toole Mar 08 '22 at 19:58
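
The idea in the comments (use the UNITIDs from the second tibble to restrict the first) can be sketched with dplyr's `semi_join()`, which keeps the rows of one table that have a match in another without adding columns. The data below is made up, and the names `toy`, `enrollment_biol_schools`, `biol_share`, and `prop_biol` are hypothetical; whether summing `EFTOTLT` across all rows really gives a school's total enrollment depends on how the actual dataset is structured, which isn't known here.

```r
library(dplyr)

# Made-up data standing in for school_data_unit_cip
toy <- tibble(
  UNITID  = c(1, 1, 1, 2, 2, 3, 3),
  CIPCODE = c("26", "26", "52", "52", "52", "26", "11"),
  EFTOTLT = c(10, 5, 20, 30, 15, 8, 12)
)

# The two summaries from the question
total_enrollment <- toy %>%
  group_by(UNITID) %>%
  summarise(Freq = sum(EFTOTLT))

total_biol_enrollment <- toy %>%
  filter(CIPCODE == "26") %>%
  group_by(UNITID) %>%
  summarise(Freq = sum(EFTOTLT))

# Keep only schools in the first tibble whose UNITID appears in the second
enrollment_biol_schools <- total_enrollment %>%
  semi_join(total_biol_enrollment, by = "UNITID")

# Alternatively, join the two summaries and compute the proportion directly
biol_share <- total_biol_enrollment %>%
  inner_join(total_enrollment, by = "UNITID",
             suffix = c("_biol", "_total")) %>%
  mutate(prop_biol = Freq_biol / Freq_total)
```

An `inner_join()` drops schools without biology rows automatically. If those schools should instead appear with a proportion of 0, a `left_join()` from `total_enrollment` followed by replacing the missing biology counts with 0 would be one option.
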