How to get right proportions based on a subset of the data set in R

Question

I want to calculate the proportion of plan with respect to school type in the dataset below. Thing is, I have to first subset the dataset a couple of times. In this case, for example, I need to subset the dataset so that I only get schools which offer from level 0 up to level 04. How can I do that?

edit:

The way below is working, but I'm not filtering all schools that have all levels from level 0 up to level 04 , I'm getting the schools that have either of them. Ideas to that would be much appreciated.

data description:

# SCHOOL = name of the school
# Q9 = type of school
# Q11 = levels that the schools offer
# Q40 = types of planning that the school offers related to each level.

Note: All plannings are related to a school level. Hence, If I don't filter by the levels that each school offers and just calculate this columns' prop, my results will be misleading because each school can offer different level options (all of them, 1, 2 of them, etc)

my attempt:


### D) filter schools which only offer UP TO Level 04 ###

##################################### Quest40_2 ##########################
### check vector's names: 

unique(quest40_2$Q11)

unwanted_4_2 <- quest40_2 %>%
  filter(Q11 %in% c('level05','level06')) ### FILTER UNWANTED SCHOOL LEVELS

### create a vector with UNWANTED SCHOOL NAMES:

filter_vec_4_2 <- unique(unwanted_4_2$SCHOOL) ### get unique names (only school has 1 name)

### assign the original dataframe to a dummy data frame: 

filtered_df_4_2 <- quest40_2 

### loop over unwanted schools' names to remove them: 

for (i in 1 :length(filter_vec_4_2)) {
  
  filtered_df_4_2 <- filtered_df_4_2[!filtered_df_4_2$SCHOOL == filter_vec_4_2[i],]
  
}

I need to count how many times EACH 'plan' occurs by each type of 'school' (and get its proportion). Problem = each school can have more than one 'plan' type. Since each school can have more than one plan type, if we want to count the proportion of plan types according to each school type, we cannot divide by n

b <- filtered_df_4_2 %>% drop_na(Q40) %>% 
  count(Q40, Q9) %>%
  group_by(Q9)

Which leaves me to:

### counting the unique schools by each type of 'curriculo':

b2 <- b2 %>%  
  select(SCHOOL, Q9) %>% 
  unique() %>% count(Q9)

## join dataframes and get the proportion of schools that have each type of 'planejamento'
## within the curriculum types 

c <- b %>% full_join(b2, by = 'Q9') %>% 
     mutate(prop = round((n.x/n.y *100), digits = 2)) %>% 
     select(-n.x, -n.y)

Q1 = I don't think I'm filtering it correctely since I'm not exclusively filtering the schools that offer alll levels up to 01 to 04. I guess I'm doing an 'or' not an 'and'
Q2 = Is there a way to avoid the loop with tidyverse ? Thanks in adv.
Ultimately, I'm trying to get the starts to => 'do the schools plan their agenda for the levels they offer or not and how does the type of school impact this?' (later this can be modeled, but now I just need percentages)
data:

dput(quest40_2)              
structure(list(SCHOOL = structure(c("School1", "School1", "School1", 
"School1", "School1", "School1", "School1", "School2", "School2", 
"School2", "School2", "School2", "School2", "School2", "School3", 
"School3", "School3", "School3", "School3", "School3", "School3", 
"School3", "School3", "School4", "School4", "School4", "School4", 
"School4", "School4", "School5", "School5", "School5", "School5", 
"School5", "School5", "School6", "School6", "School6", "School6", 
"School6", "School7", "School7", "School7", "School7", "School7", 
"School7", "School8", "School8", "School8", "School8", "School9", 
"School9", "School9", "School9", "School9", "School10", "School10", 
"School10", "School10", "School10", "School11", "School11", "School11", 
"School11", "School11", "School11", "School11", "School11", "School11", 
"School12", "School12", "School12", "School12", "School12", "School12", 
"School12", "School13", "School13", "School13", "School13", "School13", 
"School13", "School13", "School13", "School13", "School13", "School14", 
"School14", "School14", "School15", "School15", "School15", "School15", 
"School15", "School15", "School16", "School16", "School16", "School16", 
"School16", "School16", "School16", "School17", "School17", "School17", 
"School17", "School17", "School18", "School18", "School18", "School18", 
"School18", "School19", "School19", "School19", "School19", "School19", 
"School19", "School20", "School20", "School20", "School21", "School21", 
"School21", "School21", "School21", "School22", "School22", "School22", 
"School22", "School22", "School23", "School23", "School23", "School23", 
"School23", "School23", "School24", "School24", "School24", "School24", 
"School24", "School24", "School25", "School25", "School25", "School25", 
"School25", "School26", "School26", "School26", "School26", "School26", 
"School26", "School26", "School26", "School26", "School27", "School27", 
"School27", "School27", "School27", "School27", "School27", "School28", 
"School28", "School28", "School28", "School28", "School28", "School28", 
"School28", "School29", "School29", "School29", "School29", "School29", 
"School29", "School30", "School30", "School30", "School30", "School30", 
"School30", "School30", "School30", "School30", "School30", "School30", 
"School31", "School31", "School31", "School31", "School31", "School31", 
"School31", "School31", "School31", "School31", "School31", "School32", 
"School32", "School32", "School32", "School32", "School32", "School32", 
"School32", "School32", "School32", "School32", "School33", "School33", 
"School33", "School33", "School33", "School33", "School33", "School34", 
"School34", "School34", "School34", "School34", "School34", "School34", 
"School34", "School35", "School35", "School35", "School35", "School36", 
"School36", "School36", "School36", "School36", "School37", "School37", 
"School37", "School37", "School37", "School37", "School37", "School37", 
"School38", "School38", "School38", "School38", "School39", "School39", 
"School39", "School39", "School39", "School39"), class = c("glue", 
"character")), Q9 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 
3L, 3L, 3L, 3L, 3L), .Label = c("typeA", "typeB", "typeC", "typeD"
), class = "factor"), Q11 = structure(c(3L, 7L, 4L, 2L, NA, NA, 
NA, 7L, 4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, NA, NA, NA, 
NA, 4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, NA, 7L, 4L, 2L, 
1L, NA, 4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, NA, 4L, 2L, 1L, 5L, 
NA, 4L, 2L, 1L, 5L, NA, 4L, 2L, 1L, 5L, NA, NA, NA, NA, NA, 7L, 
4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA, NA, 
5L, 6L, NA, 4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, 5L, NA, NA, NA, 
7L, 4L, 2L, 1L, NA, 7L, 4L, 2L, 1L, NA, 4L, 2L, 1L, NA, NA, NA, 
2L, 1L, NA, 4L, 2L, 1L, 5L, NA, 4L, 2L, 1L, 5L, NA, 7L, 4L, 2L, 
1L, NA, NA, 4L, 2L, 1L, NA, NA, NA, 4L, 2L, 1L, 5L, NA, 7L, 4L, 
2L, 1L, 5L, NA, NA, NA, NA, 2L, 1L, 5L, NA, NA, NA, NA, 3L, 7L, 
4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, 5L, NA, NA, 7L, 4L, 2L, 1L, 
5L, 6L, NA, NA, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA, 
NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA, NA, NA, 7L, 4L, 2L, 
1L, 5L, 6L, NA, 7L, 4L, 2L, 1L, 5L, NA, NA, NA, 3L, 7L, 4L, NA, 
7L, 4L, 2L, NA, NA, 4L, 2L, 1L, 5L, NA, NA, NA, NA, 4L, 2L, 1L, 
NA, 7L, 4L, 2L, 1L, 5L, NA), .Label = c("level04", "level03", 
"level0", "level02", "level05", "level06", "level01"), class = "factor"), 
    Q40 = structure(c(NA, NA, NA, NA, 2L, 6L, 5L, NA, NA, NA, 
    NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA, 
    NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, 
    NA, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, 
    NA, NA, NA, NA, 1L, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 1L, NA, 
    NA, NA, NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, NA, 2L, 6L, 4L, 
    3L, NA, NA, 1L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 2L, 
    6L, 5L, NA, NA, NA, NA, 2L, NA, NA, NA, NA, 1L, NA, NA, NA, 
    2L, 6L, 5L, NA, NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 
    1L, NA, NA, NA, NA, 2L, 1L, NA, NA, NA, 2L, 6L, 5L, NA, NA, 
    NA, NA, 1L, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA, NA, 
    2L, 6L, 5L, 4L, NA, NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA, 
    NA, 2L, 1L, NA, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 3L, NA, 
    NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 3L, NA, NA, NA, NA, NA, 
    NA, 2L, 6L, 5L, 4L, 3L, NA, NA, NA, NA, NA, NA, 1L, NA, NA, 
    NA, NA, NA, 2L, 6L, 5L, NA, NA, NA, 2L, NA, NA, NA, 2L, 6L, 
    NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA, NA, 1L, NA, NA, NA, 
    NA, NA, 1L), .Label = c("none", "plan1_level0upto02", "plan5_level_05", 
    "plan4_level_05", "plan3_level_04", "plan2_level_03"), class = "factor")), row.names = c(NA, 
-253L), class = c("tbl_df", "tbl", "data.frame"))

At first glance it seems strange that whenever there's a plan, levels are NA. So any question regarding a connection between these should be tricky. A first analysis of plans and type can be done by counting after grouping, e.g. `df %>% group_by(Q40, Q9) %>% reframe(n = n()) %>% distinct()` — Andre Wildberg, Feb 17 '23 at 19:27
hi, @AndreWildberg , so the NAs come from Google Forms (https://stackoverflow.com/questions/75424016/nas-in-a-column-that-doesnt-contain-nas-r). I realized that it was better to keep them in order to avoid miscouting the data. concerning the solution, ```reframe``` isn't working for me, I have ```tidyverse``` loaded, but it is saying that it cannot find the function...funny thing — Larissa Cury, Feb 17 '23 at 22:09
It's new in `dplyr` 1.1.0. You can use `summarize` in earlier versions. — Andre Wildberg, Feb 17 '23 at 22:29
@AndreWildberg oh, I see, I'll update later. It worked with ```summarize```, thanks, but it still haven't solved my problem...can you look at the edit? — Larissa Cury, Feb 17 '23 at 22:35

Andre Wildberg · Answer 1 · 2023-02-17T23:42:52.117

This might be what you need or give you a good start. It shows the number of plans per type with levels 0-4, also shows the percentages.

left_join(na.omit(quest40_2[, c("SCHOOL", "Q9", "Q11")]), 
          na.omit(quest40_2[, c("SCHOOL", "Q40")]), c("SCHOOL"), 
          multiple = "all") %>% 
  group_by(Q9, Q40) %>% 
  filter(sub(".*(\\d+)$", "\\1", Q11) <= 4) %>% 
  summarize(n = n(), .groups = "drop") %>% 
  mutate(percentage = n / sum(n) * 100) %>% 
  print(n = Inf)
# A tibble: 23 × 4
   Q9    Q40                    n percentage
   <fct> <fct>              <int>      <dbl>
 1 typeA none                   3      0.974
 2 typeA plan1_level0upto02     6      1.95 
 3 typeA plan5_level_05         4      1.30 
 4 typeA plan4_level_05         6      1.95 
 5 typeA plan3_level_04         6      1.95 
 6 typeA plan2_level_03         6      1.95 
 7 typeB none                  21      6.82 
 8 typeB plan1_level0upto02    47     15.3  
 9 typeB plan5_level_05         4      1.30 
10 typeB plan4_level_05        14      4.55 
11 typeB plan3_level_04        31     10.1  
12 typeB plan2_level_03        31     10.1  
13 typeC none                  32     10.4  
14 typeC plan1_level0upto02    21      6.82 
15 typeC plan4_level_05         4      1.30 
16 typeC plan3_level_04        15      4.87 
17 typeC plan2_level_03        18      5.84 
18 typeD none                   3      0.974
19 typeD plan1_level0upto02     8      2.60 
20 typeD plan5_level_05         8      2.60 
21 typeD plan4_level_05         8      2.60 
22 typeD plan3_level_04         4      1.30 
23 typeD plan2_level_03         8      2.60

How to get right proportions based on a subset of the data set in R

1 Answers1

Linked