I am working with a survey where participants answer the first question with yes or no, followed by a second, open-ended question: "if yes, why?"

I need the percentage of people who answered the second question after saying "yes". Alternatively, I need the number of NAs among the people who answered "yes".

Here is a similar-looking dataset:

#>      helpful     helpfulhow               
#> 1    n           NA
#> 2    y           Because this study cannot be put online. Thus I have to create a random wall of text    
#> 3    n           NA         
#> 4    y           This is a confidential study. Thus the data must be changed.
#> 5    n           NA   
#> 6    n           NA
#> 7    y           This is a confidential study. Thus the data must be changed every time. 
#> 8    y           NA
#> 9    y           Qualitative studies are difficult to assess. Here is a random wall of text.
> str(b)
'data.frame':   9 obs. of  2 variables:
 $ helpful   : Factor w/ 2 levels "n","y": 1 2 1 2 1 1 2 2 2
 $ helpfulhow: Factor w/ 4 levels "Because this study cannot be put online. Thus I have to create a random wall of text.",..: NA 1 NA 4 NA NA 3 NA 2
> dput(head(b))
structure(list(helpful = structure(c(1L, 2L, 1L, 2L, 1L, 1L), .Label = c("n", 
"y"), class = "factor"), helpfulhow = structure(c(NA, 1L, NA, 
4L, NA, NA), .Label = c("Because this study cannot be put online. Thus I have to create a random wall of text.", 
"Qualitative studies are difficult to assess. Here is a random wall of text.", 
"This is a confidential study. Thus the data must be changed every time.", 
"This is a confidential study. Thus the data must be changed."
), class = "factor")), row.names = c(NA, 6L), class = "data.frame")

For example, I want to count how many people who answered 'y' under helpful have NA under helpfulhow. Thanks in advance.

inkyfingers
  • welcome to stackoverflow. You need to give information related to how the data are structured. Please see https://stackoverflow.com/help/minimal-reproducible-example – greengrass62 Jul 06 '20 at 19:47
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jul 06 '20 at 19:47

2 Answers

I made the example dataset below. The code counts the rows where Question 1 is "Yes" and Question 2 is neither NA nor empty (trimws strips answers that contain only whitespace), then divides by the number of rows where Question 1 is "Yes" to get the fraction. percent from the scales package converts that fraction to a percentage.

#>      Name  Q1               Q2
#> 1   Jerry Yes             <NA>
#> 2    Beth  No                 
#> 3 Jessica Yes                 
#> 4   Morty Yes       Aww,Babola
#> 5  Summer  No                 
#> 6    Rick Yes Wubbalubbadubdub


## percentage of people who answered yes to Q1 and also answered Q2
scales::percent(
  nrow(df[df$Q1 == "Yes" & !is.na(df$Q2) & trimws(df$Q2) != "", ]) /
    nrow(df[df$Q1 == "Yes", ])
)

#> [1] "50.0%"
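The same calculation can be unpacked into named steps, which makes the numerator and denominator easier to check. This is a minimal base R sketch on the toy data above; the variable names (said_yes, answered, frac) are illustrative and not part of the original answer.

```r
# Toy data mirroring the example above (factor columns, as in the original)
df <- data.frame(
  Name = c("Jerry", "Beth", "Jessica", "Morty", "Summer", "Rick"),
  Q1   = c("Yes", "No", "Yes", "Yes", "No", "Yes"),
  Q2   = c(NA, "", "       ", "Aww,Babola", "", "Wubbalubbadubdub"),
  stringsAsFactors = TRUE
)

# Who said "Yes" to Q1?
said_yes <- df$Q1 == "Yes"

# Who gave a real answer to Q2 (not NA, not whitespace-only)?
q2_text  <- as.character(df$Q2)
answered <- !is.na(q2_text) & trimws(q2_text) != ""

n_yes          <- sum(said_yes)             # denominator: "Yes" respondents
n_yes_answered <- sum(said_yes & answered)  # numerator: "Yes" and answered Q2
frac           <- n_yes_answered / n_yes

# Format without extra packages
sprintf("%.1f%%", 100 * frac)
#> [1] "50.0%"
```

Using sum() on logical vectors avoids building the subsetted data frames just to count their rows.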

Data:

df <- structure(list(Name = structure(c(2L, 1L, 3L, 4L, 6L, 5L), 
                                      .Label = c("Beth", "Jerry", "Jessica", "Morty", "Rick", "Summer"), class = "factor"), 
                     Q1 = structure(c(2L, 1L, 2L, 2L, 1L, 2L), 
                                    .Label = c("No", "Yes"), class = "factor"), 
                     Q2 = structure(c(NA, 1L, 2L, 3L, 1L, 4L), 
                                    .Label = c("", "       ", "Aww,Babola", "Wubbalubbadubdub"), class = "factor")), 
                class = "data.frame", row.names = c(NA, -6L))

For your dataset (the six rows from dput(head(b))), it would be like this:

scales::percent(nrow(with(b, b[helpful=="y" & (trimws(helpfulhow) != "" & !is.na(helpfulhow)),]))/nrow(with(b, b[helpful=="y",])))

#> [1] "100%"

To make it cleaner, we can use the dplyr package:

library(dplyr)
library(scales)

percent(
  b %>% 
    filter(helpful == "y", !is.na(helpfulhow), trimws(helpfulhow) != "") %>% 
    nrow(.) / {b %>% filter(helpful == "y") %>% nrow(.)})

#> [1] "100%"

or

b %>% 
  group_by(helpful) %>% 
  summarise(percent_helpfulhow = percent(sum(trimws(helpfulhow) != "" & !is.na(helpfulhow)) / n())) %>% 
  filter(helpful == "y") %>% 
  pull(2)

#> [1] "100%"
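The "100%" results above match the six rows available from dput(head(b)). As a rough check against all nine rows printed in the question, the sketch below rebuilds the full table by hand and computes both quantities the question asks for. It assumes helpfulhow can be stored as character rather than factor, and the variable names are illustrative.

```r
# All nine rows as printed in the question; helpfulhow is kept as character
# here for brevity (an assumption; the original column is a factor).
b <- data.frame(
  helpful = factor(c("n", "y", "n", "y", "n", "n", "y", "y", "y")),
  helpfulhow = c(
    NA,
    "Because this study cannot be put online. Thus I have to create a random wall of text.",
    NA,
    "This is a confidential study. Thus the data must be changed.",
    NA,
    NA,
    "This is a confidential study. Thus the data must be changed every time.",
    NA,
    "Qualitative studies are difficult to assess. Here is a random wall of text."
  ),
  stringsAsFactors = FALSE
)

yes <- b$helpful == "y"

n_yes         <- sum(yes)                               # "y" respondents
n_yes_missing <- sum(yes & is.na(b$helpfulhow))         # "y" but NA follow-up
pct_answered  <- 100 * mean(!is.na(b$helpfulhow[yes]))  # % of "y" who answered

n_yes_missing
#> [1] 1
pct_answered
#> [1] 80
```

On the nine-row table, row 8 is the only "y" respondent with a missing follow-up, so four of the five "y" respondents answered.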
M--

Here is a possible solution using the packages dplyr and janitor:

library(dplyr)
library(janitor)

df %>% 
  mutate(na_flag = ifelse(helpful == 'y' & is.na(helpfulhow), "Y", "N")) %>% 
  tabyl(na_flag) %>% 
  adorn_pct_formatting

Which gives us:

 na_flag n percent
       N 6  100.0%

If every response to helpfulhow in this sample dataset (n = 6) were NA, this would show:

 na_flag n percent
       N 4   66.7%
       Y 2   33.3%

That is because two respondents answered y for helpful but did not leave a response for helpfulhow.

If you just want to look at y respondents, you can do:

df %>% 
  filter(helpful == "y") %>%
  mutate(na_flag = ifelse(is.na(helpfulhow), "Y", "N")) %>% 
  tabyl(na_flag) %>% 
  adorn_pct_formatting
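If janitor is not available, roughly the same tabulation for the y respondents can be sketched in base R with table() and prop.table(). The data frame below mimics the six-row sample; the response texts "reason A" and "reason B" are placeholders, not the real answers.

```r
# Six-row sample shaped like the question's data; the text values are
# placeholders (hypothetical), only the NA pattern matters here.
df <- data.frame(
  helpful    = c("n", "y", "n", "y", "n", "n"),
  helpfulhow = c(NA, "reason A", NA, "reason B", NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the respondents who answered "y"
y_only <- df[df$helpful == "y", ]

# Flag missing follow-ups; fixing the levels keeps a zero-count "Y" row visible
na_flag <- factor(ifelse(is.na(y_only$helpfulhow), "Y", "N"),
                  levels = c("N", "Y"))

counts <- table(na_flag)                     # counts per flag
pcts   <- round(100 * prop.table(counts), 1) # percentages per flag

counts
pcts
```

Declaring both factor levels up front means the table still reports a "Y" row with a count of zero when nobody skipped the follow-up.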
Matt