1

I have a huge dataset similar to the one below:

Df <- data.frame("Candidate ID"   = c(1042,  1042, 1113, 1113, 1146, 1277, 1381, 1381, 1402, 1402),
                 "Application ID" = c(1040, 13006, 1111, 1125, 1144, 1274,   41,   61,   64,   65),
                  "Question"     = c("Yes", "Yes", "No","Yes", "Yes","No", "No", "No","Yes", "No"))

Background: Candidates can apply to multiple job requisitions and receive each time a different application id. One candidate id can also apply to up to 25 job requisitions, i.e. apply 25 different times. In my simplified example, the maximum is two.

data

I would like to extract all candidate ids where the answers to the question are all "yes". In the example, I would like to extract only the candidate ids 1042 and 1146. How can I do this in R?

I checked in StackOverflow and could not find a matching answer, perhaps also to difficulties in describing the problem. At least I made an effort in creating a minimum reproducible example to make it easy to help me.

Hendrik
  • 35
  • 1
  • 4
  • A couple posts that should help: https://stackoverflow.com/q/31661704/5325862 https://stackoverflow.com/q/55012776/5325862 It is a difficult problem to search for, but the base `any` or `all` functions should be useful – camille Jun 16 '21 at 18:45
  • Thank you all for the answers. It was the first time for me to write a reproducible example and to receive so many useful replies. – Hendrik Jun 17 '21 at 09:49

3 Answers3

4

Does this work:

library(dplyr)
Df %>% group_by(Candidate.ID) %>% filter(all(Question == 'Yes'))
# A tibble: 3 x 3
# Groups:   Candidate.ID [2]
  Candidate.ID Application.ID Question
         <dbl>          <dbl> <chr>   
1         1042           1040 Yes     
2         1042          13006 Yes     
3         1146           1144 Yes     
Karthik S
  • 11,348
  • 2
  • 11
  • 25
0

An option in base R with subset

  1. Extract the 'Candidate.ID' where 'Question' value "Yes"
  2. Extract the 'Candidate.ID' where 'Question' value is "No"
  3. Get the unique 'Candidate.ID' that have 'Yes' value in 'Question' and not any 'No' by doing the setdiff from 1 and 2
  4. Use that 'Candidate.ID' to create a logical vector using %in% with the whole column of 'Candidate.ID' for subsettting rows with 'Candidate.ID' having only 'Yes' value in 'Question'
subset(Df, Candidate.ID %in% setdiff(Candidate.ID[Question == 'Yes'], 
       Candidate.ID[Question == "No"]))
#  Candidate.ID Application.ID Question
#1         1042           1040      Yes
#2         1042          13006      Yes
#5         1146           1144      Yes
akrun
  • 874,273
  • 37
  • 540
  • 662
0

data.table

df <- data.frame("Candidate ID"   = c(1042,  1042, 1113, 1113, 1146, 1277, 1381, 1381, 1402, 1402),
                 "Application ID" = c(1040, 13006, 1111, 1125, 1144, 1274,   41,   61,   64,   65),
                 "Question"     = c("Yes", "Yes", "No","Yes", "Yes","No", "No", "No","Yes", "No"))



library(data.table)
setDT(df)[, .SD[all(Question == "Yes")], by = Candidate.ID]
#>    Candidate.ID Application.ID Question
#> 1:         1042           1040      Yes
#> 2:         1042          13006      Yes
#> 3:         1146           1144      Yes

Created on 2021-06-16 by the reprex package (v2.0.0)

Yuriy Saraykin
  • 8,390
  • 1
  • 7
  • 14