1

I have data that looks like this (approximately 4 million lines):

ID        CODE   DATE
A567001   F11    1/1/2019
A567001   T67    1/1/2019
A567001   P09    1/5/2019
A567001   F11    1/7/2019
A568002   F11    1/9/2019
A568002   A56    1/9/2019
A567891   C45    1/7/2019
A002456   F11    1/10/2019
A002456   H09    1/11/2019
A021324   F11    1/11/2019
A021324   G65    1/10/2019
B125983   F11    1/9/2019
C172749   H76    1/8/2019
...       ...    ...

What I want to do is if a particular ID has 1 or more instances of 'F11' in the "code" column, then I want to keep all instances of that ID (not just the line that has 'F11' in "code"). Otherwise, I want to remove the entire entry from the database and be left with only those IDs that have at least one F11.

In other words, this is the result I would like to have (removed two entries):

ID        CODE   DATE
A567001   F11    1/1/2019
A567001   T67    1/1/2019
A567001   P09    1/5/2019
A567001   F11    1/7/2019
A568002   F11    1/9/2019
A568002   A56    1/9/2019
A002456   F11    1/10/2019
A002456   H09    1/11/2019
A021324   F11    1/11/2019
A021324   G65    1/10/2019
B125983   F11    1/9/2019
...       ...    ...

I am assuming I can use dplyr and tried this command:

placeholder <- mutate(Flag = ifelse(file1$icd10_code == 'F11\\.*',1,0) %>% group_by(file1$new_id) %>% mutate (max_flag = max(flag)))

but get this error:

Error in UseMethod("group_by_") : 
  no applicable method for 'group_by_' applied to an object of class "c('double', 'numeric')"
Melderon
  • 365
  • 1
  • 16

2 Answers2

2

You can use any() to retain cases if any value within a group matches the target value:

library(dplyr)

file1 %>% 
  group_by(ID) %>%
  filter(any(CODE == "F11"))

   ID      CODE  DATE     
   <fct>   <fct> <fct>    
 1 A567001 F11   1/1/2019 
 2 A567001 T67   1/1/2019 
 3 A567001 P09   1/5/2019 
 4 A567001 F11   1/7/2019 
 5 A568002 F11   1/9/2019 
 6 A568002 A56   1/9/2019 
 7 A002456 F11   1/10/2019
 8 A002456 H09   1/11/2019
 9 A021324 F11   1/11/2019
10 A021324 G65   1/10/2019
11 B125983 F11   1/9/2019 

To keep when CODE begins with F11 you can do:

file1 %>% 
  group_by(ID) %>%
  filter(any(grepl("^F11\\.", CODE)))
Ritchie Sacramento
  • 29,890
  • 4
  • 48
  • 56
0

We can use data.table. Convert to 'data.table' with setDT, grouped by 'ID', check for single TRUE/FALSE with %in% and subset the data.table

library(data.table)
setDT(file1)[, .SD['F11' %in% CODE], ID]

Or for not a fixed match

setDT(file1)[, .SD[any(grepl("F11\\.x", CODE))], ID]

Or with dplyr, use the same logic

library(dplyr)
file1 %>%
    group_by(ID) %>%
    filter('F11' %in% CODE)
# A tibble: 11 x 3
# Groups:   ID [5]
#   ID      CODE  DATE     
#   <chr>   <chr> <chr>    
# 1 A567001 F11   1/1/2019 
# 2 A567001 T67   1/1/2019 
# 3 A567001 P09   1/5/2019 
# 4 A567001 F11   1/7/2019 
# 5 A568002 F11   1/9/2019 
# 6 A568002 A56   1/9/2019 
# 7 A002456 F11   1/10/2019
# 8 A002456 H09   1/11/2019
# 9 A021324 F11   1/11/2019
#10 A021324 G65   1/10/2019
#11 B125983 F11   1/9/2019 

If it is not a complete match

library(stringr)
file1 %>%
    group_by(ID) %>%
    filter(any(str_detect(CODE, "F11\\.x")))

Or using base R

subset(file1, ave(CODE == 'F11', ID, FUN = any))

data

file1 <- structure(list(ID = c("A567001", "A567001", "A567001", "A567001", 
"A568002", "A568002", "A567891", "A002456", "A002456", "A021324", 
"A021324", "B125983", "C172749"), CODE = c("F11", "T67", "P09", 
"F11", "F11", "A56", "C45", "F11", "H09", "F11", "G65", "F11", 
"H76"), DATE = c("1/1/2019", "1/1/2019", "1/5/2019", "1/7/2019", 
"1/9/2019", "1/9/2019", "1/7/2019", "1/10/2019", "1/11/2019", 
"1/11/2019", "1/10/2019", "1/9/2019", "1/8/2019")), class = "data.frame", row.names = c(NA, 
-13L))
akrun
  • 874,273
  • 37
  • 540
  • 662