I'm trying to select groups in a grouped df that contain a specific string on a specific row within each group.

Consider the following df:

df <- data.frame(id = c(rep("id_1", 4),
                        rep("id_2", 4),
                        rep("id_3", 4)),
                 string = c("here",
                            "is", 
                            "some",
                            "text",
                            "here",
                            "is",
                            "other",
                            "text",
                            "there",
                            "are",
                            "final",
                            "texts"))

I want to create a dataframe that contains just the groups that have the word "is" on the second row.

Here is some incorrect code:

desired_df <- df %>% group_by(id) %>% 
        filter(slice(select(., string), 2) %in% "is")

Here is the desired output:

desired_df <- data.frame(id = c(rep("id_1", 4),
                                      rep("id_2", 4)),
                               string = c("here",
                                          "is", 
                                          "some",
                                          "text",
                                          "here",
                                          "is",
                                          "other",
                                          "text"))

I've looked here but this doesn't solve my issue because this finds groups with any occurrence of the specified string.

I could also do some sort of separate code where I identify the ids and then use that to subset the original df, like so:

ids <- df %>% group_by(id) %>% slice(2) %>% filter(string %in% "is") %>% select(id)
desired_df <- df %>% filter(id %in% ids$id)

But I'm wondering if I can do something simpler within a single pipe series.

Help appreciated!

Daniel Yudkin

1 Answer

After grouping by 'id', take the second element of 'string' and test it with %in%, putting "is" on the left-hand side so that each group returns a single TRUE or FALSE. filter() then keeps or drops the whole group based on that value.

library(dplyr)

df %>%
    group_by(id) %>%
    # string[2] is the second row of each group; the test gives one TRUE/FALSE per group
    filter('is' %in% string[2]) %>%
    ungroup()

-output

# A tibble: 8 x 2
#  id    string
#  <chr> <chr> 
#1 id_1  here  
#2 id_1  is    
#3 id_1  some  
#4 id_1  text  
#5 id_2  here  
#6 id_2  is    
#7 id_2  other 
#8 id_2  text  
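
One edge case worth checking is a group with fewer than two rows, where `string[2]` is `NA`. The sketch below uses a made-up data frame (`df_short`, with a hypothetical one-row group `id_4`) to compare the `%in%` test with a plain `==`. It is only an illustration under those assumptions, not part of the original data.

```r
library(dplyr)

# Hypothetical data: id_4 has only one row, so string[2] is NA for that group.
df_short <- data.frame(
  id     = c("id_1", "id_1", "id_3", "id_3", "id_4"),
  string = c("here", "is", "there", "are", "is")
)

# %in% returns FALSE (never NA) when string[2] is NA,
# so the one-row group is dropped.
kept_in <- df_short %>%
  group_by(id) %>%
  filter("is" %in% string[2]) %>%
  ungroup()

# == returns NA for the one-row group; filter() drops rows whose
# condition is NA, so the same groups survive.
kept_eq <- df_short %>%
  group_by(id) %>%
  filter(string[2] == "is") %>%
  ungroup()

unique(as.character(kept_in$id))
unique(as.character(kept_eq$id))
```

Both versions should keep only `id_1` here. The `%in%` form has the small advantage that the per-group condition is always a definite TRUE or FALSE.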
akrun
  • It's not totally clear if OP needs second row to contain _only_ the string "is", but if so, could tighten up the constraints a bit with `df %>% group_by(id) %>% filter(str_detect(string[2], "^is$"))`. – andrew_reece Dec 22 '20 at 19:51
  • @andrew_reece Based on the OP's code `%in% "is"`, it looks to me like a fixed match – akrun Dec 22 '20 at 19:57
  • wouldn't OP's `string %in% 'is'` match iff `string == "is"`? vs your `'is' %in% string[2]` which could also match "is not"? – andrew_reece Dec 22 '20 at 20:13
  • @andrew_reece Both `%in%` and `==` are checking fixed matches and not substring – akrun Dec 22 '20 at 20:16
  • @andrew_reece In this case either `string[2] %in% 'is'` or `'is' %in% string[2]` would give a single TRUE/FALSE value – akrun Dec 22 '20 at 20:18
  • oh, i didn't know that. thanks @akrun. that seems like a funny use of "in", if i think about it. (a bit like saying <= is ==.). coming from python, i'm used to `"is" in "is not" == True`. – andrew_reece Dec 22 '20 at 20:21
  • @andrew_reece If it is a substring your `str_detect` should work and also with the start, end for fixed match – akrun Dec 22 '20 at 20:24
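
The distinction discussed in the comments above can be checked directly in base R. This is a small sketch on a made-up vector `x`. `grepl()` stands in for `str_detect()` from the comments so that the example needs no packages.

```r
# Made-up vector: the second element contains "is" only as a substring.
x <- c("is", "is not")

in_lhs   <- "is" %in% x[2]                  # exact match, "is" on the left
in_rhs   <- x[2] %in% "is"                  # exact match, "is" on the right
equal    <- x[2] == "is"                    # exact match
substr   <- grepl("is", x[2], fixed = TRUE) # substring match
anchored <- grepl("^is$", x[2])             # regex anchored at both ends

c(in_lhs, in_rhs, equal, substr, anchored)
# FALSE FALSE FALSE  TRUE FALSE
```

Only the unanchored substring test matches "is not". `%in%` and `==` both compare whole strings, so they agree with the anchored pattern `"^is$"`.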