1

I would like to extract rows from a data frame on the basis of values from two (or more) columns, where the columns contain the same character strings. For example, in the data frame below, I would like to extract all the rows that contain “jam” and “fish”.

This is sort of close, but it focuses on a single column rather than selecting columns on 2 or more columns.

df <- structure(list(num = 1:4, 
                     term_1 = c("jam", "bananna", "fish", 
                                           "carrot"), 
                     term_2 = c("fish", "jam", "apple", "halva"), 
                     term_3 = c("halva", "fish", "carrot", "fish")), 
                row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

df

#>   num  term_1 term_2 term_3
#> 1   1     jam   fish  halva
#> 2   2 bananna    jam   fish
#> 3   3    fish  apple carrot
#> 4   4  carrot  halva   fish

I thought something like this might work, but it doesn't produce what I'm looking for.

df %>% 
  filter(term_1 == c("jam", "fish") | term_2 == c("jam", "fish") | 
           term_3 == c("jam", "fish"))

I’ve tried also tried some other filter() variants--along with if_all()--and str_select with filter(), but I'm missing some detail. Any direction will be appreciated.

Created on 2021-07-12 by the reprex package (v2.0.0)

Created on 2021-07-12 by the reprex package (v2.0.0)

avgoustisw
  • 213
  • 1
  • 7

3 Answers3

2

We could use

library(dplyr)
df %>%
    filter(if_any(-num, ~ . %in% "jam") & 
       if_any(-num, ~ . %in% "fish"))
# A tibble: 2 x 4
    num term_1  term_2 term_3
  <int> <chr>   <chr>  <chr> 
1     1 jam     fish   halva 
2     2 bananna jam    fish  
akrun
  • 874,273
  • 37
  • 540
  • 662
1

You can also use rowSums:

df%>%filter(rowSums(across(-num, ~.%in% c("fish", 'jam')))>=2)

# A tibble: 2 x 4
    num term_1  term_2 term_3
  <int> <chr>   <chr>  <chr> 
1     1 jam     fish   halva 
2     2 bananna jam    fish  
GuedesBF
  • 8,409
  • 5
  • 19
  • 37
  • Thanks @GuedesBF--can you explain what the -num and ~. arguments are doing? – avgoustisw Jul 13 '21 at 00:25
  • 1
    `-num` is telling `across()` to apply the lambda function (~...) to all columns BUT 'num'. The tilde (~) is shorform for the lambda function. The dot(.) is a placeholder for all the data that comes from the left side of the pipe (`%>%`), or `df`in this case. In this code, the dot can be replaced with `df`. In `purrr` and `dplyr`, `~x+1` (lambda function) is the same as`function(x) x+1`. You should check the `tidyverse`, it is awesome. I strongly recommend you start with dplyr, than purrr. https://dplyr.tidyverse.org/reference/across.html – GuedesBF Jul 13 '21 at 01:16
  • This will also select rows with values "fish", "fish" or "jam" , "jam" – Ronak Shah Jul 13 '21 at 06:47
  • Yes, you are right, Ronak. I suspect it is not an issue with the OPs data. But I agree `condition 1 & condition 2` is the safest way to go here. – GuedesBF Jul 13 '21 at 15:10
1

You can use rowSums as -

df[rowSums(df[-1] == 'fish') > 0 & rowSums(df[-1] == 'jam') > 0, ]

#   num term_1  term_2 term_3
#  <int> <chr>   <chr>  <chr> 
#1     1 jam     fish   halva 
#2     2 bananna jam    fish  
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213