Get rows where the column has either one or both of the strings in R

Question

I have a column that has a list of items like this

Fruit
Apple
Apple, Orange
Kiwi, Orange, Apple 
Kiwi

I want to get the rows that contain Apple, Orange, or both. I'm not sure how to do it. I've tried `str_detect` and `filter`, but neither has worked so far, so any other advice would be appreciated.

Extension to this question: let's say I'm getting the rows with apple and orange, but it's also giving me the rows with pineapple, since "apple" is in that word as well. How do I prevent this? – user14262341, Nov 26 '20 at 22:42
You can do that with the `grepl()` answer (below) and a more explicit regex: put `(\\s|^)` in front of the expression. `\\s` matches a whitespace character (such as " ") and `^` matches the start of the string. So `(\\s|^)([Aa]pple|[Oo]range)` selects every string in which "apple" or "orange" either follows whitespace or sits at the beginning of the string. Finally, you can write `df[grepl("(\\s|^)([Aa]pple|[Oo]range)", df$fruits),, drop=FALSE]`. – DoRemy95, Nov 27 '20 at 07:29
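
A related sketch, not taken from the thread: a word boundary (`\\b`) on both sides of the word is another way to keep "apple" from matching inside "pineapple". The sample vector below is hypothetical and only there to show the false-positive case; `perl = TRUE` is an assumption made so that `\\b` is handled by the PCRE engine.

```r
# Illustrative data (hypothetical; "Pineapple" shows the false-positive case)
fruits <- c("Apple", "Apple, Orange", "Kiwi, Pineapple", "orange, Kiwi", "Kiwi")

# \\b is a word boundary, so "apple" inside "Pineapple" does not match
pattern <- "\\b(apple|orange)\\b"
hits <- grepl(pattern, fruits, ignore.case = TRUE, perl = TRUE)

# Elements 1, 2 and 4 are kept; "Kiwi, Pineapple" and "Kiwi" are dropped
fruits[hits]
```

Unlike `(\\s|^)`, the trailing `\\b` also rejects longer words that merely start with the target, such as "Apples".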

Karthik S · Accepted Answer · 2020-11-26T18:08:45.567

1

Does this work?

library(dplyr)
library(stringr)
df %>% filter(str_detect(Fruit, 'Apple|Orange'))
# A tibble: 3 x 1
  Fruit              
  <chr>              
1 Apple              
2 Apple, Orange      
3 Kiwi, Orange, Apple

Data used:

df
# A tibble: 4 x 1
  Fruit              
  <chr>              
1 Apple              
2 Apple, Orange      
3 Kiwi, Orange, Apple
4 Kiwi
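
The answer only shows `df` as printed output, so here is a minimal sketch of how it could be rebuilt to reproduce the result. The construction with `tibble()` is an assumption; the values are copied from the printed tibble above.

```r
library(dplyr)
library(stringr)

# Rebuild the four-row example (values copied from the printed tibble)
df <- tibble(Fruit = c("Apple", "Apple, Orange", "Kiwi, Orange, Apple", "Kiwi"))

# Keep rows whose Fruit value contains "Apple" or "Orange" (or both);
# "|" in the pattern means "either side may match"
result <- df %>% filter(str_detect(Fruit, "Apple|Orange"))
result
```

The order of the words inside each cell does not matter here, because `str_detect()` only checks whether the pattern occurs anywhere in the string.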

edited Nov 26 '20 at 18:08

answered Nov 26 '20 at 17:54


This works, but what if I also wanted to print out row 3, since it does contain the words apple and orange? – user14262341 Nov 26 '20 at 18:04
@user14262341, have updated my answer, please check if it works for you. – Karthik S Nov 26 '20 at 18:05
Ah yes, but you are also missing apple, so rows 1 through 3 should show up. – user14262341 Nov 26 '20 at 18:07
@user14262341, so you need rows that either contain Apple or Orange or Both? Have updated my answer. Please check now. – Karthik S Nov 26 '20 at 18:07
You can write `df %>% filter(str_detect(Fruit, "(\\s|^)([Aa]pple|[Oo]range)"))` which does the trick. See comment to your question above – DoRemy95 Nov 27 '20 at 07:36

DoRemy95 · Answer 2 · 2020-11-26T19:38:46.507

0

Personally, I like using grepl() for these kinds of problems. You can play around with the regex to select rows. (See example here)

df <- data.frame(fruits = c("Apple", "Apple, Orange", "Kiwi, Apple", "Kiwi"))

Visualization of df:

| id | fruits        |
|----|---------------|
| 1  | Apple         |
| 2  | Apple, Orange |
| 3  | Kiwi, Apple   |
| 4  | Kiwi          |

Then you can write:

df_only_apples <- df[grepl("[Aa]pple", df$fruits),, drop=FALSE]

which will give you

| id | fruits        | 
|----|---------------|
| 1  | Apple         | 
| 2  | Apple, Orange |
| 3  | Kiwi, Apple   |

But if you want to select rows that contain "Apple" or "Orange" (or both), you could just write df[grepl("([Aa]pple|[Oo]range)", df$fruits),, drop=FALSE]

edited Nov 26 '20 at 19:38

answered Nov 26 '20 at 18:30


Yeah, for some reason this doesn't work if the order isn't the same? For example, if it had been orange, kiwi, apple, this does not work. – user14262341 Nov 26 '20 at 22:43
The order of the words in a sentence should not affect the results from grepl(). I also tested the regex `([Aa]pple|[Oo]range)` which works fine now. – DoRemy95 Nov 27 '20 at 07:16

score 0 · Answer 3 · answered Nov 26 '20 at 21:01

We can also split the column and use %in%

library(dplyr)
library(tidyr)
df %>% 
    mutate(rn = row_number()) %>% 
    separate_rows(fruits) %>%
    group_by(rn) %>% 
    filter(any(c('Apple', 'Orange') %in% fruits)) %>% 
    summarise(fruits = toString(fruits), .groups = 'drop') %>% 
    select(-rn)
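
For comparison, a base R sketch of the same split-and-match idea, assuming the items are separated by commas. Because each item is compared as a whole string, "Pineapple" is not mistaken for "Apple". The data frame below is hypothetical (the "Pineapple" row is added for illustration), and `wanted`, `tokens` and `keep` are names chosen for this example.

```r
# Hypothetical data; the "Pineapple" row illustrates exact-item matching
df <- data.frame(
  fruits = c("Apple", "Apple, Orange", "Kiwi, Pineapple", "Kiwi"),
  stringsAsFactors = FALSE
)
wanted <- c("Apple", "Orange")

# Split each cell on commas (plus optional spaces) into its items
tokens <- strsplit(df$fruits, ",\\s*")

# A row is kept if at least one of its items is exactly one of the wanted values
keep <- vapply(tokens, function(x) any(trimws(x) %in% wanted), logical(1))

df[keep, , drop = FALSE]
```

Note that `%in%` is case-sensitive, so "apple" would not match "Apple" in this sketch unless both sides are lower-cased first.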
