
I have a column that contains a list of items, like this:

Fruit
Apple
Apple, Orange
Kiwi, Orange, Apple 
Kiwi

I want to get the rows that contain (Apple, Orange). I'm not sure how to do it. I've tried str_detect and filter, but neither has worked so far, so any advice would be appreciated.

  • An extension to this question: let's say I'm getting the rows with apple and orange, but it's also giving me the rows with pineapple, since "apple" is in that word as well. How do I prevent this? – user14262341 Nov 26 '20 at 22:42
  • You can do that with the `grepl()` answer (below) and an explicit regex. Put `(\\s|^)` in front of the expression: `\\s` matches a whitespace character (such as " ") and `^` matches the start of the string. So `(\\s|^)([Aa]pple|[Oo]range)` selects every string in which "apple" or "orange" is preceded by whitespace or sits at the beginning of the string. Finally, you can write `df[grepl("(\\s|^)([Aa]pple|[Oo]range)", df$fruits), , drop = FALSE]`. – DoRemy95 Nov 27 '20 at 07:29
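
As an alternative to the `(\\s|^)` prefix described in the comment above, a word-boundary anchor can exclude "Pineapple" as well. This is a minimal sketch, not the commenter's code: the vector `fruits` and the extra "Pineapple" entry are made up for illustration, and `perl = TRUE` is used so that `\\b` is handled by the PCRE engine.

```r
# Hypothetical sample data, with "Pineapple" added to show the false positive
fruits <- c("Apple", "Apple, Orange", "Pineapple", "Kiwi, Orange, Apple", "Kiwi")

# \\b marks a word boundary, so "apple" inside "Pineapple" does not match
pattern <- "\\b(apple|orange)\\b"

# ignore.case = TRUE replaces the [Aa] / [Oo] character classes
hits <- grepl(pattern, fruits, ignore.case = TRUE, perl = TRUE)

fruits[hits]
```

Unlike the prefix-only pattern, the trailing `\\b` also rejects longer words such as "Apples", which may or may not be what you want.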

3 Answers


Does this work:

library(dplyr)
library(stringr)
df %>% filter(str_detect(Fruit, 'Apple|Orange'))
# A tibble: 3 x 1
  Fruit              
  <chr>              
1 Apple              
2 Apple, Orange      
3 Kiwi, Orange, Apple

Data used:

df
# A tibble: 4 x 1
  Fruit              
  <chr>              
1 Apple              
2 Apple, Orange      
3 Kiwi, Orange, Apple
4 Kiwi     
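
The pattern 'Apple|Orange' keeps rows that contain either fruit. If the question is read as asking for rows that contain both Apple and Orange, one possible variation is to combine two `str_detect()` conditions. This is a sketch under that assumption, and the name `both` is only illustrative.

```r
library(dplyr)
library(stringr)

# Same data as in the answer
df <- tibble(Fruit = c("Apple", "Apple, Orange", "Kiwi, Orange, Apple", "Kiwi"))

# Multiple conditions inside filter() are combined with AND,
# so a row is kept only when both fruits are present, in any order
both <- df %>%
  filter(str_detect(Fruit, "Apple"), str_detect(Fruit, "Orange"))

both
```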
Karthik S

Personally, I like using grepl() for this kind of problem. You can play around with the regex to select rows. (See example here)

df <- data.frame(list("fruits" = c("Apple", "Apple, Orange", "Kiwi, Apple", "Kiwi")))

Visualization of df:

| id | fruits        | 
|----|---------------|
| 1  | Apple         | 
| 2  | Apple, Orange |
| 3  | Kiwi, Apple   |
| 4  | Kiwi          |

Then you can write:

df_only_apples <- df[grepl("[Aa]pple", df$fruits),, drop=FALSE]

which will give you

| id | fruits        | 
|----|---------------|
| 1  | Apple         | 
| 2  | Apple, Orange |
| 3  | Kiwi, Apple   |

But if you want to select rows that contain "Apple" or "Orange", you could just write df[grepl("([Aa]pple|[Oo]range)", df$fruits), , drop = FALSE]

DoRemy95
  • Yeah, for some reason this doesn't work if the order isn't the same? For example, if it had been "orange, kiwi, apple", this does not work. – user14262341 Nov 26 '20 at 22:43
  • The order of the words in a sentence should not affect the results from grepl(). I also tested the regex `([Aa]pple|[Oo]range)` which works fine now. – DoRemy95 Nov 27 '20 at 07:16
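
The point about word order can be checked quickly. In this sketch the vectors `fruits` and `wanted` are hypothetical, and the alternation pattern is built from `wanted` with `paste()`, which is a convenience not used in the answer itself.

```r
# Hypothetical strings listing the same fruits in different orders
fruits <- c("Orange, Kiwi, Apple", "Apple, Orange", "Kiwi, Apple, Orange", "Kiwi")

# Build the alternation from a vector of wanted items
wanted <- c("Apple", "Orange")
pattern <- paste(wanted, collapse = "|")

# grepl() searches the whole string, so position within it does not matter
matches <- grepl(pattern, fruits)

fruits[matches]
```

Note that this pattern is case-sensitive, so an all-lowercase "orange, kiwi, apple" would only match with `ignore.case = TRUE` or with the `[Aa]pple|[Oo]range` form from the answer.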

We can also split the column and use %in%

library(dplyr)
library(tidyr)
df %>% 
    mutate(rn = row_number()) %>% 
    separate_rows(fruits) %>%
    group_by(rn) %>% 
    filter(any(c('Apple', 'Orange') %in% fruits)) %>% 
    summarise(fruits = toString(fruits), .groups = 'drop') %>% 
    select(-rn)
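
The same split-and-match idea can be sketched in base R, assuming the items are separated by a comma and optional whitespace. The data frame and the names `wanted`, `items`, `keep` and `result` are illustrative, and "Pineapple" is added to show that exact matching with %in% does not pick it up.

```r
# Hypothetical data; stringsAsFactors = FALSE keeps strings as character in older R versions
df <- data.frame(
  fruits = c("Apple", "Apple, Orange", "Kiwi, Apple", "Kiwi", "Pineapple"),
  stringsAsFactors = FALSE
)

wanted <- c("Apple", "Orange")

# Split each row into its individual items
items <- strsplit(df$fruits, ",\\s*")

# Keep a row when at least one wanted item appears as a whole element
keep <- vapply(items, function(x) any(wanted %in% x), logical(1))

result <- df[keep, , drop = FALSE]
result
```

Replacing `any()` with `all()` would instead keep only the rows that contain every item in `wanted`.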
akrun