Finding rows for which at least one keyword is partially matched

Question

I have a data frame df with a text column a. I also have a vector of words:

keywords <- c("a", "b", "c")

How can I find all rows of df where at least one of the keywords is contained in df$a?

For example if df$a is:

hj**a**jk
fgfg
re

then the first row should be returned.

I would prefer a solution that uses the dplyr package.

Try `df[sapply(df$a, function(i)grepl(paste0(keywords, collapse = '|'), i)),]` — Sotos, Jun 04 '18 at 14:10
this is what I get: `a<-df[sapply(final_data$customer, function(i) grepl(paste0(Is_med_word, collapse = '|'), i)),] Error in df[sapply(final_data$customer, function(i) grepl(paste0(Is_med_word, : object of type 'closure' is not subsettable` — user3575876, Jun 04 '18 at 14:29
Then I recommend you share a [reproducible example](http://stackoverflow.com/questions/5963269) of your data. However, shouldn't you have `final_data[sapply(...)]` instead of `df[sapply(final_data$....)]`? — Sotos, Jun 04 '18 at 14:30
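
The error in the comment above most likely arises because no data frame named df exists in that session, so R falls back to the built-in function stats::df, and a function (a "closure") cannot be subset with `[`. A minimal base R sketch of the suggested approach follows. It assumes the data frame is called final_data with a column customer, as in the comment; the values and the keywords vector are made up for illustration, and grepl is applied to the whole column at once rather than through sapply.

```r
# Hypothetical data: the names final_data and customer come from the comment
# above, the values are invented for illustration.
final_data <- data.frame(customer = c("hj**a**jk", "fgfg", "re"),
                         stringsAsFactors = FALSE)
keywords <- c("a", "b", "c")

# Collapse the keywords into one alternation pattern: "a|b|c".
pattern <- paste(keywords, collapse = "|")

# grepl() is vectorised over the strings it searches, so sapply() is not
# needed. Index the same data frame that supplies the column.
matched <- final_data[grepl(pattern, final_data$customer), , drop = FALSE]
matched
```

Note that the pattern is treated as a regular expression, so this sketch assumes the keywords contain no regex special characters.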

score 2 · Answer 1 · answered Jun 04 '18 at 14:30

Here are two tidyverse ways. I added an extra entry to your vector to confirm that every keyword is checked, not just the first one.

Since you said this is df$a, I made a tibble df with a as its only column, which fits better with dplyr operations, since they generally work on data frames.

library(tidyverse)

a <- c("hj**a**jk", "fgfg", "re", "rec")
df <- tibble(a = a)
keywords <- c("a", "b", "c")

The more dplyr-like way is to start with the data frame and pipe it into the filtering operation. The catch is that stringr::str_detect is vectorised over both the string and the pattern: given the whole column a and the vector keywords, it pairs them element by element instead of testing every keyword against every row. Adding rowwise() makes the expression evaluate one row at a time, so str_detect compares a single value of a against all the keywords, and any() keeps the row if at least one of them matches.

df %>%
  rowwise() %>%
  filter(str_detect(a, keywords) %>% any())
#> Source: local data frame [2 x 1]
#> Groups: <by row>
#> 
#> # A tibble: 2 x 1
#>   a        
#>   <chr>    
#> 1 hj**a**jk
#> 2 rec
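
As a possible alternative that avoids rowwise(), the keywords can be collapsed into a single regular expression so that str_detect only has to handle one pattern for the whole column. This is a sketch rather than part of the original answer, and it assumes the keywords contain no regex special characters.

```r
library(dplyr)
library(stringr)

df <- tibble(a = c("hj**a**jk", "fgfg", "re", "rec"))
keywords <- c("a", "b", "c")

# Build one alternation pattern, "a|b|c", from the keyword vector.
pattern <- str_c(keywords, collapse = "|")

# With a single pattern, str_detect() returns one logical value per row of a,
# which is exactly what filter() expects.
result <- df %>%
  filter(str_detect(a, pattern))
result
```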

This second way was more intuitive for me, but it fits the dplyr style less well. I mapped over a (the standalone character vector, not the column in df) to check each element for any matches, then used the resulting logical vector as my filtering criterion. Normally dplyr operations are set up so that the value you pipe in is the function's first argument, generally a data frame. Because I was piping in the second argument to filter rather than the first, I specified df as the first argument and used the shorthand . for the second.

a %>%
  map_lgl(~str_detect(., keywords) %>% any()) %>%
  filter(df, .)
#> # A tibble: 2 x 1
#>   a        
#>   <chr>    
#> 1 hj**a**jk
#> 2 rec

Created on 2018-06-04 by the reprex package (v0.2.0).
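
Both approaches above treat each keyword as a regular expression. If the keywords might contain characters such as * or ., one option is to wrap them in stringr's fixed() so they are matched as literal text. The sketch below uses a hypothetical keyword vector chosen to show the difference; it is not taken from the question.

```r
library(dplyr)
library(purrr)
library(stringr)

df <- tibble(a = c("hj**a**jk", "fgfg", "re", "rec"))

# Hypothetical keywords: "**" would be read as regex quantifiers, not as
# two literal asterisks, unless it is matched as fixed text.
keywords <- c("**", "c")

# For each value of a, test it against every keyword as literal text and
# keep the row if at least one keyword is found.
keep <- map_lgl(df$a, ~ any(str_detect(.x, fixed(keywords))))

result <- filter(df, keep)
result
```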

Thank you for this great explanation!! — user3575876, Jun 04 '18 at 15:28
