I have a data frame (a tibble, actually) df, with two columns, a and b, and I want to keep only the rows in which a is a substring of b. I've tried

df %>%
  dplyr::filter(grepl(a,b))

but I get a warning that seems to indicate that R is actually applying grepl with the first argument being the whole column a.

Is there any way to apply a regular expression involving two different columns to each row in a tibble (or data frame)?

juan
Daniel Miller
  • It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used for testing. But `grepl` isn't vectorized over the pattern. Perhaps use some `map/Map/mapply` function to help with that. – MrFlick Jul 13 '17 at 21:13
  • `I get a warning that seems to indicate that R is actually applying grepl with the first argument being the whole column` Actually in this case only the first element is used, not the whole column. – Scarabee Jul 13 '17 at 22:19
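
The two comments above can be illustrated with a small sketch. The data frame below is made up for illustration (it is not the asker's data), and the variable names `first_only` and `pairwise` are hypothetical. It compares what `grepl` effectively does when only the first pattern is used with a pairwise application through `mapply`:

```r
# Hypothetical sample data, not taken from the question
df <- data.frame(a = c("b", "x", "e"),
                 b = c("db", "ab", "ge"),
                 stringsAsFactors = FALSE)

# Only the first pattern ("b") is tested against every element of b
first_only <- grepl(df$a[1], df$b)

# mapply() pairs a[i] with b[i], so each row uses its own pattern
pairwise <- mapply(grepl, df$a, df$b, USE.NAMES = FALSE)

# Keep the rows where a[i] matches inside b[i]
df[pairwise, ]
```

Here `first_only` flags the second row because "ab" contains "b", even though that row's own pattern is "x"; `pairwise` does not.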

3 Answers

If you're only interested in by-row comparisons, you can use rowwise():

library(dplyr)

df <- data.frame(A = letters[1:5],
                 B = paste0(letters[3:7], letters[c(2, 2, 4, 3, 5)]),
                 stringsAsFactors = FALSE)

df %>% 
   rowwise() %>% 
   filter(grepl(A,B))

       A      B
1      b     db
2      e     ge

---------------------------------------------------------------------------------
If you want to know whether each row's entry of A occurs in any element of B:

df %>% rowwise() %>% filter(any(grepl(A,df$B)))

      A     B
1     b    db
2     c    ed
3     d    fc
4     e    ge
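
For comparison, the same "A occurs anywhere in B" check could probably also be written without `rowwise()`. This is only a sketch in base R, reusing the example data from this answer; the name `in_any_b` is introduced here for illustration:

```r
df <- data.frame(A = letters[1:5],
                 B = paste0(letters[3:7], letters[c(2, 2, 4, 3, 5)]),
                 stringsAsFactors = FALSE)

# For each pattern in A, test it against the whole column B
# and record whether at least one element matches
in_any_b <- vapply(df$A, function(p) any(grepl(p, df$B)),
                   logical(1), USE.NAMES = FALSE)

df[in_any_b, ]
```

Note that the row names in the result are the original row indices (2 to 5), unlike the renumbered dplyr output.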
CPak

Or using base R sapply and @Chi-Pak's reproducible example

df <- data.frame(A = letters[1:5],
                 B = paste0(letters[3:7], letters[c(2, 2, 4, 3, 5)]),
                 stringsAsFactors = FALSE)

# Test each row's A against the same row's B
matched <- sapply(seq_len(nrow(df)), function(i) grepl(df$A[i], df$B[i]))

df[matched, ]

Result

  A  B
2 b db
5 e ge
Damian

You can use stringr::str_detect, which is vectorised over both string and pattern. (Whereas, as you noted, grepl is only vectorised over its string argument.)

Using @Chi Pak's example:

library(dplyr)
library(stringr)

df %>% 
  filter(str_detect(B, fixed(A)))
#   A  B
# 1 b db
# 2 e ge
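
The `fixed()` wrapper matters when `A` may contain regex metacharacters, since the question is about substrings rather than patterns. A minimal sketch with made-up data (the data frame `df2` and the names `as_regex` and `as_literal` are hypothetical):

```r
library(stringr)

# Hypothetical data: "." is a regex metacharacter that matches any character
df2 <- data.frame(A = c("a.c", "x"),
                  B = c("abc", "xyz"),
                  stringsAsFactors = FALSE)

# Default: A is interpreted as a regular expression, so "a.c" matches "abc"
as_regex <- str_detect(df2$B, df2$A)

# fixed(): A is compared literally, so "a.c" is not found in "abc"
as_literal <- str_detect(df2$B, fixed(df2$A))

df2[as_literal, ]
```

With the letters-only example above both forms give the same rows; they differ only once a pattern contains characters such as `.`, `*` or `(`.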
Scarabee