In R, how do you compare two columns with a regex, row by row?

Question

I have a data frame (a tibble, actually) df, with two columns, a and b, and I want to filter out the rows in which a is a substring of b. I've tried

df %>%
  dplyr::filter(grepl(a,b))

but I get a warning that seems to indicate that R is actually applying grepl with the first argument being the whole column a.

Is there any way to apply a regular expression involving two different columns to each row in a tibble (or data frame)?

It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used for testing. But `grepl` isn't vectorized over the pattern. Perhaps use some `map/Map/mapply` function to help with that. — MrFlick, Jul 13 '17 at 21:13
`I get a warning that seems to indicate that R is actually applying grepl with the first argument being the whole column` Actually in this case only the first element is used, not the whole column. — Scarabee, Jul 13 '17 at 22:19
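The behaviour described in these comments can be reproduced in base R. This is only a minimal sketch: the data frame below is hypothetical sample data (the question gave none), and `mapply` is just one of the options MrFlick lists.

```r
# Hypothetical sample data; the question did not include any
df <- data.frame(a = c("b", "x", "e"),
                 b = c("db", "ed", "ge"),
                 stringsAsFactors = FALSE)

# grepl() is not vectorised over its pattern: only df$a[1] ("b") is used,
# and R warns about the extra elements
first_only <- suppressWarnings(grepl(df$a, df$b))

# mapply() calls grepl() once per (pattern, string) pair, i.e. row by row
per_row <- unname(mapply(grepl, df$a, df$b))

df[per_row, ]
```

Here `first_only` tests every `b` against the single pattern `"b"`, while `per_row` pairs each `a` with the `b` in the same row.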

score 4 · Accepted Answer · answered Jul 13 '17 at 21:13

If you're only interested in by-row comparisons, you can use rowwise():

library(dplyr)

df <- data.frame(A = letters[1:5],
                 B = paste0(letters[3:7], letters[c(2, 2, 4, 3, 5)]),
                 stringsAsFactors = FALSE)

df %>%
  rowwise() %>%
  filter(grepl(A, B))

       A      B
1      b     db
2      e     ge

---------------------------------------------------------------------------------
If you instead want to know whether each row's entry of A appears in any entry of B:

df %>% rowwise() %>% filter(any(grepl(A,df$B)))

      A     B
1     b    db
2     c    ed
3     d    fc
4     e    ge
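For comparison, the same "A appears in any entry of B" check can be sketched in base R without `rowwise()`. Note one assumption that differs from the code above: `fixed = TRUE` is added here so that A is treated as a literal substring rather than a regex; drop it if regex matching is what you want.

```r
df <- data.frame(A = letters[1:5],
                 B = paste0(letters[3:7], letters[c(2, 2, 4, 3, 5)]),
                 stringsAsFactors = FALSE)

# For each value of A, test it against the whole B column and
# keep the row if at least one entry of B contains it
in_any_b <- vapply(df$A,
                   function(p) any(grepl(p, df$B, fixed = TRUE)),
                   logical(1),
                   USE.NAMES = FALSE)

df[in_any_b, ]
```

`vapply` with `logical(1)` guarantees a logical vector of the right length, which makes it a safe row index.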

score 1 · Answer 2 · answered Jul 13 '17 at 21:26

Or using base R's sapply and @Chi-Pak's reproducible example:

df <- data.frame(A = letters[1:5],
                 B = paste0(letters[3:7], letters[c(2, 2, 4, 3, 5)]),
                 stringsAsFactors = FALSE)

# Call grepl() once per row, pairing A[i] with B[i]
matched <- sapply(seq_len(nrow(df)), function(i) grepl(df$A[i], df$B[i]))

df[matched, ]

Result

  A  B
2 b db
5 e ge

score 1 · Answer 3 · Scarabee · 2017-07-13T22:14:49.780

You can use stringr::str_detect, which is vectorised over both string and pattern. (Whereas, as you noted, grepl is only vectorised over its string argument.)

Using @Chi Pak's example:

library(dplyr)
library(stringr)

df %>% 
  filter(str_detect(B, fixed(A)))
#   A  B
# 1 b db
# 2 e ge

edited Jul 13 '17 at 22:14

answered Jul 13 '17 at 22:09

Scarabee