I have a large data frame from which I want to extract rows based on a column value. My problem is that grep returns every value that contains the pattern (e.g. it matches "11" when I search for "1"). How do I get exact matches? The example below illustrates the issue: I only want the "metm1" rows, but grep returns all rows, even those that are not exact matches.

## make data

df1 <- data.frame(matrix(, nrow=4, ncol=2))
colnames(df1) <- c("met", "dt1")
df1$met <- c("metm11", "metm1", "metm1", "metm12")
df1$dt1 <- c("0.666", "0.777", "0.99", "0.01")

# make list for grep

mets <- "metm1"

# grep

new_df <- as.data.frame(df1[grep(paste(mets, collapse = "|"), df1$met), ])

oguz ismail
krtbris

3 Answers

You may place ^ and $ anchors around the search term to force an exact match:

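regex <- paste0("^(?:", paste(mets, collapse = "|"), ")$")
# Do not pass fixed=TRUE here: it would treat ^, $ and | as literal characters,
# so the anchored pattern would match nothing.
new_df <- as.data.frame(df1[grep(regex, df1$met), ])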

For reference, the regex pattern being used here is:

^(?:metm1)$
^(?:metm1|metm2|metm3)$   <-- for multiple terms
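
As a quick check, here is a minimal sketch comparing unanchored and anchored matching. It assumes the same met values as in the question; the variable names unanchored, anchored and literal are only illustrative. It also shows why the anchors cannot be combined with fixed = TRUE, which makes grepl treat ^, $ and | as literal characters.

```r
# Same values as the "met" column in the question
met  <- c("metm11", "metm1", "metm1", "metm12")
mets <- "metm1"

# Unanchored: the pattern may match anywhere inside each string
unanchored <- grepl(paste(mets, collapse = "|"), met)

# Anchored: the whole string must equal one of the terms
regex    <- paste0("^(?:", paste(mets, collapse = "|"), ")$")
anchored <- grepl(regex, met)

# fixed = TRUE searches for the literal text "^(?:metm1)$",
# which none of the values contain
literal <- grepl(regex, met, fixed = TRUE)

print(unanchored)  # TRUE for every element
print(anchored)    # TRUE only for the two "metm1" elements
print(literal)     # FALSE for every element
```

If the terms in mets could contain regex metacharacters (for example a dot), they would need to be escaped before being pasted into the pattern; that case is not covered by this sketch.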
Tim Biegeleisen
  • Is there any advantage to using `grep` in this case rather than a more direct `%in%` for exact matches? – MrFlick Apr 08 '21 at 08:02
  • @MrFlick That was in the back of my mind when I posted. You're right, the OP should probably use `%in%` for semantic purposes. In any case, I have updated my answer to use `fixed=TRUE`, which should improve performance by largely turning off the regex engine (which doesn't need to be fully used). – Tim Biegeleisen Apr 08 '21 at 08:03
You can simply use == for an exact match.

df1[df1$met == mets,]
#    met   dt1
#2 metm1 0.777
#3 metm1  0.99

If mets contains more than one element, use %in% instead, as already pointed out in the comments by @MrFlick.

df1[df1$met %in% mets,]
#    met   dt1
#2 metm1 0.777
#3 metm1  0.99
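
The reason for switching to %in% can be seen in a small sketch. It assumes a hypothetical two-element mets vector (not part of the question): == compares element by element and recycles the shorter vector, whereas %in% tests each value for membership in the set.

```r
# Same values as the "met" column in the question
met  <- c("metm11", "metm1", "metm1", "metm12")

# Hypothetical search terms, chosen only for illustration
mets <- c("metm1", "metm12")

# == recycles mets to c("metm1", "metm12", "metm1", "metm12")
# and compares position by position, so row 2 is missed
eq_result <- met == mets

# %in% asks, for each value of met, whether it occurs anywhere in mets
in_result <- met %in% mets

print(eq_result)  # FALSE FALSE  TRUE  TRUE
print(in_result)  # FALSE  TRUE  TRUE  TRUE
```

No warning is raised for the == comparison here because the length of met is a multiple of the length of mets, which makes this mistake easy to overlook.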
GKi
Another solution is to use word-boundary anchors \\b:

df1[grep(paste0("\\b(", paste0(mets, collapse = "|"),")\\b"), df1$met), ]
    met   dt1
2 metm1 0.777
3 metm1  0.99

With dplyr, filter using grepl, which returns a logical vector (TRUE or FALSE per element), whereas grep returns the indices of the matches:

library(dplyr)
df1 %>%
  filter(grepl(paste0("\\b(", paste0(mets, collapse = "|"),")\\b"), met))
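
To make the grep/grepl difference concrete, here is a minimal base-R sketch using the same met values as in the question. The value "metm1.a" is hypothetical and is included only to show a caveat: \\b marks a word boundary, not the start or end of the string, so a term followed by punctuation still matches.

```r
# Same values as the "met" column in the question
met  <- c("metm11", "metm1", "metm1", "metm12")
mets <- "metm1"

pattern <- paste0("\\b(", paste0(mets, collapse = "|"), ")\\b")

idx  <- grep(pattern, met)   # integer positions of the matching elements
keep <- grepl(pattern, met)  # one logical value per element

print(idx)   # 2 3
print(keep)  # FALSE  TRUE  TRUE FALSE

# Hypothetical value: "." is not a word character, so there is a
# word boundary after "metm1" and the pattern matches
extra <- grepl(pattern, "metm1.a")
print(extra) # TRUE
```

Both idx and keep can be used to subset a data frame; if values such as "metm1.a" may occur and must be excluded, the ^ and $ anchors or %in% are the stricter choice.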
Chris Ruehlemann