177

I am trying to use grep to test whether the strings in one vector are present in another vector, and to output the values that are present (the matching patterns).

I have a data frame like this:

FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6

I have a vector of string patterns to be found in the "Letter" column, for example: c("A1", "A9", "A6").

I would like to check whether any of the strings in the pattern vector is present in the "Letter" column. If any are, I would like to output the unique values.

The problem is, I don't know how to use grep with multiple patterns. I tried:

matches <- unique (
    grep("A1| A9 | A6", myfile$Letter, value=TRUE, fixed=TRUE)
)

But it gives me 0 matches, which is not true. Any suggestions?

zx8754
user971102
  • 3
    You can't use `fixed=TRUE` because your pattern is a _true_ regular expression. – Marek Oct 05 '11 at 15:27
  • 6
    Using `match` or `%in%` or even `==` is the *only* correct way to compare exact matches. regex is very dangerous for such a task and can lead to unexpected results. – David Arenburg Sep 12 '16 at 05:34
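
For exact matching along the lines David Arenburg suggests, a minimal sketch might look like this. The vector `letter_col` stands in for `myfile$Letter`, and the extra value "A10" is a made-up addition to show where a regex would over-match:

```r
# Stand-in for myfile$Letter; "A10" is a hypothetical extra value.
letter_col <- c("A1", "A6", "A7", "A1", "A9", "A6", "A10")
to_match   <- c("A1", "A9", "A6")

# Exact matching: keep only values that equal one of the patterns.
exact_hits <- unique(letter_col[letter_col %in% to_match])
print(exact_hits)   # "A1" "A6" "A9"

# Regex matching also picks up "A10", because it contains "A1".
regex_hits <- unique(grep(paste(to_match, collapse = "|"), letter_col, value = TRUE))
print(regex_hits)   # "A1" "A6" "A9" "A10"
```

Whether the regex behaviour is a bug or a feature depends on whether the entries in the pattern vector are meant as literal values or as patterns.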

11 Answers

334

In addition to @Marek's comment about not including fixed=TRUE, you also need to remove the spaces from your regular expression. It should be "A1|A9|A6".

You also mention that there are lots of patterns. Assuming that they are in a vector

toMatch <- c("A1", "A9", "A6")

Then you can create your regular expression directly using paste and collapse = "|".

matches <- unique (grep(paste(toMatch,collapse="|"), 
                        myfile$Letter, value=TRUE))
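
Put together as a self-contained sketch (the data frame below is rebuilt by hand from the table in the question, so treat it as an assumption about what `myfile` holds):

```r
# Sample data reconstructed from the question.
myfile <- data.frame(
  FirstName = c("Alex", "Alex", "Alex", "Bob", "Chris", "Chris"),
  Letter    = c("A1", "A6", "A7", "A1", "A9", "A6"),
  stringsAsFactors = FALSE
)
toMatch <- c("A1", "A9", "A6")

# Join the patterns into a single alternation: "A1|A9|A6".
pattern <- paste(toMatch, collapse = "|")

# value = TRUE returns the matching strings instead of their positions.
matches <- unique(grep(pattern, myfile$Letter, value = TRUE))
print(matches)   # "A1" "A6" "A9"
```

The result follows the order in which values first appear in the column, not the order of `toMatch`.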
Henrik
Brian Diggs
  • 1
    Any way to do this when your list of strings includes regex operators as punctuation? – user124123 Jan 27 '15 at 17:10
  • @user1987097 It should work the same way, with or without any other regex operators. Did you have a specific example this didn't work for? – Brian Diggs Feb 04 '15 at 18:26
  • @user1987097 use 2 backslashes before a dot or bracket. The first backslash is an escape character to interpret the second one, which is needed to disable the operator. – mbh86 Mar 11 '16 at 14:48
  • 3
    Using regex for exact matches seems dangerous to me and can have unexpected results. Why not just `toMatch %in% myfile$Letter`? – David Arenburg Sep 12 '16 at 05:30
  • @user4050 No specific reason. The version in the question had it and I probably just carried it through without thinking about whether it was necessary. – Brian Diggs Jun 01 '17 at 03:16
  • This method also works for matching multiple patterns not in a data frame, but within a character vector. – Momchill Nov 30 '20 at 15:55
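
On the question of patterns that contain regex operators, here is a small sketch of the difference escaping makes. The strings "a.b" and "c+d" are invented for illustration; the last variant avoids escaping altogether by matching each string literally with `fixed = TRUE` and OR-ing the results, which is one possible approach rather than the one the answer uses:

```r
# Hypothetical values containing regex metacharacters.
x        <- c("a.b", "axb", "c+d", "ccd")
to_match <- c("a.b", "c+d")

# Unescaped: "." matches any character and "+" is a quantifier.
naive <- grepl(paste(to_match, collapse = "|"), x)
print(x[naive])     # "a.b" "axb" "ccd"

# Escaped by hand with two backslashes, as the comment above describes.
escaped <- grepl("a\\.b|c\\+d", x)
print(x[escaped])   # "a.b" "c+d"

# Alternative: literal matching per string, combined with a logical OR.
literal <- Reduce(`|`, lapply(to_match, function(p) grepl(p, x, fixed = TRUE)))
print(x[literal])   # "a.b" "c+d"
```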
44

Good answers, however don't forget about filter() from dplyr:

patterns <- c("A1", "A9", "A6")
>your_df
  FirstName Letter
1      Alex     A1
2      Alex     A6
3      Alex     A7
4       Bob     A1
5     Chris     A9
6     Chris     A6

result <- filter(your_df, grepl(paste(patterns, collapse="|"), Letter))

>result
  FirstName Letter
1      Alex     A1
2      Alex     A6
3       Bob     A1
4     Chris     A9
5     Chris     A6
Adamm
  • 3
    I think that `grepl` works with one pattern at a time (it needs a vector of length 1). We have 3 patterns (a vector of length 3), so we can combine them into one using a separator that `grepl` understands, `|`. Try your luck with others :) – Adamm Feb 23 '18 at 09:16
  • 3
    Oh, I get it now. So it's a compressed way to output something like A1 | A2, so if one wanted all conditions then the collapse would be with an & sign. Cool, thanks. – Ahdee Feb 23 '18 at 15:41
  • 1
    Hi, using `)|(` to separate patterns might make this more robust: `paste0("(", paste(patterns, collapse=")|("),")")`. Unfortunately it also becomes slightly less elegant. This results in the pattern `(A1)|(A9)|(A6)`. – fabern Jul 09 '19 at 16:09
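
A note on the "&" idea in the comments: in a regular expression `&` is just a literal character, so collapsing with it does not give AND semantics. A minimal sketch of one way to require every pattern, using a made-up `cells` vector where some entries hold several values:

```r
patterns <- c("A1", "A9")
# Hypothetical cells; some contain more than one value.
cells <- c("A1", "A1,A9", "A9,A1", "A7")

# OR: at least one pattern occurs in the cell.
any_hit <- grepl(paste(patterns, collapse = "|"), cells)
print(any_hit)   # TRUE TRUE TRUE FALSE

# Grouped form from the comment above; same matches for these patterns.
grouped <- paste0("(", paste(patterns, collapse = ")|("), ")")
print(grouped)   # "(A1)|(A9)"

# Collapsing with "&" looks for the literal text "A1&A9" and finds nothing.
amp_hit <- grepl(paste(patterns, collapse = "&"), cells)
print(amp_hit)   # FALSE FALSE FALSE FALSE

# AND: run grepl once per pattern and combine the logical vectors.
all_hit <- Reduce(`&`, lapply(patterns, grepl, x = cells))
print(all_hit)   # FALSE TRUE TRUE FALSE
```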
41

This should work:

grep(pattern = 'A1|A9|A6', x = myfile$Letter)

Or even more simply:

library(data.table)
myfile$Letter %like% 'A1|A9|A6'
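
For comparison, a base R sketch of what the two calls return. `grep` gives positions, while `grepl` gives a logical vector, which as far as I know is also what `%like%` produces (the `letter` vector is a stand-in for `myfile$Letter`):

```r
# Stand-in for myfile$Letter.
letter <- c("A1", "A6", "A7", "A1", "A9", "A6")

# grep() returns the integer positions of the matching elements.
idx <- grep("A1|A9|A6", letter)
print(idx)    # 1 2 4 5 6

# grepl() returns one TRUE/FALSE per element.
flag <- grepl("A1|A9|A6", letter)
print(flag)   # TRUE TRUE FALSE TRUE TRUE TRUE

# Either form can be used to subset.
print(unique(letter[flag]))   # "A1" "A6" "A9"
```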
petermeissner
BOC
  • 13
    `%like%` isn't in base R, so you should mention what package(s) are needed to use it. – Gregor Thomas Nov 01 '18 at 16:39
  • 2
    For others looking at this answer, `%like%` is part of the `data.table` package. Also similar in `data.table` are `like(...)`, `%ilike%`, and `%flike%`. – steveb May 05 '20 at 15:35
10

Based on Brian Diggs's post, here are two helpful functions for filtering lists:

#Returns all items in a list that are not contained in toMatch
#toMatch can be a single item or a list of items
exclude <- function (theList, toMatch){
  return(setdiff(theList,include(theList,toMatch)))
}

#Returns all items in a list that ARE contained in toMatch
#toMatch can be a single item or a list of items
include <- function (theList, toMatch){
  matches <- unique (grep(paste(toMatch,collapse="|"), 
                          theList, value=TRUE))
  return(matches)
}
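
A quick usage sketch of the two helpers, with both definitions repeated so it runs on its own (the `items` vector is made up from the question's data):

```r
# Items matching any pattern in toMatch, without duplicates.
include <- function(theList, toMatch) {
  unique(grep(paste(toMatch, collapse = "|"), theList, value = TRUE))
}

# Items matching none of the patterns in toMatch.
exclude <- function(theList, toMatch) {
  setdiff(theList, include(theList, toMatch))
}

items <- c("A1", "A6", "A7", "A1", "A9", "A6")

kept    <- include(items, c("A1", "A9"))
dropped <- exclude(items, c("A1", "A9"))
print(kept)      # "A1" "A9"
print(dropped)   # "A6" "A7"
```

Note that both helpers return unique values, so duplicates in the input are collapsed.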
Austin
6

Have you tried the match() or charmatch() functions?

Example use:

match(c("A1", "A9", "A6"), myfile$Letter)
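
As a sketch of what `match()` gives back: it returns the position of the first exact match for each value, and NA when there is none. The value "A8" is added here only to show the NA case, and `letter` is a stand-in for `myfile$Letter`:

```r
# Stand-in for myfile$Letter.
letter <- c("A1", "A6", "A7", "A1", "A9", "A6")

# Position of the first exact match for each value, NA if absent.
pos <- match(c("A1", "A9", "A6", "A8"), letter)
print(pos)     # 1 5 2 NA

# Drop the NAs to recover the values that are actually present.
found <- letter[pos[!is.na(pos)]]
print(found)   # "A1" "A9" "A6"
```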
dwitvliet
user3877096
  • 4
    One thing to note with `match` is that it is not using patterns, it is expecting an exact match. – steveb May 05 '20 at 15:39
5

To add to Brian Diggs's answer:

Another way, using grepl, returns the rows of the data frame that contain any of your values.

toMatch <- c("A1", "A9", "A6")

matches <- myfile[grepl(paste(toMatch, collapse="|"), myfile$Letter), ]

matches

Letter Firstname
1     A1      Alex 
2     A6      Alex 
4     A1       Bob 
5     A9     Chris 
6     A6     Chris

Maybe a bit cleaner... maybe?

DryLabRebel
4

Not sure whether this answer has already appeared...

For the particular pattern in the question, you can just do it with a single grep() call,

grep("A[169]", myfile$Letter)
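
A small sketch of how the character class behaves, with and without anchors. The values "A10" and "BA9" are hypothetical and only there to show what the unanchored pattern also accepts:

```r
# Hypothetical values; "A10" and "BA9" are not in the question's data.
x <- c("A1", "A10", "A6", "BA9")

# Unanchored: matches anywhere inside the string.
loose <- grepl("A[169]", x)
print(loose)    # TRUE TRUE TRUE TRUE

# Anchored with ^ and $: the whole string must be A1, A6 or A9.
strict <- grepl("^A[169]$", x)
print(strict)   # TRUE FALSE TRUE FALSE
```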
BenBarnes
Assaf
2

Using sapply:

patterns <- c("A1", "A9", "A6")
df <- data.frame(name=c("A","Ale","Al","lex","x"),Letters=c("A1","A2","A9","A1","A9"))



   name Letters
1    A      A1
2  Ale      A2
3   Al      A9
4  lex      A1
5    x      A9


 df[unlist(sapply(patterns, grep, df$Letters, USE.NAMES = F)), ]
  name Letters
1    A      A1
4  lex      A1
3   Al      A9
5    x      A9
dondapati
2

Take away the spaces, and drop fixed=TRUE as well, since with it the | is matched literally. So do:

matches <- unique(grep("A1|A9|A6", myfile$Letter, value=TRUE))
Saurabh Chauhan
0

Another option is to wrap the alternation in word boundaries, as in '\\b(A1|A9|A6)\\b'. The boundaries make each alternative match only as a whole word, which comes in handy when a cell holds several values: if Alex had the letters "A7,A1", for example, that row would still be extracted. Here is a reproducible example for both options:

df <- read.table(text="FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex     A7
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df, df[grep('\\b(A1|A9|A6)\\b', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6

df2 <- read.table(text="FirstName Letter   
Alex      A1
Alex      A6
Alex      A7,A1
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df2
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df2, df2[grep('A1|A9|A6', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6

Created on 2022-07-16 by the reprex package (v2.0.1)

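Please note: in an ordinary R string the backslash has to be doubled, so write \\b. With raw strings, available from R 4.0, you can write the pattern with single backslashes instead, e.g. r"(\b(A1|A9|A6)\b)".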
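
To see what the word boundaries change, here is a short sketch on a made-up vector. "A10" and "BA6" are hypothetical values that contain a pattern without being equal to it:

```r
# Hypothetical cells, including multi-value and near-miss entries.
cells <- c("A1", "A10", "A7,A1", "BA6")

# Plain alternation matches anywhere, so all four cells hit.
plain <- grepl("A1|A9|A6", cells)
print(plain)     # TRUE TRUE TRUE TRUE

# With \\b on both sides each alternative must stand as a whole word.
bounded <- grepl("\\b(A1|A9|A6)\\b", cells)
print(bounded)   # TRUE FALSE TRUE FALSE
```

The comma in "A7,A1" counts as a non-word character, which is why that cell still matches.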

Quinten
-1

I suggest writing a little script and doing multiple searches with Grep. I've never found a way to search for multiple patterns, and believe me, I've looked!

Like so, your shell file, with an embedded string:

 #!/bin/bash 
 grep A6 <<< "Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6";
 grep A7 <<< "Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6";
 grep A8 <<< "Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6";

Then run by typing ./myshell.sh.

If you want to be able to pass in the string on the command line, do it like this, with a shell argument--this is bash notation btw:

 #!/bin/bash 
 stringtomatch="${1}";
 grep A6 <<< "${stringtomatch}";
 grep A7 <<< "${stringtomatch}";
 grep A8 <<< "${stringtomatch}";

And so forth.

If there are a lot of patterns to match, you can put it in a for loop.

Jaap
ChrisBean
  • Thank you ChrisBean. The patterns are lots actually, and maybe it would be better to use a file then. I am new to BASH, but maybe something like this should work… #!/bin/bash for i in 'pattern.txt' do echo $i j='grep -c "${i}" myfile.txt' echo $j if [$j -eq o ] then echo $i >> matches.txt fi done – user971102 Sep 29 '11 at 15:44
  • doesn't work…the error message is '[grep: command not found'…I have grep in the /bin folder, and /bin is on my $PATH…Not sure what is happening…Can you please help? – user971102 Sep 29 '11 at 16:33