27

Is it possible to use a grepl argument when referring to a list of values, maybe using the %in% operator? I want to take the data below and if the animal name has "dog" or "cat" in it, I want to return a certain value, say, "keep"; if it doesn't have "dog" or "cat", I want to return "discard".

data <- data.frame(animal = sample(c("cat", "dog", "bird", "doggy", "kittycat"), 50, replace = TRUE))

Now, if I were just to do this by strictly matching values, say, "cat" and "dog", I could use the following approach:

matches <- c("cat","dog")

data$keep <- ifelse(data$animal %in% matches, "Keep", "Discard")
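As a minimal sketch of what exact matching does, the snippet below uses a small fixed vector (the name `animals` is hypothetical and stands in for data$animal so the result is reproducible). Values that merely contain "dog" or "cat" are not matched by %in%:

```r
# Hypothetical fixed vector, used instead of sample() so the result is reproducible
animals <- c("cat", "dog", "bird", "doggy", "kittycat")
matches <- c("cat", "dog")

# %in% tests each element for exact equality with any element of `matches`
exact <- ifelse(animals %in% matches, "Keep", "Discard")
print(exact)
# [1] "Keep"    "Keep"    "Discard" "Discard" "Discard"
```

Here "doggy" and "kittycat" are discarded, which is why a pattern-based match is needed.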

But grep or grepl only uses the first element of the pattern vector:

data$keep <- ifelse(grepl(matches, data$animal), "Keep","Discard")

returns

Warning message:
In grepl(matches, data$animal) :
  argument 'pattern' has length > 1 and only the first element will be used

Note, I saw this thread in my search, but this doesn't appear to work: grep using a character vector with multiple patterns

oguz ismail
Marc Tulla
  • I thought Brian Diggs's answer to the linked question provided the needed code if you left off the `unique`. It's essentially the same as beginneR's answer. – IRTFM Aug 19 '14 at 20:13
  • When you use a function like `sample` without a `set.seed`, it is not considered a reproducible example – David Arenburg Aug 19 '14 at 21:01

3 Answers

34

You can use the "or" operator (|) inside the regular expression passed to grepl.

ifelse(grepl("dog|cat", data$animal), "keep", "discard")
# [1] "keep"    "keep"    "discard" "keep"    "keep"    "keep"    "keep"    "discard"
# [9] "keep"    "keep"    "keep"    "keep"    "keep"    "keep"    "discard" "keep"   
#[17] "discard" "keep"    "keep"    "discard" "keep"    "keep"    "discard" "keep"   
#[25] "keep"    "keep"    "keep"    "keep"    "keep"    "keep"    "keep"    "keep"   
#[33] "keep"    "discard" "keep"    "discard" "keep"    "discard" "keep"    "keep"   
#[41] "keep"    "keep"    "keep"    "keep"    "keep"    "keep"    "keep"    "keep"   
#[49] "keep"    "discard"

The regular expression dog|cat tells the regular expression engine to look for either "dog" or "cat"; grepl then returns TRUE for every element that contains at least one of them.
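To make the alternation concrete, here is a small sketch on a fixed vector (the name `animals` is an assumption for illustration, not part of the original data):

```r
# Hypothetical fixed vector so the output is deterministic
animals <- c("cat", "dog", "bird", "doggy", "kittycat")

# TRUE wherever the element contains "dog" or "cat" anywhere in the string
hits <- grepl("dog|cat", animals)
print(hits)
# [1]  TRUE  TRUE FALSE  TRUE  TRUE

labels <- ifelse(hits, "keep", "discard")
print(labels)
# [1] "keep"    "keep"    "discard" "keep"    "keep"
```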

Rich Scriven
26

Not sure what you tried but this seems to work:

data$keep <- ifelse(grepl(paste(matches, collapse = "|"), data$animal), "Keep","Discard")

Similar to the answer you linked to.

The trick is using the paste:

paste(matches, collapse = "|")
#[1] "cat|dog"

So it creates a regular expression that matches either dog OR cat, and it also works with a long list of patterns without typing each one by hand.
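A short sketch of the same idea with a slightly longer list; the vectors below are made up for illustration, and the sketch assumes the entries contain no regex metacharacters (such as "." or "+"), since paste does no escaping:

```r
# Hypothetical list of substrings to look for
matches <- c("cat", "dog", "bird")

# Join the entries with "|" to form a single alternation pattern
pattern <- paste(matches, collapse = "|")
print(pattern)
# [1] "cat|dog|bird"

# Hypothetical test values
animals <- c("cat", "doggy", "horse", "blackbird")
found <- grepl(pattern, animals)
print(found)
# [1]  TRUE  TRUE FALSE  TRUE
```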

Edit:

In case you are doing this to later on subset the data.frame according to "Keep" and "Discard" entries, you could do this more directly using:

data[grepl(paste(matches, collapse = "|"), data$animal),]

This way, the logical vector (TRUE or FALSE) returned by grepl is used directly as the row index for the subset.
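A hedged, reproducible sketch of that subsetting step follows. The `set.seed(123)` call and the `drop = FALSE` argument are additions made here for illustration: the seed makes the sample repeatable, and `drop = FALSE` keeps the result a data.frame when the frame has only one column.

```r
set.seed(123)  # assumption: a seed is set so the sample is reproducible
data <- data.frame(animal = sample(c("cat", "dog", "bird", "doggy", "kittycat"),
                                   50, replace = TRUE))
matches <- c("cat", "dog")

# Logical vector: TRUE for rows whose animal contains "cat" or "dog"
keep_rows <- grepl(paste(matches, collapse = "|"), data$animal)

# Use the logical vector as a row index; drop = FALSE keeps a data.frame
kept <- data[keep_rows, , drop = FALSE]

# Number of rows kept equals the number of TRUE values
nrow(kept) == sum(keep_rows)
```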

Community
talat
  • Thanks for using the results as indices, solved a long running annoyance of mine! – Alex Mar 02 '15 at 07:12
16

Try to avoid ifelse as much as possible. This, for example, works nicely:

c("Discard", "Keep")[grepl("(dog|cat)", data$animal) + 1]

With set.seed(123) called before creating the data, you will get

##  [1] "Keep"    "Keep"    "Discard" "Keep"    "Keep"    "Keep"    "Discard" "Keep"   
##  [9] "Discard" "Discard" "Keep"    "Discard" "Keep"    "Discard" "Keep"    "Keep"   
## [17] "Keep"    "Keep"    "Keep"    "Keep"    "Keep"    "Keep"    "Keep"    "Keep"   
## [25] "Keep"    "Keep"    "Discard" "Discard" "Keep"    "Keep"    "Keep"    "Keep"   
## [33] "Keep"    "Keep"    "Keep"    "Discard" "Keep"    "Keep"    "Keep"    "Keep"   
## [41] "Keep"    "Discard" "Discard" "Keep"    "Keep"    "Keep"    "Keep"    "Discard"
## [49] "Keep"    "Keep"   
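Why the indexing trick works can be sketched on a small fixed vector (the name `animals` is hypothetical): adding 1 to a logical vector coerces FALSE to 1 and TRUE to 2, and those numbers then pick the first or second label.

```r
# Hypothetical fixed vector so the output is deterministic
animals <- c("cat", "dog", "bird", "doggy", "kittycat")

hits <- grepl("(dog|cat)", animals)

# Logical + 1: FALSE becomes 1, TRUE becomes 2
idx <- hits + 1
print(idx)
# [1] 2 2 1 2 2

# Position 1 is "Discard", position 2 is "Keep"
labels <- c("Discard", "Keep")[idx]
print(labels)
# [1] "Keep"    "Keep"    "Discard" "Keep"    "Keep"
```

The result is the same as ifelse(hits, "Keep", "Discard"), without calling ifelse.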
David Arenburg
  • set.seed(123) to seed the random generator. – Fernando Aug 19 '14 at 21:25
  • @RichardScriven, because the OP provided his data set using `sample`, thus I can't provide an output that he can validate without setting a seed – David Arenburg Aug 19 '14 at 21:31
  • @DavidArenburg, why try to avoid "ifelse"? – Marc Tulla Aug 19 '14 at 21:48
  • @MarcTulla, because `ifelse` (although vectorized) is relatively slow, especially when you nest several `ifelse` statements. Thus, if I can easily avoid it, I prefer to do so. Although I guess it is a personal choice – David Arenburg Aug 19 '14 at 21:52
  • @DavidArenburg, that makes sense, and I have noticed it slowing things down, especially on big datasets. If I do need to do an "ifelse"-type operation, is the preferred approach beginneR's below? Something like: data[grepl(paste(matches, collapse = "|"), data$animal),] – Marc Tulla Aug 19 '14 at 21:56
  • @MarcTulla, beginneR's solution is the best option if you need to subset the data. If you just need to create a `"Discard", "Keep"` vector, I would go with mine :) – David Arenburg Aug 19 '14 at 21:59
  • Makes sense -- I ultimately want to use this vector to subset (though that's beyond the scope of the original question) – Marc Tulla Aug 19 '14 at 22:02
  • Two years later and now I would totally recommend this over anything else. – Rich Scriven Sep 14 '16 at 21:03
  • @DavidArenburg My context is to match col_1 against pattern. If matched, then 'text', if not then col_1. Am I able to get away without using ifelse? – Sweepy Dodo Aug 24 '21 at 14:25
  • @DavidArenburg In response to my previous question: df[ grepl(pattern, col_1, perl = T), col_1 := 'text'] – Sweepy Dodo Sep 01 '21 at 12:00
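The last comment uses data.table's `:=` syntax. For a plain data.frame, a base R sketch of the same "replace where matched, otherwise keep the original" idea might look like this; the data frame `df`, its contents, and the pattern are made up for illustration:

```r
# Hypothetical data; stringsAsFactors = FALSE keeps col_1 as character
df <- data.frame(col_1 = c("cat", "bird", "doggy"), stringsAsFactors = FALSE)
pattern <- "dog|cat"

# Overwrite only the matching elements; non-matching ones are left untouched
df$col_1[grepl(pattern, df$col_1)] <- "text"
print(df$col_1)
# [1] "text" "bird" "text"
```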