21

Suppose I have the following data frame:

User.Id    Tags
34234      imageUploaded,people.jpg,more,comma,separated,stuff
34234      imageUploaded
12345      people.jpg

How might I use grep (or some other tool) to grab only the rows that include both "imageUploaded" and "people.jpg"? In other words, how might I create a subset that includes just the rows containing the strings "imageUploaded" AND "people.jpg", regardless of order?

I have tried:

data.people<-data[grep("imageUploaded|people.jpg",results$Tags),]
data.people<-data[grep("imageUploaded?=people.jpg",results$Tags),]

Is there an AND operator? Or perhaps another way to get the intended result?

oguz ismail
Rob

4 Answers

26

Thanks to this answer, this regex seems to work. You want to use grepl(), which returns a logical vector that you can use to index into your data object. I won't claim to fully understand the inner workings of the regex, but regardless:

x <- c("imageUploaded,people.jpg,more,comma,separated,stuff", "imageUploaded", "people.jpg")

grepl("(?=.*imageUploaded)(?=.*people\\.jpg)", x, perl = TRUE)
#-----
[1]  TRUE FALSE FALSE
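
A short sketch of why the leading `.*` matters, using the same sample vector as above: each `(?=...)` is a zero-width lookahead, so both conditions are checked from the same position without consuming any characters. The variable names `no_dotstar` and `with_dotstar` are made up for illustration.

```r
x <- c("imageUploaded,people.jpg,more,comma,separated,stuff",
       "imageUploaded",
       "people.jpg")

# Without ".*", both lookaheads must match starting at the very same
# position, which is impossible for two different literal strings.
no_dotstar <- grepl("(?=imageUploaded)(?=people\\.jpg)", x, perl = TRUE)

# With ".*", each lookahead may skip ahead to find its string anywhere
# later in the input, so the two conditions are effectively ANDed,
# regardless of the order in which the strings appear.
with_dotstar <- grepl("(?=.*imageUploaded)(?=.*people\\.jpg)", x, perl = TRUE)

print(no_dotstar)
print(with_dotstar)
```

Only the second pattern flags the first element; the first pattern matches nothing.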
Chase
  • I had been playing around with `grep("(?=imageUploaded)(?=people\\.jpg)")` and not getting success, so the secrets appear to be a) the `perl=TRUE`, b) the parens for grouping and c) the leading `.*` after the `?=` – IRTFM Nov 02 '12 at 01:02
  • The `people\\.jpg` didn't work, so I just did `data.people<-data[grepl("(?=.*imageUploaded)(?=.*people)",data$Tags,perl=TRUE),]`. I didn't need to match the .jpg extension, but I'm curious to know how to do that using grep. – Rob Nov 02 '12 at 01:27
  • Is there a way to do an AND pattern that doesn't require perl=TRUE, so that it can be used for list.files etc.? – Jan Stanstrup Aug 24 '16 at 14:16
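
Regarding an AND that does not need `perl=TRUE`: one possible sketch, assuming only two strings have to be combined, is either to AND two plain `grepl()` calls or to spell out both orders in a single extended regex. The `files` vector below is hypothetical and stands in for the result of `list.files()`; the single-pattern form could in principle be passed as a `pattern` argument.

```r
# Hypothetical file names, standing in for the result of list.files()
files <- c("imageUploaded_people.jpg", "imageUploaded.png", "people.jpg")

# Option 1: two fixed-string searches combined with an element-wise AND
keep_two_calls <- grepl("imageUploaded", files, fixed = TRUE) &
  grepl("people.jpg", files, fixed = TRUE)

# Option 2: a single default (non-Perl) regex listing both possible orders
both_orders <- "imageUploaded.*people\\.jpg|people\\.jpg.*imageUploaded"
keep_one_pattern <- grepl(both_orders, files)

files[keep_two_calls]
files[keep_one_pattern]
```

The second option grows quickly with the number of strings, since every ordering has to be listed, so it is mainly practical for two terms.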
15

I love @Chase's answer, and it makes good sense to me, but it can be a bit dangerous to use constructs that one doesn't totally understand.

This answer is meant to reassure anyone who'd like to use @thelatemail's more straightforward approach that it works just as well and is completely competitive speedwise. It's certainly what I'd use in this case. (It's also reassuring that the more sophisticated Perl-compatible-regex pays no performance cost for its power and easy extensibility.)

library(rbenchmark)
x <- paste0(sample(letters, 1e6, replace=TRUE), ## A longer vector of
            sample(letters, 1e6, replace=TRUE)) ## possible matches

## Both methods give identical results
tlm <- grepl("a", x, fixed=TRUE) & grepl("b", x, fixed=TRUE)
pat <- "(?=.*a)(?=.*b)"
Chase <- grepl(pat, x, perl=TRUE)
identical(tlm, Chase)
# [1] TRUE    

## Both methods are similarly fast
benchmark(
    thelatemail = grepl("a", x, fixed=TRUE) & grepl("b", x, fixed=TRUE),
    Chase = grepl(pat, x, perl=TRUE))
#          test replications elapsed relative user.self sys.self
# 2       Chase          100    9.89    1.105      9.80     0.10
# 1 thelatemail          100    8.95    1.000      8.47     0.48
Josh O'Brien
11

For readability's sake, you could just do:

x <- c(
       "imageUploaded,people.jpg,more,comma,separated,stuff",
       "imageUploaded",
       "people.jpg"
       )

xmatches <- intersect(
                      grep("imageUploaded",x,fixed=TRUE),
                      grep("people.jpg",x,fixed=TRUE)
                     )
x[xmatches]
[1] "imageUploaded,people.jpg,more,comma,separated,stuff"
thelatemail
  • Of course the simple answer is just to apply grep twice (once to extract rows with 'imageUploaded' and on that result, extract rows with 'people.jpg'). Your solution does not create a temporary object (which might be large). – Matthew Lundberg Nov 02 '12 at 01:46
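
The two-pass approach mentioned in the comment above might look roughly like this on the question's data frame. This is a sketch: the data frame is rebuilt from the question's sample rows, and `step1` is a made-up name for the intermediate result.

```r
# Sample data rebuilt from the question
data <- data.frame(
  User.Id = c(34234, 34234, 12345),
  Tags = c("imageUploaded,people.jpg,more,comma,separated,stuff",
           "imageUploaded",
           "people.jpg"),
  stringsAsFactors = FALSE
)

# Pass 1: keep rows tagged "imageUploaded"
step1 <- data[grepl("imageUploaded", data$Tags, fixed = TRUE), ]

# Pass 2: from those rows, keep the ones also tagged "people.jpg"
data.people <- step1[grepl("people.jpg", step1$Tags, fixed = TRUE), ]
data.people
```

The second pass only searches the rows that survived the first one, at the cost of holding `step1` in memory.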
1

Below is an alternative to grep using Hadley's stringr::str_detect(). This avoids the need for perl=TRUE (@jan-stanstrup). Additionally, dplyr::filter() returns the matching rows of the data frame itself, so you never need to leave the data frame.

library(stringr)
library(dplyr)
 x <- data.frame(User.Id =c(34234,34234,12345), 
                 Tags=c("imageUploaded,people.jpg,more,comma,separated,stuff",
                        "imageUploaded",
                        "people.jpg"))

 data.people <- x %>% filter(str_detect(Tags,"(?=.*imageUploaded)(?=.*people\\.jpg)"))
 data.people

# returns
#  User.Id                                                Tags
# 1   34234 imageUploaded,people.jpg,more,comma,separated,stuff

This is simpler and works if "people.jpg" always follows "imageUploaded" (str_extract() returns the matched text, or NA where there is no match):

str_extract(x$Tags, "imageUploaded.*people\\.jpg")
Wai Ho Choy
  • Is it possible to pass a vector of strings to match on? How would you match on both elements in the following vector with the AND operator: c("imageUploaded", "people")? – mdb_ftl Nov 12 '20 at 03:51
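
One way to handle a vector of strings, sketched in base R under the assumption that the terms are literal strings rather than regexes: run one `grepl()` per term and AND the results with `Reduce()`. The names `patterns`, `hits`, `all_match` and `lookahead_pat` are illustrative only.

```r
x <- c("imageUploaded,people.jpg,more,comma,separated,stuff",
       "imageUploaded",
       "people.jpg")
patterns <- c("imageUploaded", "people")

# One logical vector per search term, then an element-wise AND over all of them
hits <- lapply(patterns, function(p) grepl(p, x, fixed = TRUE))
all_match <- Reduce(`&`, hits)
x[all_match]

# Alternative: build the lookahead regex from the vector.
# Note: this does not escape regex metacharacters in the terms.
lookahead_pat <- paste0("(?=.*", patterns, ")", collapse = "")
all_match_regex <- grepl(lookahead_pat, x, perl = TRUE)
```

Both variants extend to any number of terms without changing the code.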