grepl("instance|percentage", labelTest$Text)

will return true if any one of instance or percentage is present.

How will I get true only when both the terms are present?

Peter Mortensen
toofrellik
  • how about `grep` once with the "instance" and then do the same with "percentage"? get the replies (as T or F) and combine them ? – amonk May 24 '17 at 08:28
  • See for example https://stackoverflow.com/questions/43803561 – talat May 24 '17 at 08:28
  • i need to populate an excel with this combination, below is code: `labelTest$label[ grep("instance", labelTest$Text)] <- "combination1"` so one with "instance" and other with "percentage" wont work. – toofrellik May 24 '17 at 08:32
  • `labelTest$label[ grep("instance", labelTest$Text) & grep("percentage", labelTest$Text)] <- "combination1"` is what @agerom was suggesting and should work – FlorianGD May 24 '17 at 08:40
  • Above one doesn't work it is behaving as | operator, as well giving below warning: `longer object length is not a multiple of shorter object length` – toofrellik May 24 '17 at 08:43

3 Answers

Text <- c("instance", "percentage", "n", 
          "instance percentage", "percentage instance")

grepl("instance|percentage", Text)
# TRUE  TRUE FALSE  TRUE  TRUE

grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE  TRUE

The latter one works by looking for:

('instance')(any character sequence)('percentage')  
OR  
('percentage')(any character sequence)('instance')

Naturally if you need to find any combination of more than two words, this will get pretty complicated. Then the solution mentioned in the comments would be easier to implement and read.

Another alternative that might be relevant when matching many words is to use positive look-aheads (which can be thought of as 'non-consuming' matches). For this you have to enable Perl-compatible regular expressions with perl=TRUE.

# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element",
           "character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))

# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)",
  Text2, perl=TRUE)

# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) & 
          grepl("percentage", Text2) & 
             grepl("element", Text2) & 
           grepl("character", Text2)

# they produce identical results
identical(longperl, longstrd)

Furthermore, if you have the patterns stored in a vector, you can condense the expressions significantly:

pat <- c("instance", "percentage", "element", "character")

longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
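The `rowSums(... - 1L) == 0L` idiom above is compact but a little opaque. A minimal sketch of a more explicit variant folds the per-pattern results together with `Reduce()`. The three strings in `Text2` here are hand-written stand-ins (an assumption, so the example does not depend on what `sample()` returns in a given R version):

```r
# Hand-written stand-in for Text2 (assumption, not the sampled data above)
Text2 <- c("instance percentage element character n",
           "instance percentage o p n",
           "character element percentage instance p")
pat <- c("instance", "percentage", "element", "character")

# One logical vector per pattern: grepl(pat[i], x = Text2)
hits <- lapply(pat, grepl, x = Text2)

# AND the logical vectors together element-wise
all_present <- Reduce(`&`, hits)

# Compare with the look-ahead version built from the same pattern vector
lookahead <- grepl(paste0("(?=.*", pat, ")", collapse = ""), Text2, perl = TRUE)
identical(all_present, lookahead)
```

Unlike the `sapply()` plus `rowSums()` form, this should also work when the text vector has length one, because no matrix is involved.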

As asked for in the comments, if you want to match exact words, i.e. not match substrings, you can specify word boundaries using \\b. For example:

tx <- c("cent element", "percentage element", "element cent", "element centimetre")

grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE  TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE  TRUE FALSE
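If every word should be matched as a whole word, the word-boundary look-aheads can be generated from a vector in the same way as before. This is only a sketch: the name `words` is introduced here for illustration, and it assumes the words contain no regex metacharacters that would need escaping.

```r
tx <- c("cent element", "percentage element", "element cent", "element centimetre")
words <- c("cent", "element")  # illustrative name, not from the answer

# Wrap each word in \\b ... \\b inside its own positive look-ahead
pattern <- paste0("(?=.*\\b", words, "\\b)", collapse = "")

whole_word <- grepl(pattern, tx, perl = TRUE)
whole_word
# TRUE FALSE  TRUE FALSE
```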
AkselA
  • not sure this works with whole words. For example, replacing `"instance"` with `"table"` also seems to capture cases like `"marketable"`. I tried adding `"\\stable"` to include a space before `"table"` but that doesn't work either. Any suggestions? – val Oct 30 '19 at 14:09
  • @val: If you use `\\b` instead to indicate a word boundary, it should work. – AkselA Oct 30 '19 at 18:10

This returns TRUE only for those elements of the vector labelTest$Text in which both terms occur. I think this is the exact answer to the question, and it is much shorter than the other solutions.

grepl("instance",labelTest$Text) & grepl("percentage",labelTest$Text)
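To connect this to the assignment from the question's comments, here is a minimal sketch. The `labelTest` data frame below is a hypothetical stand-in for the asker's data. It also shows why the `grep(...) & grep(...)` attempt misbehaves: `grep()` returns positions rather than one logical value per element.

```r
# Hypothetical data standing in for the asker's labelTest
labelTest <- data.frame(
  Text = c("instance", "percentage", "n",
           "instance percentage", "percentage instance"),
  label = NA_character_,
  stringsAsFactors = FALSE
)

# grep() returns positions, so `&` would combine index vectors, not rows
grep("instance", labelTest$Text)    # 1 4 5
grep("percentage", labelTest$Text)  # 2 4 5

# grepl() returns one logical per row, so `&` works element-wise
both <- grepl("instance", labelTest$Text) & grepl("percentage", labelTest$Text)
labelTest$label[both] <- "combination1"
labelTest$label
```

Only the last two rows, which contain both terms, receive the label.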

Use intersect() and feed it the result of a grep() call for each word:

library(data.table) #used for subsetting text vector below

vector_of_text[
  intersect(
    grep(vector_of_text , pattern = "pattern1"),
    grep(vector_of_text , pattern = "pattern2")
  )
]
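A runnable sketch of the same idea with concrete values follows. The sample vector and the two search terms are assumptions borrowed from the question. As far as I can tell, plain base R subsetting is enough for this.

```r
vector_of_text <- c("instance", "percentage", "n",
                    "instance percentage", "percentage instance")

# grep() returns the positions of matching elements;
# intersect() keeps the positions common to both searches
idx <- intersect(grep("instance", vector_of_text),
                 grep("percentage", vector_of_text))
idx                  # 4 5
vector_of_text[idx]

# The logical route via grepl() selects the same positions
same <- identical(idx, which(grepl("instance", vector_of_text) &
                             grepl("percentage", vector_of_text)))
```

Because `grep()` yields index vectors of different lengths, they need set operations such as `intersect()`, whereas `&` is only appropriate for the equal-length logical vectors from `grepl()`.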
Das_Geek
  • I am not seeing a use of data.table in here. Can you clarify? Also, I think you are wanting: vector_of_text[ grep(vector_of_text , pattern = "pattern1") & grep(vector_of_text , pattern = "pattern2") ]. No use of intersect() and an &, but we still have the potential problem that a hit will include strings containing the search term (like "instance1" for "instance") – Rick Pack Dec 11 '19 at 12:12