
I'm trying to detect terms using grepl, and I'm getting too many false positives. I was hoping there might be a way to require two successful matches of any term off the list (I have manual coding for a segment of my data and am trying to get the automation to at least roughly correspond to this, but I have about 5 times as many positives as I did with manual coding). I didn't see grepl as taking any argument requiring more than one match to trigger TRUE. Is there any way of requiring two matches to trigger a TRUE finding? Or is there some other function I should be using?

GenericColumn <- cbind(grepl(Genericpattern, Statement$Statement.Text, ignore.case = TRUE))

EDIT:

Here is a more concrete example:

Examplepattern <- 'apple|orange'
ExampleColumn <- cbind(grepl(Examplepattern, Rexample$Statement.Text, ignore.case = TRUE)) 

As is now, all of these will trigger true with grepl. I would only like the items with two references to trigger true.

Example data:

Rexample <- structure(list(Statement.Text = structure(c(2L, 1L, 3L, 5L, 4L
), .Label = c("This apple is a test about an apple.", "This is a test about apples.", 
"This orange is a test about apples.", "This orange is a test about oranges.", 
"This orange is a test."), class = "factor")), .Names = "Statement.Text", row.names = c(NA, 
5L), class = "data.frame")

Desired output, in the row order of Rexample: FALSE, TRUE, TRUE, FALSE, TRUE

Matthew
  • I'm not sure what the best sort of concrete example is; the actual code doesn't have a ton of information in it. I have a pattern made by `Genericpattern <- paste(Genericlist, sep = " ", collapse = '|')`. This has about 50 terms in it, which I'm running through a CSV column made up of web-scraped text. I'm basically just trying to figure out if there's a way of increasing accuracy, because a single match seems too sensitive a metric. I was hoping to require two matches (either apples and oranges, or apples and apples, so either two different terms or the same term twice). – Matthew Mar 31 '16 at 03:59
  • See [how to write a great reproducible R example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) so we can help you. Really, you should probably be parsing with an HTML parser, not regex. `rvest::html_nodes` should do the trick, paired with CSS or XPath selectors. – alistaire Mar 31 '16 at 04:27
  • I already have the data in a csv, so it's already off the website into a manipulable file - I'm just trying to match terms twice rather than once right now. I will follow that link to see if I can come up with a reproducible example, though it's not so much an error issue as a feature issue. – Matthew Mar 31 '16 at 05:38
  • @alistaire I have added a concrete example (I think) to the initial post in an edit. – Matthew Mar 31 '16 at 05:52

2 Answers


You can try a regex that explicitly looks for the pattern again, like (?:apple|orange).*(?:apple|orange)

(pattern <- paste0("(?:", Examplepattern, ")", ".*", "(?:", Examplepattern, ")"))
#[1] "(?:apple|orange).*(?:apple|orange)"


grepl(pattern, Rexample$Statement.Text, ignore.case = TRUE, perl = TRUE)
#[1] FALSE  TRUE  TRUE FALSE  TRUE
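
If a single regex feels awkward, another option is to count the matches in each string and compare the count to a threshold. The following is only a sketch under the assumption that base R's gregexpr and regmatches are acceptable here; the names statements and n_matches are made up for illustration, and the data is the question's example rewritten as a plain character vector.

```r
# Example strings from the question, in row order (illustrative copy)
statements <- c(
  "This is a test about apples.",
  "This apple is a test about an apple.",
  "This orange is a test about apples.",
  "This orange is a test.",
  "This orange is a test about oranges."
)
pattern <- "apple|orange"

# gregexpr() finds every match in each string, not just the first
hits <- gregexpr(pattern, statements, ignore.case = TRUE)

# regmatches() extracts the matched substrings; lengths() counts them per string
n_matches <- lengths(regmatches(statements, hits))
n_matches
# [1] 1 2 2 1 2

# Require at least two matches of any term
n_matches >= 2
# [1] FALSE  TRUE  TRUE FALSE  TRUE
```

Keeping the counts around also makes it easy to try a different threshold later without rewriting the pattern.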
Jota

You can specify how many times you want something repeated in regex with curly braces, like {2} (exactly twice whatever is before it), {2,5} (2-5 times), or {2,} (2 or more times). However, you need to allow for words between the ones you want to match, so you need a wildcard . quantified with * (0 or more times).
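
A small sketch of why the wildcard matters, using a made-up vector x rather than the question's data: without .* the quantified copies have to sit directly next to each other.

```r
# Illustrative strings (not from the question)
x <- c("appleapple", "apple and apple", "apple")

# {2} applies to the group right before it, so the two copies must be adjacent
adjacent_only <- grepl("(apple){2}", x)
adjacent_only
# [1]  TRUE FALSE FALSE

# With .* inside the group, any text may sit between (and after) the copies
with_wildcard <- grepl("(apple.*){2}", x)
with_wildcard
# [1]  TRUE  TRUE FALSE
```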

Thus, if you want either apple or orange matched twice (including apple and orange and vice versa), you can use

grepl('(apple.*|orange.*){2}', Rexample$Statement.Text, ignore.case = TRUE)
# [1] FALSE  TRUE  TRUE FALSE  TRUE

If you want apple repeated twice or orange repeated twice (but not apple once and orange once), quantify separately:

grepl('(apple.*){2,}|(orange.*){2}', Rexample$Statement.Text, ignore.case = TRUE)
# [1] FALSE  TRUE FALSE FALSE  TRUE
alistaire
  • Nice solution. Looks like we had different interpretations of the desired output. Note that you can use `{2}`, as you'll still get `TRUE` even if "apple" shows up more than 2 times. – Jota Mar 31 '16 at 06:58
  • Yeah, actually I had it arranged to match the other way and switched it. You're right about the `{2,}` with `grepl`; I guess an upper bound is useless with `grepl`. – alistaire Mar 31 '16 at 08:27
  • Is there a way within that to put a {2} to cover the entirety of the list? As in, it doesn't matter to me if any one term appears twice, but if just two in sum do? – Matthew Mar 31 '16 at 15:25
  • @Matthew please provide your desired output. Maybe alistaire can use something like this `"(?:(?:apple|orange).*){2}"` Should the third string return TRUE or FALSE? – Jota Mar 31 '16 at 19:01
  • Updated with options for both, but Jota is right, you should specify your desired output in the question. – alistaire Mar 31 '16 at 19:09
  • Ah, gotcha - I'll update the first post to specify, but I would like the third option to trigger True. – Matthew Mar 31 '16 at 20:47
  • Nest your `paste0`s: `paste0('(', paste0(c('apple', 'orange', 'pear'), collapse = '.*|'), '.*){2}')` – alistaire Mar 31 '16 at 21:23
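
The nested paste0 from the last comment can be sketched end to end as follows. The vector terms stands in for the real list of about 50 terms, and statements is an illustrative copy of the question's example data; the sketch also assumes the terms contain no regex metacharacters, which would otherwise need escaping first.

```r
# Hypothetical term list standing in for the real one
terms <- c("apple", "orange", "pear")

# Inner paste0 joins the terms with ".*|"; outer paste0 adds the group and {2}
pattern <- paste0("(", paste0(terms, collapse = ".*|"), ".*){2}")
pattern
# [1] "(apple.*|orange.*|pear.*){2}"

# Example strings from the question, in row order (illustrative copy)
statements <- c(
  "This is a test about apples.",
  "This apple is a test about an apple.",
  "This orange is a test about apples.",
  "This orange is a test.",
  "This orange is a test about oranges."
)

# TRUE when any two terms (same or different) occur in the string
result <- grepl(pattern, statements, ignore.case = TRUE)
result
# [1] FALSE  TRUE  TRUE FALSE  TRUE
```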