
I'm using R to do some string processing, and would like to identify the strings that contain a word with a certain root, provided that word is not preceded anywhere in the string by a word with another given root.

Here is a simple toy example. Say I would like to identify the strings that have the word "cat/s" not preceded by "dog/s" anywhere in the string.

 tests = c(
   "dog cat",
   "dogs and cats",
   "dog and cat", 
   "dog and fluffy cats",
   "cats and dogs", 
   "cat and dog",  
   "fluffy cats and fluffy dogs")  

Using this pattern, I can pull the strings that do have dog before cat:

 pattern = "(dog(s|).*)(cat(s|))"
 grep(pattern, tests, perl = TRUE, value = TRUE)

[1] "dog cat"  "dogs and cats"   "dog and cat"   "dog and fluffy cats"

My negative lookbehind is having problems:

 neg_pattern = "(?<!dog(s|).*)(cat(s|))"
 grep(neg_pattern, tests, perl = TRUE, value = TRUE)

Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) : invalid regular expression

In addition: Warning message: In grep(neg_pattern, tests, perl = TRUE, value = TRUE) : PCRE pattern compilation error 'lookbehind assertion is not fixed length' at ')(cat(s|))'

I understand that .* is not fixed length, so how can I reject strings that have "dog" before "cat" separated by any number of other words?

oguz ismail
Nancy
  • Yeah, your "negative lookahead is having problems" because it is not a lookahead, it is a lookbehind that cannot have a pattern of unknown length. Looks like you just can use a *lookahead* this way - `"^(?!.*dog.*cat).*cat"` – Wiktor Stribiżew Jun 30 '17 at 21:33
  • See http://ideone.com/v6mpjt – Wiktor Stribiżew Jun 30 '17 at 21:38
  • It looks like you can't do what you want in a single regex in R. There's also the same question with a good answer here: https://stackoverflow.com/questions/3796436/whats-the-technical-reason-for-lookbehind-assertion-must-be-fixed-length-in-r – thc Jun 30 '17 at 21:53
  • @WiktorStribiżew I'm trying to understand the word-root component of my question. For example, cats vs cat vs caterpillar... could I use cat(s|erpillar|) etc. – Nancy Jun 30 '17 at 21:55
  • @WiktorStribiżew This one instance works, but I'm also trying to get a more generalized understanding of the syntax. I don't actually care about cats and dogs, after all, my problem is more interesting :P – Nancy Jun 30 '17 at 21:56
  • Then never oversimplify. Post the real scenario issue details. Someone who is awake will certainly help you. – Wiktor Stribiżew Jun 30 '17 at 21:57
  • @thc Thanks for the link. The lack of easily human-readable examples is a big reason I posted this question. – Nancy Jun 30 '17 at 21:57
  • @WiktorStribiżew My example is sufficient to address the parameters of my actual question-- multiple word endings, strings that do not start with the key word etc. I believe that there is value in template-like questions to help future users. If you have some ideas on how to edit the question to expand the functionality and still retain the neutrality, edits are obviously welcome. Thanks for the help! – Nancy Jun 30 '17 at 21:59
  • *strings that do not start with the key word* is not covered in your question. – Wiktor Stribiżew Jun 30 '17 at 22:00
  • As noted in the above comments, you cannot get what you want from a single regular expression. However, a workaround is to find all the cat strings and then eliminate all of the dog.*cat strings. Try this: `grep("dog.*cat", grep("cat", tests, perl = TRUE, value = TRUE), perl = TRUE, value = TRUE, invert=TRUE)` – G5W Jul 01 '17 at 00:27
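
On the word-root point raised in the comments (cat vs. cats vs. caterpillar), here is a minimal sketch, assuming that only "cat"/"cats" and "dog"/"dogs" as whole words should count. The `\\b` word boundaries, the `extra` test strings, and the variable names are illustrative assumptions, not part of the original question:

```r
# Hypothetical extra inputs to show the effect of word boundaries
extra <- c(
  "caterpillar and dog",
  "dog and caterpillar",
  "dog and cats",
  "cats and dog"
)

# \\b anchors the match to whole words, so "caterpillar" does not count as "cat"
has_cat <- grepl("\\bcats?\\b", extra, perl = TRUE)

# TRUE where a whole-word dog/dogs appears somewhere before a whole-word cat/cats
dog_then_cat <- grepl("\\bdogs?\\b.*\\bcats?\\b", extra, perl = TRUE)

# Keep strings that contain cat/cats and have no dog/dogs before it
kept <- extra[has_cat & !dog_then_cat]
kept
# [1] "cats and dog"
```

Without the boundaries, `cat(s|)` also matches inside "caterpillar", so listing endings in the group only helps if the end of the word is pinned down as well.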

1 Answer


I hope that this can help:

tests = c(
  "dog cat",
  "dogs and cats",
  "dog and cat", 
  "dog and fluffy cats",
  "cats and dogs", 
  "cat and dog",  
  "fluffy cats and fluffy dogs"
)

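# remove strings that have cats after dogs
# (logical indexing with grepl() stays correct when nothing matches;
# tests[-grep(...)] would return an empty vector in that case)
tests = tests[!grepl(pattern = "dog(?:s|).*cat(?:s|)", x = tests, perl = TRUE)]

# select only strings that contain cats
tests = tests[grepl(pattern = "cat(?:s|)", x = tests, perl = TRUE)]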

tests

[1] "cats and dogs"               "cat and dog"                
[3] "fluffy cats and fluffy dogs"

I'm not sure if you wanted to do this with one expression, but regular expressions can still be very useful when applied in several steps.
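
For a single-expression version, the negative lookahead suggested in the comments on the question can be sketched as follows. This is only an illustration on the toy data, with `single_pattern` and `result` as names chosen here; it uses the plain roots "dog" and "cat", so it would also match longer words that contain them:

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs"
)

# ^(?!.*dog.*cat) fails at the start of any string in which "dog" occurs
# somewhere before "cat"; the trailing .*cat requires "cat" to be present.
# A lookahead, unlike a lookbehind, may contain a variable-length pattern.
single_pattern <- "^(?!.*dog.*cat).*cat"

result <- grep(single_pattern, tests, perl = TRUE, value = TRUE)
result
# [1] "cats and dogs"               "cat and dog"
# [3] "fluffy cats and fluffy dogs"
```

Anchoring the lookahead at `^` lets it scan the whole string once, which sidesteps the fixed-length restriction that PCRE places on lookbehind.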

Joshua Daly