33

I need to match any 'r' that is preceded by two different vowels. For example, 'our' or 'pear' would match, but 'bar' or 'aar' wouldn't. I managed to match the two different vowels, but I still can't turn that into the condition (...) of a lookbehind for the following 'r'. Neither (?<=...)r nor ...\\Kr yields any results. Any ideas?

x <- c('([aeiou])(?!\\1)(?=(?1))')
y <- c('our','pear','bar','aar')
y[grepl(paste0(x,collapse=''),y,perl=T)]
## [1] "our"  "pear"
David Arenburg
dasf
  • 6
    Classical use case for skip&fail verbs: https://regex101.com/r/qC9kO2/1 – HamZa Apr 13 '15 at 12:59
  • A brute force approach: `combos<-combn(c("a","e","i","o","u"),2);grepl(paste0("(",paste(c(paste0(combos[1,],combos[2,]),paste0(combos[2,],combos[1,])),collapse="|"),")r"),y)`. Very ugly, don't think it is good enough for an answer :) – nicola Apr 13 '15 at 13:01
  • @HamZa, why not an answer? – Pouya Apr 13 '15 at 13:03
  • @Pouya I'm playing around. That was just a comment and not a full answer. You could expand on it with explanation and post it as an answer. – HamZa Apr 13 '15 at 13:05

4 Answers

21

These two solutions seem to work:

the why not way:

x <- '(?<=a[eiou]|e[aiou]|i[aeou]|o[aeiu]|u[aeio])r'
y[grepl(x, y, perl=T)]

the \K way:

x <- '([aeiou])(?!\\1)[aeiou]\\Kr'
y[grepl(x, y, perl=T)]

A variant of the why not way (may be more efficient because it searches for the "r" first):

x <- 'r(?<=a[eiou]r|e[aiou]r|i[aeou]r|o[aeiu]r|u[aeio]r)'

or, to quickly rule out an "r" that is not preceded by two vowels (without testing the whole alternation):

x <- 'r(?<=[aeiou][aeiou]r)(?<=a[eiou]r|e[aiou]r|i[aeou]r|o[aeiu]r|u[aeio]r)'
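
Since the end goal appears to be search and replace, here is a minimal sketch of how the lookbehind and \K patterns behave with `gsub`. The replacement string "rr" and the variable names are only illustrative assumptions; in both cases only the "r" is part of the reported match, so the vowels are left untouched.

```r
y <- c("our", "pear", "bar", "aar")

# Lookbehind version: only the "r" is part of the match.
lookbehind <- "(?<=a[eiou]|e[aiou]|i[aeou]|o[aeiu]|u[aeio])r"
# \K version: the vowels are consumed, then dropped from the reported match.
keep_out <- "([aeiou])(?!\\1)[aeiou]\\Kr"

res_lookbehind <- gsub(lookbehind, "rr", y, perl = TRUE)
res_keep_out   <- gsub(keep_out, "rr", y, perl = TRUE)

print(res_lookbehind)
print(res_keep_out)
```

Both calls should double the "r" in "our" and "pear" and leave "bar" and "aar" unchanged.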
Casimir et Hippolyte
  • 1
    Thank you! The second option looks great. – dasf Apr 13 '15 at 13:19
  • 2
    We could also use recursion to *shorten* the pattern. Not very noob friendly: `([aeiou])(?!\\1)(?1)\\Kr` – HamZa Apr 13 '15 at 13:26
  • 1
    @HamZa: not very useful too, but we could. – Casimir et Hippolyte Apr 13 '15 at 13:27
  • Since the actual vowel set of the language I'm working with has a bunch of special characters that I wouldn't want to type out every time, what I did was that I first defined `V <- '[iyeöäɨauo]'`, and then I specified the condition as `'(',V,')(?!\\1)(?1)\\Kr'`. – dasf Apr 13 '15 at 13:31
  • it makes everything before r non-capturing, so that if you need to change any r in that environment to say rr (as is in my case), the substitution won't actually affect the parts of the word before r. – dasf Apr 13 '15 at 13:40
  • 1
    I removed the `c()` and the `paste(collapse="")` because they don't appear to be necessary. Feel free to roll back if these elements are needed. – Tyler Rinker Apr 13 '15 at 13:40
  • @CathG: it's indeed not necessary for the above example, but I assume that the final goal is to use the pattern in a search replace context. – Casimir et Hippolyte Apr 13 '15 at 13:40
  • @CasimiretHippolyte, thanks for the reply, the need would indeed not be the same with `gsub` instead of `grepl` – Cath Apr 13 '15 at 13:43
  • @TylerRinker `paste(collapse='')` does become necessary if I want to specify `V <- '[iyeöäɨauo]'`, ergo `'(',V,')(?!\\1)(?1)\\Kr'`. – dasf Apr 13 '15 at 13:49
  • 1
    @dasf I can see why you would want to do that. If this is all you're doing though it might be shorter and less error prone just to write that group out twice. Otherwise you're dealing with a lot more quotes and commas and it's easier to mess something up. It isn't quite as easy to maintain though because if you want to change it in the future you'll need to change it in two locations instead of just one. So there are trade offs either way. With that said if you're using that in more than two places that is probably the way to go. – Dason Apr 13 '15 at 13:55
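
The idea discussed in these comments, defining the vowel class once and reusing it through the `(?1)` subroutine call, could be sketched roughly as follows. The ASCII vowel set and the names `V`, `pat` and `hits` are placeholders for illustration, not the asker's actual inventory.

```r
# Hypothetical vowel class; swap in the real inventory as needed.
V <- "[aeiou]"

# (?1) re-runs the subpattern of group 1, so V is spelled out only once.
pat <- paste0("(", V, ")(?!\\1)(?1)\\Kr")

y <- c("our", "pear", "bar", "aar")
hits <- y[grepl(pat, y, perl = TRUE)]
print(hits)
```

`paste0` already concatenates without a separator, so no `collapse` argument is needed when the pieces are passed as separate arguments.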
15

As HamZa points out in the comments, using the skip and fail verbs is one way to do what we want. Basically, we tell the engine to ignore cases where two identical vowels are followed by "r".

# The following is the beginning of the regex and isn't just R code
# the ([aeiou]) captures the first vowel, the \\1 references what we captured
# so this gives us the same vowel two times in a row
# which we then follow with an "r"
# Then we tell it to skip/fail for this
([aeiou])\\1r(*SKIP)(*FAIL)

Having told it to skip those cases, we now add the alternative "or two vowels followed by an 'r'". Since the cases where the two vowels are identical have already been eliminated, this gets us what we want.

|[aeiou]{2}r

Putting it together we end up with

y <- c('our','pear','bar','aar', "aa", "ae", "are", "aeer", "ssseiras")
grep("([aeiou])\\1r(*SKIP)(*FAIL)|[aeiou]{2}r", y, perl = TRUE, value = TRUE)
#[1] "our"      "pear"     "ssseiras"
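
To see which substring the skip/fail pattern actually matches, rather than just which elements contain a match, something like the following sketch with `regexpr` and `regmatches` may help. The variable names are illustrative only.

```r
y <- c('our', 'pear', 'bar', 'aar', "aa", "ae", "are", "aeer", "ssseiras")
pat <- "([aeiou])\\1r(*SKIP)(*FAIL)|[aeiou]{2}r"

# regexpr() finds the first match per element; regmatches() extracts it
# and silently drops the elements that have no match.
m <- regexpr(pat, y, perl = TRUE)
matched <- regmatches(y, m)
print(matched)
```

Unlike the \K and lookbehind patterns, the match here includes the two vowels as well as the "r", which matters if the pattern is later used for substitution.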
Dason
  • Thank you for the explanation. I have a feeling this will be super useful down the road. – dasf Apr 13 '15 at 13:22
6

Here is a less than elegant solution:

y[grepl("[aeiou]{2}r", y, perl=T) & !grepl("(.)\\1r", y, perl=T)]

This probably fails in some corner cases where the first pattern matches at a different location than the second (will have to think about that), but it is something to get you started.
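
A rough sketch of that corner case, using made-up test strings: the two-step filter rejects a string as soon as any doubled character followed by "r" appears anywhere in it, even if a valid vowel pair plus "r" occurs elsewhere. The single lookahead pattern suggested in the comments is shown for comparison.

```r
# Hypothetical inputs: each contains a valid "vowel, different vowel, r".
y <- c("our", "bbraer", "aarour")

# Two-step filter from this answer.
two_step <- grepl("[aeiou]{2}r", y, perl = TRUE) &
  !grepl("(.)\\1r", y, perl = TRUE)

# Single pattern: the lookahead is checked at the same position as the vowels.
single <- grepl("(?!(.)\\1)[aeiou]{2}r", y, perl = TRUE)

print(two_step)
print(single)
```

Here "bbraer" is rejected by the two-step filter because of "bbr", and "aarour" because of "aar", while the single pattern accepts both.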

BrodieG
  • 1
    I like this approach. Most of the times we could simplify the process by breaking it down in two or more processes (regexes) +1 – HamZa Apr 13 '15 at 13:29
  • I think just `(?!(.)\1)[aeiou]{2}r` would be nicer - the pattern is *a little* more complicated, but the code is much simpler. – Kobi Apr 14 '15 at 08:29
  • @Kobi, your pattern doesn't do the same thing and would match stuff like "bcaar". The right way to do it is as Casimir shows. – BrodieG Apr 14 '15 at 12:32
  • You are quite mistaken. I think that is a problem with your code, that would not accept `bbraer` (as you've said yourself in your answer). – Kobi Apr 14 '15 at 12:34
  • @Kobi, very interesting, read that as a lookbehind instead of a lookahead. Similar in spirit to Casimir's though, right? – BrodieG Apr 14 '15 at 12:44
  • Yes. I'd add it an answer, but it is too similar to Casimir's answer, and yours. – Kobi Apr 14 '15 at 12:45
4

Another one through negative lookahead assertion.

> y <- c('our','pear','bar','aar', "aa", "ae", "are", "aeer", "ssseiras")
> grep("(?!(?:aa|ee|ii|oo|uu)r)[aeiou][aeiou]r", y, perl=TRUE, value=TRUE)
[1] "our"      "pear"     "ssseiras"

> grep("(?!aa|ee|ii|oo|uu)[aeiou][aeiou]r", y, perl=TRUE, value=TRUE)
[1] "our"      "pear"     "ssseiras"

(?!aa|ee|ii|oo|uu) asserts that the first two characters of the match are not aa, ee, ii, oo or uu. So [aeiou][aeiou] matches any two vowels, as long as they are not the same vowel repeated. That's why the condition is placed first. r then matches the r that follows the vowels.

Avinash Raj