I'm working on interjections with duplicate letters, for example:
x <- c("hallelujah", "hey", "tarrah", "yeah", "bubye", "eureka", "aha", "cooee", "helloee") # "hey" and "yeah" are included as controls
I understand that a backreference is the key to matching duplicates. For example, (\\w)\\1 matches interjections in which a letter is immediately followed by its duplicate:
grep("(\\w)\\1", x, value = TRUE)
[1] "hallelujah" "tarrah" "cooee" "helloee"
However, I don't understand why the extended pattern below fails to match every interjection that contains a duplicated letter, regardless of where the duplicate occurs. The * quantifier should allow the preceding . to occur any number of times, including zero, so words such as "tarrah" and "cooee" should still match:
grep("(\\w).*\\1", x, value = TRUE)
[1] "hallelujah" "bubye" "eureka" "aha"
I also don't understand why the pattern must begin with .* in order to match all of them:
grep(".*(\\w).*\\1", x, value = TRUE)
[1] "hallelujah" "tarrah" "bubye" "eureka" "aha" "cooee" "helloee"
Can anybody explain why?
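In case it helps narrow things down, here is a minimal check that runs the same pattern under both of R's regex engines. It assumes the discrepancy might come from the default engine (TRE) rather than from the pattern itself, which is only a guess. The names pattern, tre_hits and pcre_hits are introduced here purely for illustration.

```r
x <- c("hallelujah", "hey", "tarrah", "yeah", "bubye", "eureka", "aha", "cooee", "helloee")
pattern <- "(\\w).*\\1"

# Same pattern, two engines: the default (TRE) and PCRE (perl = TRUE)
tre_hits  <- grep(pattern, x, value = TRUE)
pcre_hits <- grep(pattern, x, value = TRUE, perl = TRUE)

# Words matched by PCRE but not by the default engine, if any
setdiff(pcre_hits, tre_hits)
```

If setdiff() returns anything, the two engines disagree on this pattern, which would point at the engine rather than at the regex logic.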