
I'm working on interjections with duplicate letters, for example:

x <- c("hallelujah", "hey", "tarrah", "yeah", "bubye", "eureka", "aha", "cooee", "helloee") # "hey" and "yeah" are included as controls 

I understand that a backreference is the key to matching duplicates. For example, (\\w)\\1 matches interjections in which a letter is immediately followed by a copy of itself:

grep("(\\w)\\1", x, value = TRUE)
[1] "hallelujah" "tarrah"     "cooee"      "helloee"

However, I don't understand why this extended pattern fails to match every interjection that contains a duplicated letter, wherever the duplicate occurs (it misses "tarrah", "cooee", and "helloee"), even though the * quantifier should allow the preceding . to occur any number of times, including zero:

grep("(\\w).*\\1", x, value = TRUE)
[1] "hallelujah" "bubye"      "eureka"     "aha"

and why the pattern has to start with .* in order to match all of them:

grep(".*(\\w).*\\1", x, value = TRUE)
[1] "hallelujah" "tarrah"     "bubye"      "eureka"     "aha"        "cooee"      "helloee"

Can anybody explain why?

Chris Ruehlemann
