
An R question here was recently answered by mrdwab using a regex that I thought was pretty cool (LINK). I liked the answer but can't generalize it because I don't understand what's happening (I tried changing the numeric values being supplied, but that didn't yield anything useful). Could someone break the regex down piece by piece and explain what each part does?

x <- c("WorkerId", "pio_1_1", "pio_1_2", "pio_1_3", "pio_1_4", "pio_2_1", 
"pio_2_2", "pio_2_3", "pio_2_4")

gsub("([a-z])_([0-9])_([0-9])", "\\1_\\3\\.\\2", x)  # Please explain this

Thank you in advance.

Tyler Rinker

1 Answer


Anywhere you have a lowercase letter and two single digits separated by underscores (e.g., a_1_2), the regex matches, and each parenthesized piece of the pattern captures the text it matched so the replacement can reuse it. \\1, \\2, and \\3 refer to the first, second, and third captured groups of the original expression:

\\1 <- a
\\2 <- 1
\\3 <- 2
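
To see the captured pieces directly, here is a small sketch using base R's regexec() and regmatches(). The variable names (pattern, m, groups, m2) are just mine for illustration. Note that [a-z] only matches a single letter, so in "pio_1_2" only the "o" ends up in \\1.

```r
pattern <- "([a-z])_([0-9])_([0-9])"

# regexec() records the full match plus each parenthesized group;
# regmatches() turns those positions back into text.
m <- regmatches("a_1_2", regexec(pattern, "a_1_2"))[[1]]

# m[1] is the whole match; m[2], m[3], m[4] are \\1, \\2, \\3
groups <- m[-1]
print(groups)   # "a" "1" "2"

# With a longer prefix, only the last letter before the underscore is captured
m2 <- regmatches("pio_1_2", regexec(pattern, "pio_1_2"))[[1]]
print(m2)       # "o_1_2" "o" "1" "2"
```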

The result of running gsub as you have it above is to search each string for matches and, wherever one appears, swap the order of the two numbers and join them with a dot instead of an underscore. So, for example, a_1_2 would become a_2.1.

"\\1_\\3\\.\\2"
#  a_  2  .  1
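
For completeness, here is the call from the question run on the question's own vector. Only the name result is mine. "WorkerId" has no letter_digit_digit piece, so it comes back unchanged.

```r
x <- c("WorkerId", "pio_1_1", "pio_1_2", "pio_1_3", "pio_1_4",
       "pio_2_1", "pio_2_2", "pio_2_3", "pio_2_4")

# \\1 = the letter, \\2 = first digit, \\3 = second digit;
# the replacement writes them back as letter_second.first
result <- gsub("([a-z])_([0-9])_([0-9])", "\\1_\\3\\.\\2", x)
print(result)
# "WorkerId" "pio_1.1" "pio_2.1" "pio_3.1" "pio_4.1"
# "pio_1.2"  "pio_2.2" "pio_3.2" "pio_4.2"
```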
rjz
  • That first part is really helpful in that it's looking for a character_#_#. Now could you add to your explanation of the `\\1_\\3\\.\\2`? – Tyler Rinker Apr 15 '12 at 16:47
  • @TylerRinker that's the reordering part; they are called "backreferences". The content of the first bracket (here `([a-z])`) is stored in `\1`, the content of the second is stored in `\2`, and the third in `\3`. – stema Apr 15 '12 at 16:53
  • Thanks @Stema! I've also updated the answer to try and clarify. One other thing worth mentioning is the escape on the `.`: in (most) regex languages, a `.` will match anything. Escaping it makes it just a literal dot. – rjz Apr 15 '12 at 16:55
  • No one's mentioned it specifically yet, but the parentheses around each piece are what allow the backreferences. (.)([A-z]) allows for \1 and \2. If there were another set around the entire regexp, ((.)([A-z])), then you'd have three groups, numbered by where each opening parenthesis appears: \1 = ((.)([A-z])), \2 = (.), \3 = ([A-z]). Hope it helps. – Rob Apr 15 '12 at 20:46
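
A quick check of the two points raised in the comments, as a rough sketch (the inputs "xy" and "1_2" and the variable names are made up for illustration). As far as I can tell, groups are numbered by the position of their opening parenthesis, so an outer group comes first. The `.` appears to be special only in the pattern, not in the replacement, so the escape there looks optional.

```r
# Two groups side by side: \\1 is the first character, \\2 the second
flat <- sub("(.)([a-z])", "[1=\\1][2=\\2]", "xy")
print(flat)     # "[1=x][2=y]"

# Wrapping both in an outer group shifts the numbering:
# \\1 is the outer group, \\2 and \\3 are the inner ones
nested <- sub("((.)([a-z]))", "[1=\\1][2=\\2][3=\\3]", "xy")
print(nested)   # "[1=xy][2=x][3=y]"

# In the replacement string, an escaped and an unescaped dot give the same result
escaped <- gsub("([0-9])_([0-9])", "\\2\\.\\1", "1_2")
plain   <- gsub("([0-9])_([0-9])", "\\2.\\1", "1_2")
print(c(escaped, plain))   # "2.1" "2.1"
```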