I wish to extract the portion of a string inside the third set of parentheses, preferably using base R. Here is an example data set:

my.data <- read.table(text = '
     my.num                              my.string                                  my.cov
        1    Abc(~1)Fgf(~-1+e2:cp)Bca(~-1+g1+g2:ti+g2:cfi+g2:pp+g2:cp)q(~-1+re:se)    10
        2    Abc(~1)Fgf(~-1+e1:e2:fi)Bca(~-1+g1+g2:ti+g2:pr+g2:ts+g2:cfi)q(~1)        20
        3    Abc(~1)Fgf(~1)Bca(~-1+g1+g2+g2:cp)q(~-1+re:se)                           15
', header = TRUE, stringsAsFactors = FALSE)
my.data

Either of these two results would be helpful:

desired.result1 <- read.table(text = '
     my.num                     my.string            my.cov
        1    Bca(~-1+g1+g2:ti+g2:cfi+g2:pp+g2:cp)      10
        2    Bca(~-1+g1+g2:ti+g2:pr+g2:ts+g2:cfi)      20
        3    Bca(~-1+g1+g2+g2:cp)                      15
', header = TRUE, stringsAsFactors = FALSE)
desired.result1

desired.result2 <- read.table(text = '
     my.num                     my.string       my.cov
        1    ~-1+g1+g2:ti+g2:cfi+g2:pp+g2:cp      10
        2    ~-1+g1+g2:ti+g2:pr+g2:ts+g2:cfi      20
        3    ~-1+g1+g2+g2:cp                      15
', header = TRUE, stringsAsFactors = FALSE)
desired.result2

I am so rusty on regex that I am not even sure where to begin, and I could not locate a similar question on the internet. Thank you for any advice or assistance.

zx8754
Mark Miller

2 Answers

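Using strsplit: split on "(" and take the fourth piece, which begins just after the third "(", then split that piece on ")" and keep the first part: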

sapply(strsplit(my.data$my.string, split = "(", fixed = TRUE), function(i){
  strsplit(i[4], split = ")", fixed = TRUE)[[1]][1]})

# [1] "~-1+g1+g2:ti+g2:cfi+g2:pp+g2:cp" "~-1+g1+g2:ti+g2:pr+g2:ts+g2:cfi" "~-1+g1+g2+g2:cp" 
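
The same split-twice idea can be wrapped in a small function so the group number is a parameter. This is only a sketch: `extract_nth_paren` is a hypothetical name, not part of the answer above, and it assumes the parentheses are balanced and never nested.

```r
# Hypothetical helper: return the text inside the n-th pair of parentheses.
# Assumes balanced, non-nested parentheses.
extract_nth_paren <- function(x, n = 3) {
  pieces <- strsplit(x, split = "(", fixed = TRUE)
  vapply(pieces, function(p) {
    # Piece n + 1 starts right after the n-th "("; NA if there are fewer groups.
    if (length(p) < n + 1) return(NA_character_)
    strsplit(p[n + 1], split = ")", fixed = TRUE)[[1]][1]
  }, character(1))
}

x <- c("Abc(~1)Fgf(~-1+e2:cp)Bca(~-1+g1+g2:ti)q(~-1+re:se)",
       "Abc(~1)Fgf(~1)")
res <- extract_nth_paren(x, 3)
```

Using `vapply` instead of `sapply` guarantees a character vector even when some strings have fewer than `n` groups; those entries come back as `NA`.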
zx8754
    Or use `paste0(sapply(strsplit(my.data$my.string, '\\)'), '[', 3), ')')` to get the first desired result. – Sotos Feb 20 '17 at 21:36

First expression (keeps the label, as in desired.result1):

sub(".*?\\(.*?\\).*?\\(.*?\\)(.*?\\(.*?\\)).*", "\\1", my.data$my.string)
[1] "Bca(~-1+g1+g2:ti+g2:cfi+g2:pp+g2:cp)" "Bca(~-1+g1+g2:ti+g2:pr+g2:ts+g2:cfi)"
[3] "Bca(~-1+g1+g2+g2:cp)" 

Second expression (only the contents of the parentheses, as in desired.result2):

sub(".*?\\(.*?\\).*?\\(.*?\\).*?\\((.*?)\\).*", "\\1", my.data$my.string)
[1] "~-1+g1+g2:ti+g2:cfi+g2:pp+g2:cp" "~-1+g1+g2:ti+g2:pr+g2:ts+g2:cfi" "~-1+g1+g2+g2:cp"
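
As a possible alternative to one long pattern, base R's `gregexpr` and `regmatches` can pull out every `label(...)` chunk and then index the third. This is a sketch rather than the method used above; the variable names are illustrative, the sample strings are shortened, and the `[^()]*` pieces assume parentheses are never nested.

```r
x <- c("Abc(~1)Fgf(~-1+e2:cp)Bca(~-1+g1+g2:ti)q(~-1+re:se)",
       "Abc(~1)Fgf(~1)Bca(~-1+g1+g2+g2:cp)q(~1)")

# Match every "label(...)" chunk; [^()]* assumes no nested parentheses.
chunks <- regmatches(x, gregexpr("[^()]*\\([^()]*\\)", x))

# Third chunk with its label (the form of desired.result1);
# m[3] is NA when a string has fewer than three chunks.
with_label <- vapply(chunks, function(m) m[3], character(1))

# Drop the label and the enclosing parentheses (the form of desired.result2).
inner <- sub("^[^(]*\\((.*)\\)$", "\\1", with_label)
```

Because each chunk is matched separately, picking a different group only means changing the index `3`, with no need to lengthen the pattern.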
G5W