I have a data frame with two columns. I want to extract the strings in parentheses from Gene.Name and add them as a new column. The data frame is shown below.
structure(list(ID = 1:12, Gene.Name = structure(c(3L, 11L, 9L,
5L, 1L, 8L, 2L, 4L, 6L, 12L, 10L, 7L), .Label = c(" ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA",
" heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA", " NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA",
" ribosomal protein L34 (RPL34), transcript variant 1, mRNA",
" ribosomal protein S11 (RPS11), mRNA", "ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA",
"clone MGC:10120 IMAGE:3900723, mRNA, complete cds", "cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA",
"farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA", "homeobox protein from AL590526 (LOC84528), mRNA",
"mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA",
"ribosomal protein S15a (RPS15A), mRNA"), class = "factor")), .Names = c("ID",
"Gene.Name"), row.names = c(NA, -12L), class = "data.frame")
If no string in parentheses is found, leave that row empty (NA). I have two cases:
1) Get all the strings in parentheses and add them as a new column, separated by ,
2) Get only the last string in parentheses and add it as a new column
I tried something like df$Symbol <- sapply(df, function(x) sub("\\).*", "", sub(".*\\(", "", x)))
but it does not give the expected output.
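A likely reason the attempt misbehaves, shown on a two-row stand-in (this small `df` is made up for illustration and is not the real data): `sapply(df, ...)` runs the function over every column of the data frame, including ID, so the result is a matrix rather than a single column. Also, `sub()` returns its input unchanged when the pattern does not match, so rows without parentheses keep the full text instead of becoming NA.

```r
# Hypothetical two-row stand-in for the real data frame.
df <- data.frame(
  ID = 1:2,
  Gene.Name = c("ribosomal protein S11 (RPS11), mRNA", "clone MGC:10120, mRNA"),
  stringsAsFactors = FALSE
)

# The original attempt: sapply() visits each column (ID and Gene.Name) in turn.
res <- sapply(df, function(x) sub("\\).*", "", sub(".*\\(", "", x)))
dim(res)  # a 2 x 2 character matrix, one column per column of df

# Applying the same nested sub() calls to the Gene.Name column only.
sym <- sub("\\).*", "", sub(".*\\(", "", df$Gene.Name))
sym  # the row without parentheses comes back unchanged, not as NA
```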
Case 1 output
ID Gene.Name Symbol
1 NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA ubiquinone,9kD, MLRQ,NDUFA4
2 mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA MRPS33
3 farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA FDFT1
4 ribosomal protein S11 (RPS11), mRNA RPS11
5 ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA oligomycin sensitivity conferring protein,ATP5O
6 cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA CMAS
7 heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA HNRPF
8 ribosomal protein L34 (RPL34), transcript variant 1, mRNA RPL34
9 ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA subunit 9,ATP5G3
10 ribosomal protein S15a (RPS15A), mRNA RPS15A
11 homeobox protein from AL590526 (LOC84528), mRNA LOC84528
12 clone MGC:10120 IMAGE:3900723, mRNA, complete cds NA
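One way the case 1 output might be produced, as a minimal base R sketch. It assumes the parentheses are never nested and uses a three-row character vector `gene` as a stand-in for `as.character(df$Gene.Name)`; the names `gene`, `hits` and `all_symbols` are made up for this example.

```r
# Three rows taken from the data above; in practice use as.character(df$Gene.Name).
gene <- c(
  " NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA",
  "ribosomal protein S15a (RPS15A), mRNA",
  "clone MGC:10120 IMAGE:3900723, mRNA, complete cds"
)

# Find every "(...)" group; "[^()]*" keeps each match inside one pair of parentheses.
hits <- regmatches(gene, gregexpr("\\([^()]*\\)", gene))

# Drop the parentheses and join the pieces with ","; rows with no match become NA.
all_symbols <- vapply(hits, function(m) {
  if (length(m) == 0) NA_character_ else paste(gsub("[()]", "", m), collapse = ",")
}, character(1))

all_symbols
# "ubiquinone,9kD, MLRQ,NDUFA4" "RPS15A" NA
```

With the real data this would be assigned as `df$Symbol <- all_symbols`, computed from the full column rather than the stand-in.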
Case 2 output
ID Gene.Name Symbol
1 NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA NDUFA4
2 mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA MRPS33
3 farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA FDFT1
4 ribosomal protein S11 (RPS11), mRNA RPS11
5 ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA ATP5O
6 cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA CMAS
7 heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA HNRPF
8 ribosomal protein L34 (RPL34), transcript variant 1, mRNA RPL34
9 ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA ATP5G3
10 ribosomal protein S15a (RPS15A), mRNA RPS15A
11 homeobox protein from AL590526 (LOC84528), mRNA LOC84528
12 clone MGC:10120 IMAGE:3900723, mRNA, complete cds <NA>
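For case 2, a similar hedged sketch in base R: a single `sub()` with a capture group, guarded by `grepl()` so that rows without parentheses become NA. As before, `gene`, `pattern` and `last_symbol` are illustrative names, `gene` stands in for `as.character(df$Gene.Name)`, and nested parentheses are assumed not to occur.

```r
# Three rows taken from the data above; in practice use as.character(df$Gene.Name).
gene <- c(
  "ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA",
  "ribosomal protein S15a (RPS15A), mRNA",
  "clone MGC:10120 IMAGE:3900723, mRNA, complete cds"
)

# The leading greedy ".*" runs up to the last "(", so the capture group
# holds the final parenthesised string.
pattern <- "^.*\\(([^()]*)\\).*$"

# sub() leaves non-matching strings unchanged, so test with grepl() first
# and return NA for rows that have no parentheses.
last_symbol <- ifelse(grepl(pattern, gene), sub(pattern, "\\1", gene), NA_character_)

last_symbol
# "ATP5G3" "RPS15A" NA
```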