I have a data frame of metabolic pathways from PATRIC with a column called Product, which looks like this:
Glycyl-tRNA synthetase beta chain (EC 6.1.1.14)
1,3-propanediol dehydrogenase (EC 1.1.1.202)
Glycine dehydrogenase [decarboxylating] (glycine cleavage system P2 protein) (EC 1.4.4.2)
2',3'-cyclic-nucleotide 2'-phosphodiesterase (EC 3.1.4.16) / 3'-nucleotidase (EC 3.1.3.6)
1,4-alpha-glucan (glycogen) branching enzyme, GH-13-type (EC 2.4.1.18)
I extracted the text inside the parentheses using the second answer here. Some of the entries above have more than one set of parentheses, so I now have a list (not a plain vector) that looks like:
[[1]]
[1] "EC 6.1.1.14"
[[2]]
[1] "EC 1.1.1.202"
[[3]]
[1] "glycine cleavage system P2 protein" "EC 1.4.4.2"
[[4]]
[1] "EC 3.1.4.16" "EC 3.1.3.6"
[[5]]
[1] "glycogen" "EC 2.4.1.18"
How do I remove the quoted strings ("") that do not start with EC? Also, entries with two EC numbers should be joined with a /, if possible.
Desired output, as per Mr. Flick's answer:
# [1] "EC 6.1.1.14" "EC 1.1.1.202" "EC 1.4.4.2"
# [4] "EC 3.1.4.16/EC 3.1.3.6" "EC 2.4.1.18"
My working example:
nan <- structure(list(
  Accession = structure(c(1L, 2L, 1L, 1L, 3L),
    .Label = c("1485.142.con.0001", "1485.142.con.0002", "1485.142.con.0009"),
    class = "factor"),
  PATRIC.ID = structure(c(2L, 3L, 1L, 5L, 4L),
    .Label = c("fig|1485.142.peg.1066", "fig|1485.142.peg.1362",
               "fig|1485.142.peg.2123", "fig|1485.142.peg.3103",
               "fig|1485.142.peg.561"),
    class = "factor"),
  Product = structure(c(5L, 1L, 4L, 3L, 2L),
    .Label = c("1,3-propanediol dehydrogenase (EC 1.1.1.202)",
               "1,4-alpha-glucan (glycogen) branching enzyme, GH-13-type (EC 2.4.1.18)",
               "2,3-cyclic-nucleotide 2-phosphodiesterase (EC 3.1.4.16) / 3-nucleotidase (EC 3.1.3.6)",
               "Glycine dehydrogenase [decarboxylating] (glycine cleavage system P2 protein) (EC 1.4.4.2)",
               "Glycyl-tRNA synthetase beta chain (EC 6.1.1.14)"),
    class = "factor")),
  .Names = c("Accession", "PATRIC.ID", "Product"),
  row.names = c(NA, 5L), class = "data.frame")
# Extract the text inside each pair of parentheses; the result is a list with one character vector per row
blah <- regmatches(nan$Product,gregexpr("(?<=\\().*?(?=\\))", nan$Product, perl=TRUE))
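One possible way to get from that list to the desired output is sketched below. It is only an illustration, not necessarily the approach from the linked answer. The list is rebuilt by hand so the snippet runs on its own, and the names `ec`, `products` and `ec_direct` are made up for this example. The alternative pattern `EC [0-9.-]+` assumes EC numbers consist only of digits, dots and dashes.

```r
# Hand-built stand-in for the list returned by regmatches() above
blah <- list(
  "EC 6.1.1.14",
  "EC 1.1.1.202",
  c("glycine cleavage system P2 protein", "EC 1.4.4.2"),
  c("EC 3.1.4.16", "EC 3.1.3.6"),
  c("glycogen", "EC 2.4.1.18")
)

# For each list element, keep only the strings that start with "EC ",
# then collapse whatever is left into a single string separated by "/"
ec <- vapply(
  blah,
  function(x) paste(x[grepl("^EC ", x)], collapse = "/"),
  character(1)
)
print(ec)

# Alternative: match only the EC numbers in the first place, so that
# other parenthesised text never ends up in the list
# (assumption: EC numbers contain only digits, dots and dashes)
products <- c(
  "Glycine dehydrogenase [decarboxylating] (glycine cleavage system P2 protein) (EC 1.4.4.2)",
  "2,3-cyclic-nucleotide 2-phosphodiesterase (EC 3.1.4.16) / 3-nucleotidase (EC 3.1.3.6)"
)
ec_direct <- vapply(
  regmatches(products, gregexpr("EC [0-9.-]+", products)),
  paste,
  character(1),
  collapse = "/"
)
print(ec_direct)
```

With the real data, the second variant would need `as.character(nan$Product)` in place of `products`, because Product is stored as a factor. A row with no EC number at all comes back as an empty string in both variants.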