How to remove text in a character vector that does not start with selected text

Question

I have a data frame of metabolic pathways from PATRIC with a column called Product, which looks like this:

Glycyl-tRNA synthetase beta chain (EC 6.1.1.14)
1,3-propanediol dehydrogenase (EC 1.1.1.202)
Glycine dehydrogenase [decarboxylating] (glycine cleavage system P2 protein) (EC 1.4.4.2)
2',3'-cyclic-nucleotide 2'-phosphodiesterase (EC 3.1.4.16) / 3'-nucleotidase (EC 3.1.3.6)
1,4-alpha-glucan (glycogen) branching enzyme, GH-13-type (EC 2.4.1.18)

I have extracted the EC numbers using the second answer here. Some of the text above has more than one set of parentheses, so I now have a list which looks like:

[[1]]
[1] "EC 6.1.1.14"

[[2]]
[1] "EC 1.1.1.202"

[[3]]
[1] "glycine cleavage system P2 protein" "EC 1.4.4.2"                        

[[4]]
[1] "EC 3.1.4.16" "EC 3.1.3.6" 

[[5]]
[1] "glycogen"    "EC 2.4.1.18"

How do I remove the strings that do not start with EC? Also, rows with two EC numbers should be joined with a /, if possible.

Desired output, as per Mr. Flick's answer:

# [1] "EC 6.1.1.14"            "EC 1.1.1.202"           "EC 1.4.4.2"
# [4] "EC 3.1.4.16/EC 3.1.3.6" "EC 2.4.1.18"

My working example:

nan <- structure(list(Accession = structure(c(1L, 2L, 1L, 1L, 3L), .Label = c("1485.142.con.0001","1485.142.con.0002", "1485.142.con.0009"), class = "factor"),PATRIC.ID = structure(c(2L, 3L, 1L, 5L, 4L), .Label = c("fig|1485.142.peg.1066","fig|1485.142.peg.1362", "fig|1485.142.peg.2123", "fig|1485.142.peg.3103","fig|1485.142.peg.561"), class = "factor"), Product = structure(c(5L,1L, 4L, 3L, 2L), .Label = c("1,3-propanediol dehydrogenase (EC 1.1.1.202)","1,4-alpha-glucan (glycogen) branching enzyme, GH-13-type (EC 2.4.1.18)","2,3-cyclic-nucleotide 2-phosphodiesterase (EC 3.1.4.16) / 3-nucleotidase (EC 3.1.3.6)","Glycine dehydrogenase [decarboxylating] (glycine cleavage system P2 protein) (EC 1.4.4.2)","Glycyl-tRNA synthetase beta chain (EC 6.1.1.14)"), class = "factor")), .Names = c("Accession","PATRIC.ID", "Product"), row.names = c(NA, 5L), class = "data.frame") 

# Extract the text inside each pair of parentheses; returns a list with one element per row
blah <- regmatches(nan$Product, gregexpr("(?<=\\().*?(?=\\))", nan$Product, perl = TRUE))

What do you mean "lines with two EC numbers should be divided with a / if possible"? Can you post your desired output too? — A5C1D2H2I1M1N2O1R2T1, Dec 13 '17 at 16:20

score 2 · Accepted Answer · answered Dec 13 '17 at 16:21

If you want to match only the parenthesized text that starts with EC, add that to your expression:

blah <- regmatches(nan$Product,gregexpr("(?<=\\()EC.*?(?=\\))", nan$Product, perl=TRUE))

If you want to join multiple matches with a slash, use paste0() (or paste()) with the collapse argument:

sapply(blah, paste0, collapse="/")
# [1] "EC 6.1.1.14"            "EC 1.1.1.202"           "EC 1.4.4.2"            
# [4] "EC 3.1.4.16/EC 3.1.3.6" "EC 2.4.1.18"
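
If the list of all parenthesized fragments has already been built, as in the question, the non-EC entries can also be dropped afterwards instead of changing the regular expression. The following is a minimal sketch of that alternative, not part of the original answer; it rebuilds the list by hand so it runs on its own, and the names ec_only and ec_joined are made up for illustration.

```r
# The list from the question: every parenthesized fragment, one element per row.
# (Typed in by hand here so the example is self-contained.)
blah <- list(
  "EC 6.1.1.14",
  "EC 1.1.1.202",
  c("glycine cleavage system P2 protein", "EC 1.4.4.2"),
  c("EC 3.1.4.16", "EC 3.1.3.6"),
  c("glycogen", "EC 2.4.1.18")
)

# Keep only the strings that start with "EC " (hypothetical helper name).
ec_only <- lapply(blah, function(x) x[grepl("^EC ", x)])

# Collapse each row's EC numbers into one string separated by "/".
# vapply() guarantees a character vector with exactly one string per row.
ec_joined <- vapply(ec_only, paste, character(1), collapse = "/")

ec_joined
# [1] "EC 6.1.1.14"            "EC 1.1.1.202"           "EC 1.4.4.2"
# [4] "EC 3.1.4.16/EC 3.1.3.6" "EC 2.4.1.18"
```

A row with no EC number at all would come out as an empty string "" with this approach, which may or may not be what you want.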
