I have a follow-up question on a previous answer that can be found here: Split uneven string in R - variable substring and delimiters
In summary, I wanted to extract the bolded text in a string that follows this pattern:
sp|Q2UVX4|CO3_BOVIN **Complement C3** OS=Bos taurus OX=9913 GN=**C3** PE=1 SV=2
Here is a piece of the answer provided by Martin Gal:
protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"),
str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
NA_character_),
His answer was excellent, but sometimes I have a mix of species (e.g.: BOVIN and HUMAN), so I wanted to make the code a bit more flexible. I tried using only a space (\\s) and a capital letter followed by a space ([A-Z]\\s) as the delimiter, but the first failed and the second was inaccurate for some strings. Then I combined Martin's approach with a delimiter ending in a capital letter, aiming to select the entire first chunk (e.g.: sp|Q2UVX4|CO3_BOVIN) as the delimiter, and changed the code to this:
protein_name = ifelse(str_detect(string, "[a-z]{2}\\|(.*?)[A-Z]\\s(.*?)\\sOS=.*"),
str_replace(string, "[a-z]{2}\\|(.*?)[A-Z]\\s(.*?)\\sOS=.*", "\\2"),
NA_character_),
- In this case, what would be the best way to select everything in between the two patterns? The two patterns are "sp" at the start and a capital letter followed by one space at the end.
- I used (.*?); is this the best approach?
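To make the goal concrete, here is one possible sketch of a species-agnostic version. It assumes that the first chunk (e.g.: sp|Q2UVX4|CO3_BOVIN) never contains whitespace, so it can be consumed with ^\\S+ instead of naming the species. The second and third input strings are hypothetical examples added only for illustration, and str_match() is swapped in for the ifelse/str_detect/str_replace combination because it returns NA on its own when there is no match.

```r
library(stringr)

# The first string is from the question; the other two are hypothetical.
string <- c(
  "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2",
  "sp|P00000|ABC1_HUMAN Example protein 1 OS=Homo sapiens OX=9606 GN=ABC1 PE=1 SV=1",
  "no header here"
)

# ^\\S+   : the whole first chunk, whatever the species suffix is
# (.*?)   : lazily capture the protein name ...
# \\s+OS= : ... up to the first " OS=" field
protein_name <- str_match(string, "^\\S+\\s+(.*?)\\s+OS=")[, 2]

# The gene name is the run of non-space characters after "GN=".
gene_name <- str_match(string, "\\sGN=(\\S+)")[, 2]

print(protein_name)
print(gene_name)
```

Column 2 of the str_match() result holds the first capture group, and rows that do not match come back as NA, which plays the same role as the NA_character_ branch above.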