Optimizing a regex in R for substring extraction

Question

I have a follow-up question on a previous answer that can be found here: Split uneven string in R - variable substring and delimiters

In summary, I wanted to extract the bolded text in a string that follows this pattern:

sp|Q2UVX4|CO3_BOVIN **Complement C3** OS=Bos taurus OX=9913 GN=**C3** PE=1 SV=2

Here is a piece of the answer provided by Martin Gal:

protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"), 
                      str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
                      NA_character_),

His answer was excellent, but sometimes I have a mix of species (e.g.: BOVIN and HUMAN), so I wanted to make the code a bit more flexible. I tried with only space (\\s) and capital letters with space ([A-Z]\\s) but the first failed and the second was inaccurate for some strings. Then I mixed Martin's approach with a string ending in capital letters, aiming to select the entire first chunk as the delimiter (e.g.: sp|Q2UVX4|CO3_BOVIN).

To this:

protein_name = ifelse(str_detect(string, "[a-z]{2}\\|(.*?)[A-Z]\\s(.*?)\\sOS=.*"), 
                      str_replace(string, "[a-z]{2}\\|(.*?)[A-Z]\\s(.*?)\\sOS=.*", "\\2"),
                      NA_character_),

In this case, what would be the best way to select everything in between the two patterns? The two patterns are "sp" and a capital letter followed by one space.
I used (.*?). Is this the best approach?

What do you mean by "best"? `.*?` is the simplest, but not the fastest. It does not match line breaks. — Wiktor Stribiżew, Nov 17 '21 at 16:41
You will have to include your `string` as part of the question — Onyambu, Nov 17 '21 at 16:42
Also include the required output since there are multiple BOVIN in the same string — Onyambu, Nov 17 '21 at 16:48
It is kind of unclear what you are doing (after reading this question). It looks like all you need is `str_match(string, '[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS=')[,2]` (see [demo](https://ideone.com/1I9vlK) & [demo](https://regex101.com/r/1E8q2z/1/)). — Wiktor Stribiżew, Nov 17 '21 at 16:54
Thank you both for your comments. What I mean by best is making sure the regex I used will be precise to my case and not allow "false-positives". For example, if I used only `.*` I could risk matching a longer string that follows the same pattern, right? Also, do I need the parenthesis around `.*?` ? @Wiktor, the regex101 you linked is AWESOME. Thank you so much. But I still couldn't understand the `[|\\w]*`. Do you mind explaining it? — Luiz Gustavo, Nov 17 '21 at 18:44

Onyambu · Answer 1 · 2021-11-17T17:40:13.453

This can be solved as follows:

str_extract_all(string, "(?<=(?:BOVIN|HUMAN) )(.*?)(?= OS).*?GN=(\\w+)") %>%
   map_df(~read.table(text=str_replace(.,"OS.*GN", ""), sep="=",
             col.names = c('protein_name', 'gene')), .id='grp')
   grp                                                                protein_name   gene
1    1                                                              Complement C3      C3
2    1                                                                  C3-beta-c      C3
3    1                                                                  C3-beta-c      C3
4    2                                                                Haptoglobin      HP
5    2                                                                Haptoglobin      HP
6    2                                                                Haptoglobin      HP
7    3                                                     Anion exchange protein  SLC4A7
8    4                                        Isoform V3 of Versican core protein    VCAN
9    4                                        Isoform V2 of Versican core protein    VCAN
10   4                                                      Versican core protein    VCAN
11   5 Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris)   KRT10
12   5                                            Keratin, type I cytoskeletal 10   KRT10

You could also use the following. Note that as_tibble() is not necessary; I used it only for prettier output.

unlist(strsplit(string, "\\w{2}=\\w+\\K;", perl = TRUE))%>%
   sub(".*?(?:BOVIN|HUMAN) (.*?)(?= OS).*?GN=(\\w+).*|.*",  "\\1:\\2", ., perl = TRUE) %>%
   read.table(text=., sep=":") %>%
   as_tibble()

# A tibble: 14 x 2
   V1                                                                           V2      
   <chr>                                                                        <chr>   
 1 "Complement C3"                                                              "C3"    
 2 "C3-beta-c"                                                                  "C3"    
 3 "C3-beta-c"                                                                  "C3"    
 4 ""                                                                           ""      
 5 "Haptoglobin"                                                                "HP"    
 6 "Haptoglobin"                                                                "HP"    
 7 "Haptoglobin"                                                                "HP"    
 8 ""                                                                           ""      
 9 "Anion exchange protein"                                                     "SLC4A7"
10 "Isoform V3 of Versican core protein"                                        "VCAN"  
11 "Isoform V2 of Versican core protein"                                        "VCAN"  
12 "Versican core protein"                                                      "VCAN"  
13 "Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris)" "KRT10" 
14 "Keratin, type I cytoskeletal 10"                                            "KRT10"

Thank you for your answer! I'm sorry for not being clear when posting my question, but BOVIN and HUMAN were just two examples. If I want to make it flexible to any word, which will be capitalized entirely, should I substitute `BOVIN|HUMAN` by `[[:upper:]\\w]`? — Luiz Gustavo, Nov 17 '21 at 18:55
I'm sorry @Onyambu, but I tried here and it didn't work. I'm probably doing something stupid. Should it look like this: `".*?(?:\\b[A-Z]+\\b) (.*?)(?= OS).*?GN=(\\w+).*|.*"` I appreciate your help! — Luiz Gustavo, Nov 17 '21 at 22:38

score 2 · Accepted Answer · answered Nov 17 '21 at 18:53

Your "best" pattern is always the one that meets all your requirements. So, always start from defining the requirements: the match should start with..., the following chars can appear here, there... and the match should end with...

So, in your case, it seems you can discard the intermediate checks and just use

library(stringr)
str_match(string, '[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS=')[,2]

Since stringr::str_match keeps all captures, it helps immensely when you have to match a pattern inside a complex context. [,2] accesses the contents of Group 1.

The regex matches:

[a-z]{2} - two lowercase ASCII letters (here, there is no problem with performance, when you tell the regex to match a single char repeated X times, this is very efficient)
\| - a | char (again, this is fine, a literal is matched efficiently)
[|\w]* - zero or more | or word chars (this is backtracking prone since the next pattern matches an uppercase letter, which is also a word char, but here, we need this backtracking)
[A-Z] - an uppercase ASCII letter
\s+ - one or more whitespace chars
(.*?) - Group 1: zero or more chars other than line break chars as few as possible (this is the most resource consuming pattern here, as it will be expanded char after char if the subsequent patterns fail to match; also, it does not match line breaks by default, if you have line breaks, you need ((?s:.*?)))
\s+ - one or more whitespace chars
OS= - an OS= substring.
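To illustrate the "as few as possible" note on Group 1 in the list above, here is a minimal sketch comparing the lazy `(.*?)` with a greedy `(.*)`. The input is a hypothetical string in which two records are joined by `;` (the second record and its `X00000` accession are placeholders, not real data), so that ` OS=` occurs twice.

```r
library(stringr)

# Hypothetical input: two records joined by ";", so " OS=" appears twice.
x <- paste0(
  "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus GN=C3;",
  "sp|X00000|HPT_HUMAN Haptoglobin OS=Homo sapiens GN=HP"
)

# Lazy: Group 1 stops at the first " OS=" it can reach.
lazy <- str_match(x, "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS=")[, 2]

# Greedy: Group 1 runs to the last " OS=" in the string.
greedy <- str_match(x, "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*)\\s+OS=")[, 2]

print(lazy)    # "Complement C3"
print(greedy)  # spans both records, up to "Haptoglobin"
```

The parentheses are what make the text retrievable as Group 1; without them the pattern would still match, but `[, 2]` would have nothing to return.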

See the regex demo. See the R demo:

string <- 'sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2'
library(stringr)
str_match(string, '[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS=')[,2]

Output:

# => [1] "Complement C3"
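Since the question mentions a mix of species, here is a small sketch of the same call on a vector. Only the first element comes from the question; the HUMAN entry (with a placeholder accession `X00000`) and the non-matching line are made up for illustration. It also suggests that the `ifelse(str_detect(...), ..., NA_character_)` wrapper may be unnecessary, because `str_match` returns `NA` for elements that do not match.

```r
library(stringr)

# First element is from the question; the other two are hypothetical inputs.
strings <- c(
  "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2",
  "sp|X00000|HPT_HUMAN Haptoglobin OS=Homo sapiens GN=HP",
  "a line without the expected layout"
)

pattern <- "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS="

# Column 1 of the result is the whole match, column 2 is Group 1.
protein_name <- str_match(strings, pattern)[, 2]

print(protein_name)
# expected values: "Complement C3", "Haptoglobin", NA
```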

If you need to optimize the .*? pattern, you need to read up on and learn to use the unroll-the-loop approach. TL;DR:

[a-z]{2}\|[|\w]*[A-Z]\s+(\S*(?:\s(?!\s*OS=)\S*)*)\s+OS=

See this regex demo.

The .*? is transformed into \S*(?:\s(?!\s*OS=)\S*)* (note how the subsequent pattern is "sewn into" this construct), which matches

\S* - zero or more non-whitespace chars
(?:\s(?!\s*OS=)\S*)* - zero or more sequences of any whitespace that is not immediately followed with zero or more whitespaces and OS=, and then again zero or more non-whitespace chars.
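As a rough check of the unrolled pattern, the sketch below compares it with the lazy version and with the `(?s:.*?)` variant. It uses base R with `perl = TRUE` so that it runs without extra packages, on the assumption that these particular patterns behave the same way there as in stringr. The helper `group1` and the `two_line` input (the question's string with a line break inserted) are hypothetical. It makes no claim about speed.

```r
# The three patterns, written as R string literals.
lazy     <- "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(.*?)\\s+OS="
dotall   <- "[a-z]{2}\\|[|\\w]*[A-Z]\\s+((?s:.*?))\\s+OS="
unrolled <- "[a-z]{2}\\|[|\\w]*[A-Z]\\s+(\\S*(?:\\s(?!\\s*OS=)\\S*)*)\\s+OS="

# Hypothetical helper: Group 1 for each element, NA when there is no match.
group1 <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(hits, function(m) if (length(m) >= 2) m[2] else NA_character_,
         character(1))
}

one_line <- "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2"
# Same record with a line break inside the protein name (made-up input).
two_line <- "sp|Q2UVX4|CO3_BOVIN Complement\nC3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2"

# On a single line all three patterns capture "Complement C3".
res_one <- c(group1(one_line, lazy), group1(one_line, dotall), group1(one_line, unrolled))

# With a line break, plain .*? fails (NA); the other two capture "Complement\nC3".
res_two <- c(group1(two_line, lazy), group1(two_line, dotall), group1(two_line, unrolled))

print(res_one)
print(res_two)
```

One difference to keep in mind: because `\s` matches line breaks, the unrolled form behaves like `(?s:.*?)` rather than like plain `.*?` when the captured text spans several lines.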

Thank you so much, @Wiktor, for such a thorough response! The "unroll-the-loop" approach is impressive. It seems there is a ~40% reduction in the processing time. I can only imagine the impact on a huge list of strings. Wow! I'm really sorry, but the `[|\w]*` is still not 100% clear. Why you don't have to use ` \\\` before the `|`? Also, just to double-check, does `[|\w]*` stop once an empty character is found, and then `[A-Z]\\s+` allows the regexp to move forward? — Luiz Gustavo, Nov 17 '21 at 22:52
@LuizGustavo 1) Only special chars should be escaped in regex patterns. `|` inside a character class is NOT a special char, so no need escaping. 2) `[|\w]*` matches letters, digits, `_`, and `|` chars, so it does not match whitespace (if you mean "empty character" is a whitespace char, then yes, it will "stop once an empty character is found"). — Wiktor Stribiżew, Nov 17 '21 at 22:58
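The two points in the comment above can be checked with a short sketch in base R (`perl = TRUE` is used here for `\w`; the input is the start of the string from the question):

```r
x <- "sp|Q2UVX4|CO3_BOVIN Complement C3"

# Inside [...], | is a literal pipe: [|\w]* consumes pipes and word chars
# and stops at the first whitespace.
run <- regmatches(x, regexpr("[|\\w]*", x, perl = TRUE))
print(run)  # "sp|Q2UVX4|CO3_BOVIN"

# A pipe in a character class and an escaped pipe outside one find the same chars.
pipes_in_class <- lengths(regmatches(x, gregexpr("[|]", x, perl = TRUE)))
pipes_escaped  <- lengths(regmatches(x, gregexpr("\\|", x, perl = TRUE)))
print(c(pipes_in_class, pipes_escaped))  # 2 2
```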
