extracting multiple overlapping substrings

Question

i have strings of amino-acids like this:

x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"

and i would like to extract all non-overlapping substrings starting with M and finishing with *. so, for the above example i would need:

#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"

as the output. predictably regexpr gives me the greedy solution:

  regmatches(x, regexpr("M.+\\*", x))
 #[1] "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"

i have also tried things suggested here, as this is the question that resembles my problem the most (but not quite), but to no avail.

any help would be appreciated.

no, but someone has just suggested it in an answer below and it works! — stas g, Oct 19 '15 at 15:59
I don't understand, you say you know the difference between `greedy` and `non-greedy` ? How can that be ? — , Oct 19 '15 at 16:04
@Frank - I've never seen a regex tutorial that explains `greedy` without explaining `non-greedy`. One can't exist without the other. — , Oct 19 '15 at 16:21
@Frank - Greed follows quantifiers. Quantifiers are never explained without greed, to know one is to know the other. — , Oct 19 '15 at 16:26
@sln Anywho, I'll delete my comments in a few minutes (to clear up the "noise"). If there's more to say, I'm easily found in the "R public" chat. — Frank, Oct 19 '15 at 16:31

score 3 · Answer 1 · answered Oct 19 '15 at 15:56

Use a non-greedy .+? instead of .+, and switch to gregexpr for multiple matches:

R> regmatches(x, gregexpr("M.+?\\*", x))[[1]]
#"MEALYRAQVLVDLT*"                
#"MQLPSSFAALAAQFDQL*"             
#"MFSLLVASVFTPCSALPFWSIKFTLFILS*"
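
To make the two changes easier to see side by side, here is a minimal sketch on a short made-up string (`s` is hypothetical, not from the question). It only uses base R's `regexpr`, `gregexpr` and `regmatches`.

```r
# Hypothetical short input, chosen only to keep the output readable.
s <- "MAB*CD*MEF*"

# Greedy `.+` runs on to the LAST `*`, and regexpr() reports only the first match.
greedy <- regmatches(s, regexpr("M.+\\*", s))

# Lazy `.+?` stops at the FIRST `*` after each M, and gregexpr() reports every match.
lazy <- regmatches(s, gregexpr("M.+?\\*", s))[[1]]

print(greedy)  # "MAB*CD*MEF*"
print(lazy)    # "MAB*" "MEF*"
```

The `[[1]]` is there because `regmatches` returns a list with one element per input string when used with `gregexpr`.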

nrussell

thanks, that's perfect! i have tried `gregexpr` before, just forgot to include it in my question description. but it was the non-greedy `.+?` that did the trick. – stas g Oct 19 '15 at 16:01
You said "non-overlapping". This fails on `x <- "MABC*MabcMdef*ghi*"` – Pierre L Oct 19 '15 at 16:02

score 3 · Accepted Answer · answered Oct 19 '15 at 16:04

I will add an option for capture of non-overlapping patterns as you requested. We have to check that another pattern hasn't begun within our match:

regmatches(x, gregexpr("M[^M]+?\\*", x))[[1]]
#[1] "MEALYRAQVLVDLT*"               
#[2] "MQLPSSFAALAAQFDQL*"            
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
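
As a rough check of the difference, the sketch below runs both patterns on the test string from the comment on the other answer (`"MABC*MabcMdef*ghi*"`). The variable names are only for illustration.

```r
# Test string taken from the comment thread: a second M appears before the next `*`.
y <- "MABC*MabcMdef*ghi*"

# Lazy `.+?` accepts any character, so the second match swallows the inner M.
lazy_any <- regmatches(y, gregexpr("M.+?\\*", y))[[1]]

# `[^M]+?` refuses to cross another M, so the match restarts at the later M.
no_inner_M <- regmatches(y, gregexpr("M[^M]+?\\*", y))[[1]]

print(lazy_any)    # "MABC*" "MabcMdef*"
print(no_inner_M)  # "MABC*" "Mdef*"
```

Which of the two results is wanted depends on whether a match should start at the first M after a stop or at the last M before the next one.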

Pierre L

yes, that does it, even with intermittent `M`s - thank you – stas g Oct 19 '15 at 16:12

score 1 · Answer 3 · answered Oct 19 '15 at 15:54

M[^*]+\\*

Use a negated character class. See the demo. Also use the perl=True option.

https://regex101.com/r/tD0dU9/6
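
For reference, a small sketch of how the negated class behaves in base R, without any `perl` argument. The input is the `MABC*MabcMdef*ghi*` test string from the comments, and the second pattern is the `M[^M*]+\\*` variant mentioned there. The variable names are made up for the example.

```r
# Test string from the comments: an extra M sits between two stops.
y <- "MABC*MabcMdef*ghi*"

# `[^*]+` is greedy but cannot cross a `*`, so each match ends at the first stop.
stop_only <- regmatches(y, gregexpr("M[^*]+\\*", y))[[1]]

# Adding M to the negated class also forbids a second M inside a match.
stop_and_M <- regmatches(y, gregexpr("M[^M*]+\\*", y))[[1]]

print(stop_only)   # "MABC*" "MabcMdef*"
print(stop_and_M)  # "MABC*" "Mdef*"
```

Inside the brackets `*` is a literal character, so it needs no escaping there.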

vks

thanks! it works perfectly. i have tried negating `*` but didn't realise that i then didn't need `.` etc. – stas g Oct 19 '15 at 16:02
Works fine (giving the same result as the other answer) without `perl=TRUE`. Besides, in R, you'd have to write it all caps. – Frank Oct 19 '15 at 16:03
@Frank we can use `M[^M*]+\\*` where the other answer will fail on `MABC*MabcMdef*ghi*` – vks Oct 19 '15 at 16:04
@vks Okay. Seems too late to alter your answer, arguably, as Pierre just posted that. Up to you. – Frank Oct 19 '15 at 16:07
