I have strings of amino acids like this:

x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"

I would like to extract all non-overlapping substrings that start with M and end with *. For the example above, I would need:

#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"

as the output. Predictably, regexpr gives me the greedy match:

  regmatches(x, regexpr("M.+\\*", x))
 #[1] "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"

I have also tried the things suggested here, since that question resembles my problem the most (though not exactly), but to no avail.

Any help would be appreciated.

stas g
  • Have you tried non-greedy? `M.+?\\*` – lintmouse Oct 19 '15 at 15:57
  • no, but someone has just suggested it in an answer below and it works! – stas g Oct 19 '15 at 15:59
  • I don't understand, you say you know the difference between `greedy` and `non-greedy`? How can that be? –  Oct 19 '15 at 16:04
  • @Frank - I've never seen a regex tutorial that explains `greedy` without explaining `non-greedy`. One can't exist without the other. –  Oct 19 '15 at 16:21
  • @Frank - Greed follows quantifiers. Quantifiers are never explained without greed, to know one is to know the other. –  Oct 19 '15 at 16:26
  • @sln Anywho, I'll delete my comments in a few minutes (to clear up the "noise"). If there's more to say, I'm easily found in the "R public" chat. – Frank Oct 19 '15 at 16:31

3 Answers

3

Use a non-greedy .+? instead of .+, and switch to gregexpr for multiple matches:

R> regmatches(x, gregexpr("M.+?\\*", x))[[1]]
#"MEALYRAQVLVDLT*"                
#"MQLPSSFAALAAQFDQL*"             
#"MFSLLVASVFTPCSALPFWSIKFTLFILS*"
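
To see the two changes separately, here is a minimal sketch on a short made-up string (`s` and the variable names are hypothetical, not from the question). It compares the greedy and non-greedy quantifier with `regexpr`, which returns only the first match, and then switches to `gregexpr`, which returns every match.

```r
# Toy input (hypothetical): two M...* stretches separated by filler
s <- "MAB*xxMCD*yy"

# Greedy, first match only: .+ runs on to the last * in the string
greedy_first <- regmatches(s, regexpr("M.+\\*", s))

# Non-greedy, first match only: .+? stops at the first * after the M
lazy_first <- regmatches(s, regexpr("M.+?\\*", s))

# Non-greedy, all matches: gregexpr returns a list with one element
# per input string, hence the [[1]]
lazy_all <- regmatches(s, gregexpr("M.+?\\*", s))[[1]]

print(greedy_first)  # "MAB*xxMCD*"
print(lazy_first)    # "MAB*"
print(lazy_all)      # "MAB*" "MCD*"
```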
nrussell
  • Thanks, that's perfect! I have tried `gregexpr` before, I just forgot to include it in my question description. But it was the non-greedy `.+?` that did the trick. – stas g Oct 19 '15 at 16:01
  • You said "non-overlapping". This fails on `x <- "MABC*MabcMdef*ghi*"` – Pierre L Oct 19 '15 at 16:02
3

Here is an option that captures non-overlapping patterns, as you requested. We have to check that another pattern hasn't begun within our match, so the characters after the leading M must not include another M:

regmatches(x, gregexpr("M[^M]+?\\*", x))[[1]]
#[1] "MEALYRAQVLVDLT*"               
#[2] "MQLPSSFAALAAQFDQL*"            
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
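
On the question's string the result is the same as with `M.+?\\*`, because no match there contains a second M. The difference only shows on an input like the one from the comments, `"MABC*MabcMdef*ghi*"`. A small sketch of that comparison (the variable names are illustrative):

```r
# Counterexample from the comments: a second M appears before the next *
y <- "MABC*MabcMdef*ghi*"

# Non-greedy dot: the second match starts at the first M it finds
# and swallows the inner M
lazy <- regmatches(y, gregexpr("M.+?\\*", y))[[1]]

# Excluding M: a match cannot contain a second M, so the attempt at
# "Mabc" fails and the match starts at "Mdef" instead
no_inner_m <- regmatches(y, gregexpr("M[^M]+?\\*", y))[[1]]

print(lazy)        # "MABC*" "MabcMdef*"
print(no_inner_m)  # "MABC*" "Mdef*"
```

Which of the two results is the right one depends on whether an M inside a stretch should count as the start of a new one.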
Pierre L
1
M[^*]+\\*

Use a negated character class. See the demo. Also use the perl=True option.

https://regex101.com/r/tD0dU9/6
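
In R, the pattern above could be wrapped roughly as follows. This is only a sketch: `extract_m_to_stop` and its `allow_inner_m` argument are hypothetical names, and the stricter `M[^M*]+\\*` branch is the variant mentioned in the comments on this answer. Because `[^*]+` can never cross a `*`, a plain greedy `+` is enough.

```r
# Hypothetical helper: pull M...* stretches out of each string in a vector.
# allow_inner_m = FALSE switches to the stricter pattern that also
# forbids a second M inside a match.
extract_m_to_stop <- function(seqs, allow_inner_m = TRUE) {
  pattern <- if (allow_inner_m) "M[^*]+\\*" else "M[^M*]+\\*"
  # perl = TRUE follows the answer's suggestion; the comments note the
  # pattern also works with R's default engine
  regmatches(seqs, gregexpr(pattern, seqs, perl = TRUE))
}

y <- "MABC*MabcMdef*ghi*"
with_inner    <- extract_m_to_stop(y)[[1]]
without_inner <- extract_m_to_stop(y, allow_inner_m = FALSE)[[1]]

print(with_inner)     # "MABC*" "MabcMdef*"
print(without_inner)  # "MABC*" "Mdef*"
```

The function returns a list with one character vector per input string, so it can be applied to a whole vector of sequences at once.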

vks
  • Thanks! It works perfectly. I had tried negating `*` but didn't realise that I then didn't need the `.` etc. – stas g Oct 19 '15 at 16:02
  • Works fine (giving the same result as the other answer) without `perl=TRUE`. Besides, in R, you'd have to write it all caps. – Frank Oct 19 '15 at 16:03
  • @Frank we can use `M[^M*]+\\*` where the other answer will fail on `MABC*MabcMdef*ghi*` – vks Oct 19 '15 at 16:04
  • @vks Okay. Seems too late to alter your answer, arguably, as Pierre just posted that. Up to you. – Frank Oct 19 '15 at 16:07