How do I search a sequence for a match that does not contain a certain substring? For example, I want to search an RNA sequence for a stretch that starts with CG, does not contain AG in the middle, and then ends with AG. When I run

regexp(mRNA, 'GU\w+[^AG]AG');

it gives me the locations of matches that don't contain either A or G in the middle, rather than matches that don't contain the substring AG. I would really appreciate the help!

Henkersmann
  • Possible duplicate of [Regular expression to match a line that doesn't contain a word?](https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word) – CAustin Feb 08 '18 at 18:04

1 Answer

The following regular expression should work fine:

^(?!CG.*AG.*AG)CG.*AG$

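It uses a negative lookahead assertion: the (?!...) group at the start rejects any string in which AG occurs a second time after the leading CG, so the only AG allowed is the one at the very end. Because the pattern is anchored with ^ and $, it tests each string as a whole.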
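If the goal is instead to locate such stretches inside one longer sequence, as the question seems to ask, an unanchored variant may be closer to what is needed. The following is only a sketch under that assumption: the sequence `seq` is made up for illustration, and the pattern checks at every position between CG and the closing AG that AG does not begin there.

```matlab
% Hypothetical sequence, invented for this example.
seq = 'UUCGUUAGCCCGAGAG';

% CG, then any run of characters at which 'AG' does not begin, then AG.
% (?!AG). consumes one character only if 'AG' does not start at it.
pattern = 'CG(?:(?!AG).)*AG';

% Outputs follow the order of the requested options.
[hits, starts] = regexp(seq, pattern, 'match', 'start');

disp(hits)    % matched substrings: 'CGUUAG' and 'CGAG'
disp(starts)  % their start indices: 3 and 11
```

Each match here ends at the first AG after its CG, so no match can contain AG in the middle.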

If you aren't sure that your strings contain only uppercase characters, you can replace regexp with regexpi, which performs case-insensitive matching. Example:

mRNA = {
 'CGUUAG'
 'CGCGUCAG'
 'AGUCGUAG'
 'CGAGAG'
 'UUAGAGCUUAGC'
 'CGCGCGAG'
 'CGAG'
 'AAGCCU'
 'GACU'
};

regexpi(mRNA,'^(?!CG.*AG.*AG)CG.*AG$')

ans =
      [1]
      [1]
      []
      []
      []
      [1]
      [1]
      []
      []
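The cell array returned above holds a start index for each matching string and an empty array otherwise. One possible way to turn that into a logical mask and pull out the matching sequences is sketched below; the variable names `starts`, `isMatch` and `matching` are chosen here for illustration.

```matlab
mRNA = {'CGUUAG'; 'CGCGUCAG'; 'AGUCGUAG'; 'CGAGAG'; 'UUAGAGCUUAGC'; ...
        'CGCGCGAG'; 'CGAG'; 'AAGCCU'; 'GACU'};

% One cell per input string: the start index of the match, or [].
starts = regexpi(mRNA, '^(?!CG.*AG.*AG)CG.*AG$');

% true where a match was found, false where the cell is empty.
isMatch = ~cellfun(@isempty, starts);

% Keep only the sequences that matched.
matching = mRNA(isMatch);
disp(matching)
```

With the sample data, `matching` should contain 'CGUUAG', 'CGCGUCAG', 'CGCGCGAG' and 'CGAG', in line with the output shown above.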
Tommaso Belluzzo