Does a regular expression exist for enzymatic cleavage?

Question

Does a regular expression exist for (theoretical) tryptic cleavage of protein sequences? The cleavage rule for trypsin is: after R or K, but not before P.

Example:

Cleavage of the sequence VGTRCCTKPESERMPCTEDYLSLILNR should result in these 3 sequences (peptides):

 VGTR
 CCTKPESER
 MPCTEDYLSLILNR

Note that there is no cleavage after K in the second peptide (because P comes after K).

In Perl (it could just as well have been in C#, Python or Ruby):

  my $seq = 'VGTRCCTKPESERMPCTEDYLSLILNR';
  my @peptides = split /someRegularExpression/, $seq;

I have used this work-around (where a cut marker, =, is first inserted in the sequence and removed again if P is immediately after the cut marker):

  my $seq      = 'VGTRCCTKPESERMPCTEDYLSLILNR';
  $seq         =~ s/([RK])/$1=/g; #Main cut rule.
  $seq         =~ s/=P/P/g;       #The exception.
  my @peptides = split( /=/, $seq);

But this requires modifying a string that can be very long, and there can be millions of sequences. Is there a way to use a regular expression directly with split? If so, what would the regular expression be?

Test platform: Windows XP 64 bit. ActivePerl 64 bit. From perl -v: v5.10.0 built for MSWin32-x64-multi-thread.

@unknown: Context... From http://en.wikipedia.org/wiki/Tryptic: "Trypsins are considered endopeptidases, i.e., the cleavage occurs within the polypeptide chain rather than at the terminal amino acids located at the ends of polypeptides." — Peter Mortensen, Dec 04 '09 at 20:18
Possibly the best SO question ever. How many others can boast code, science, and sexual inferences all at once, whilst at the same time being completely valid and answerable? — shuckster, Dec 06 '09 at 00:02

Gabriel Reid · Accepted Answer · 2009-12-04T20:51:58.013

16

You indeed need to use the combination of a positive lookbehind and a negative lookahead. The correct (Perl) syntax is as follows:

my @peptides = split(/(?!P)(?<=[RK])/, $seq);
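Since the question notes that the language could just as well be Python, here is a minimal sketch of the same pattern with `re.split`. It assumes Python 3.7 or later, where `re.split` can split on zero-width matches. The helper name `tryptic_peptides` is made up for illustration.

```python
import re

# Zero-width split point: preceded by R or K, not followed by P.
# These are the same two assertions as in the Perl pattern above.
TRYPTIC_SITE = re.compile(r"(?!P)(?<=[RK])")

def tryptic_peptides(seq):
    """Split seq at theoretical tryptic cleavage sites (hypothetical helper)."""
    # Python keeps the empty field produced by a match at the very end of
    # the string (Perl's split drops it), so empty pieces are filtered out.
    return [p for p in TRYPTIC_SITE.split(seq) if p]

print(tryptic_peptides("VGTRCCTKPESERMPCTEDYLSLILNR"))
# ['VGTR', 'CCTKPESER', 'MPCTEDYLSLILNR']
```

The order of the two assertions does not matter, because both are tested at the same position and neither consumes any characters.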

edited Dec 04 '09 at 20:51

answered Dec 04 '09 at 19:30

Gabriel Reid


2

Sure you mean negative lookahead and positive lookbehind. – Anon. Dec 04 '09 at 19:32

Gumbo · Answer 2 · 2009-12-04T19:40:01.680

6

You could use look-around assertions to exclude those cases. Something like this should work:

split(/(?<=[RK](?!P))/, $seq)

edited Dec 04 '09 at 19:40

answered Dec 04 '09 at 19:20

Gumbo


Apologies if I am wrong, but wouldn't this end up splitting before the R/K in the sequence, rather than after? – Anon. Dec 04 '09 at 19:27
Indeed, this won't work. The RK needs to be a positive lookbehind (?<=...) – Gabriel Reid Dec 04 '09 at 19:31
@Anon and gab: yes, it results in cut before; 4 peptides: VGT, RCCTKPESE, RMPCTEDYLSLILN and R – Peter Mortensen Dec 04 '09 at 19:36

Anon. · Answer 3 · 2009-12-04T19:48:41.627

4

You can use lookaheads and lookbehinds to match this stuff while still getting the correct position.

/(?<=[RK])(?!P)/

Should end up splitting on a point after an R or K that is not followed by a P.

edited Dec 04 '09 at 19:48

answered Dec 04 '09 at 19:23

Anon.


`/(?<[RK])(?=[^P])/` avoids splitting off an empty string at the end – ysth Dec 04 '09 at 19:31
For both: not accepted at compile time. For ysth's: "Sequence (?<[...) not recognized in regex; marked by <-- HERE in m/(?<[ <-- HERE RK])(?=[^P])/". I have updated the question with platform information. – Peter Mortensen Dec 04 '09 at 19:43
Sorry, I messed up the positive lookbehind syntax. It should be `(?<=...`. I'll correct it. – Anon. Dec 04 '09 at 19:48
and I copied his error :( - it should be `/(?<=[RK])(?=[^P])/` - but since split by default removes trailing empty fields, it would only matter if you were splitting a fixed number of fields or using the regex with something other than split. – ysth Dec 04 '09 at 20:01
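The trailing empty field discussed in these comments is easier to see in a language whose split does not drop it. Below is a small Python sketch (assuming Python 3.7+, where `re.split` accepts zero-width matches) that compares the negative lookahead with ysth's corrected positive lookahead. The variable names are illustrative only.

```python
import re

seq = "VGTRCCTKPESERMPCTEDYLSLILNR"

# Negative lookahead: also matches at the very end of the string,
# because "not followed by P" holds when nothing follows at all.
with_negative = re.split(r"(?<=[RK])(?!P)", seq)

# Positive lookahead for a non-P character: requires a next character,
# so there is no match after the final R.
with_positive = re.split(r"(?<=[RK])(?=[^P])", seq)

print(with_negative)  # ['VGTR', 'CCTKPESER', 'MPCTEDYLSLILNR', '']
print(with_positive)  # ['VGTR', 'CCTKPESER', 'MPCTEDYLSLILNR']
```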

score 1 · Answer 4 · edited Dec 14 '09 at 22:25

1

In Python you can use the finditer method to return non-overlapping pattern matches including start and span information. You can then store the string offsets instead of rebuilding the string.
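A rough sketch of that idea, assuming the zero-width cleavage pattern `(?<=[RK])(?=[^P])` (after R or K, before any residue other than P). The generator `peptide_spans` is a hypothetical helper, not part of the `re` module. It records only (start, end) offsets, so the sequence itself is never modified or copied until a peptide is actually needed.

```python
import re

# Zero-width cleavage site: after R or K, before a residue that is not P.
CLEAVAGE_SITE = re.compile(r"(?<=[RK])(?=[^P])")

def peptide_spans(seq):
    """Yield (start, end) offsets of each peptide in seq (hypothetical helper)."""
    start = 0
    for match in CLEAVAGE_SITE.finditer(seq):
        yield start, match.start()
        start = match.start()
    # The last peptide runs from the final cleavage site to the end.
    if start < len(seq):
        yield start, len(seq)

seq = "VGTRCCTKPESERMPCTEDYLSLILNR"
spans = list(peptide_spans(seq))
print(spans)                         # [(0, 4), (4, 13), (13, 27)]
print([seq[a:b] for a, b in spans])  # ['VGTR', 'CCTKPESER', 'MPCTEDYLSLILNR']
```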

edited Dec 14 '09 at 22:25

Peter Mortensen


answered Dec 04 '09 at 19:25

Joe Koberg


2

perl can do this as well. See http://stackoverflow.com/questions/467800/is-there-a-perl-equivalent-of-pythons-re-findall-re-finditer-iterative-regex-re – dwarring Dec 04 '09 at 20:01
