Different digestion enzymes cut at different positions within a protein sequence; the most commonly used one is trypsin. It cuts according to the following rules:
1) Cuts the sequence after an arginine (R)
2) Cuts the sequence after a lysine (K)
3) Does not cut if the lysine (K) or arginine (R) is followed by a proline (P)
Okay, hooray, rules! Let's turn this into pseudo-code to describe the same algorithm in a halfway state between the original prose and code. (While a Regex.Split approach would still work, this might be a good time to explore some more fundamentals.)
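For comparison, here is a rough sketch of what the Regex.Split route could look like. The module and function names are made up for illustration, and the pattern is only my reading of the three rules: split at the zero-width position just after a K or R, unless the next letter is P. Regex.Split can leave an empty trailing piece when the input ends in K or R, so empty pieces are filtered out.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module RegexDigest
    ' Hypothetical helper name; not part of the original answer.
    Function DigestWithRegex(input As String) As String()
        ' (?<=[KR]) : the position just after a K or an R
        ' (?!P)     : ... provided the next letter is not a P
        ' The match is zero-width, so no letters are consumed by the split.
        Return Regex.Split(input, "(?<=[KR])(?!P)").
            Where(Function(piece As String) piece.Length > 0).
            ToArray()
    End Function

    Sub Main()
        ' Made-up sequence, for illustration only.
        For Each piece As String In DigestWithRegex("AKRPGKC")
            Console.WriteLine(piece)
        Next
    End Sub
End Module
```

For the made-up input AKRPGKC this should print AK, RPGK and C on separate lines: the R is followed by P, so no cut happens there. Now, back to the fundamentals and the pseudo-code: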
    let the list of words be an empty array
    let current word be the empty string
    for each letter in the input:
        if the letter is R or K and the next letter is NOT P then:
            add the letter to the current word
            save the current word to the list of words
            reset the current word to the empty string
        otherwise:
            add the letter to the current word
    if after the loop the current word is not the empty string then:
        add the current word to the list of words
Then let's see how some of these translate. This is incomplete and quite likely contains minor errors[1] beyond those that have been called out in comments.
    Dim words As New List(Of String)
    Dim word = ""
    ' A basic loop works suitably for string input and it can also be
    ' modified for a Stream that supports a Peek of the next character.
    For i As Integer = 0 To input.Length - 1
        Dim letter = input(i)
        ' TODO/FIX: input(i + 1) can access an element off the end of the string. Reader exercise.
        If (letter = "R"c OrElse letter = "K"c) AndAlso input(i + 1) <> "P"c Then
            ' StringBuilder is more efficient for larger strings
            word = word & letter
            words.Add(word) ' or perhaps just WriteLine the word somewhere?
            word = ""
        Else
            word = word & letter
        End If
    Next
    ' TODO/FIX: consider when last word is not added to words list yet
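If you get stuck on the two TODO items, here is one possible way to resolve them, as a minimal sketch rather than the definitive version. The module and function names are hypothetical. It also assumes that an R or K at the very end of the input counts as a cut, since nothing follows it, and it swaps the string concatenation for a StringBuilder as the comment above suggests.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module TrypsinDigest
    ' Hypothetical function name; the snippet above is not wrapped in a function.
    Function Digest(input As String) As List(Of String)
        Dim words As New List(Of String)
        Dim word As New StringBuilder()

        For i As Integer = 0 To input.Length - 1
            Dim letter As Char = input(i)
            ' The letter joins the current word in both branches of the pseudo-code.
            word.Append(letter)

            ' Only peek at the next letter when there is one; at the end of
            ' the input nothing follows, so a trailing R or K still cuts.
            Dim nextIsProline As Boolean =
                i + 1 < input.Length AndAlso input(i + 1) = "P"c

            If (letter = "R"c OrElse letter = "K"c) AndAlso Not nextIsProline Then
                words.Add(word.ToString())
                word.Clear()
            End If
        Next

        ' Save whatever is left over after the last cut.
        If word.Length > 0 Then
            words.Add(word.ToString())
        End If

        Return words
    End Function

    Sub Main()
        ' Made-up sequence, for illustration only.
        For Each piece As String In Digest("AKRPGKC")
            Console.WriteLine(piece)
        Next
    End Sub
End Module
```

The bounds check relies on AndAlso short-circuiting: input(i + 1) is only evaluated when i + 1 is a valid index.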
[1] As I use C# (and not VB.NET), the above code comes warranty-free. Here are some quick reference links I used to 'stitch' this together: