
Hey, so I have a school project in which I need to split a massive word into smaller words. This is the massive sequence of letters:

'GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASEDLKKHGTVVLTALGGILKKKEGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHRPGDFGADAQGAMTKALELFRNDIAAKYKELGFQG'

Then I need to split it into smaller, separate parts of itself, which would look like this:

'GLSDGEWQQVLNVWGK' 
'VEADIAGHGQEVLIR' 
'LFTGHPETLEK' 
'FDK' 
'FK' 
'HLK' 
'TEAEMK' 
'ASEDLK' 
'K'   
'HGTVVLTALGGILK' 
'K' 
'K' 
'EGHHEAELKPLAQSHATK' 
'HK' 
'IPIK' 
'YLEFISDAIIHVLHSK' 
'HRPGDFGADAQGAMTK' 
'ALELFR' 
'NDIAAK' 
'YK' 
'ELGFQG' 

I have no idea how to start on this. If you could help, please and thanks.

Bijay Regmi
  • The rules for "look like" are quite important, no? I suspect it's "ends in a specific set of letters". Regardless, `Regex.Split` (capturing the separator) and/or `string.IndexOf/Substring` (and a loop) will likely be useful here. – user2864740 Mar 23 '21 at 22:32
  • See https://stackoverflow.com/questions/2710617/how-to-find-which-delimiter-was-used-during-string-split-vb-net/2710653 , https://www.dotnetperls.com/indexof-vbnet , etc. – user2864740 Mar 23 '21 at 22:37
  • So, as for "no idea how to start": start by defining the problem/task in *sufficient detail*. At this point, it should be possible to explain the problem to someone, such that they too understand the ask. – user2864740 Mar 23 '21 at 22:39
  • OK, so the problem I have to solve is a protein sequence that gets cut from the massive sequence in the first example into all the smaller parts of itself, which look randomized, as you can see. I think it could be an array, but I don't really know how to start on this. – Marian Lime Mar 23 '21 at 22:43
  • As mentioned, it's not clear what the rules are. What makes you cut one string "IPIK" and another that is "YK"? Do you have an array of lengths for the protein sequences? – LarsTech Mar 23 '21 at 22:46
  • Different digestion enzymes cut at different positions within a protein sequence; the most commonly used one is trypsin. It follows the following rules: 1) Cuts the sequence after an arginine (R) 2) Cuts the sequence after a lysine (K) 3) Does not cut if lysine or arginine is followed by proline (P) Consider the following protein sequence for apomyoglobin: 'Following the rules above, the apomyoglobin sequence would produce the following fragments (shown as they appear in the sequence from the amino‐ to the carboxyterminus of a protein): – Marian Lime Mar 23 '21 at 22:50
  • This is what the exercise says, straight from my school book. Obviously I could not include the sequence and the cut-up parts of it, as they are too long, but assume that the sequence goes before the first step and the cut parts go after the "of a protein" part. – Marian Lime Mar 23 '21 at 22:51
  • So you are looking for Ks and Rs, etc? Start looping and inspecting characters. While looping, if you hit one of those key characters, you start a new line. Something like that. – LarsTech Mar 23 '21 at 22:58
  • Would you be able to show an example of this in VB, as that is the part I am struggling with? – Marian Lime Mar 23 '21 at 23:07
  • You are not the only one in your class who is confused about this: https://stackoverflow.com/questions/66764824/how-can-i-replace-a-string-without-the-replace-function – Caius Jard Mar 24 '21 at 00:07
  • That's the same thing I am searching for an answer to as well. – Marian Lime Mar 24 '21 at 00:31
  • @Caius Jard [String.Split() removes delimiter characters](https://stackoverflow.com/q/66671705/7444103). This is, at least, the fourth *account* or the fourth person that asks the same question. They should probably team up. – Jimi Mar 24 '21 at 04:51
  • @Jimi [Oh yeah, so there is](https://stackoverflow.com/questions/66756699/having-trouble-developing-protein-sequence-segmentation) Hah.. Perhaps it would be good feedback for their teacher - "your students aren't really getting this one, they're just hitting SO en masse to get their homework done". Schools should probably teach in VB more; fewer posts about it and fewer people answering make it more obvious when this happens - or if SE would set up a homework site and we kick all the questions there, it's a one stop shop for schools to keep an eye on plagiarism – Caius Jard Mar 24 '21 at 05:47

1 Answer


Different digestion enzymes cut at different positions within a protein sequence; the most commonly used one is trypsin. It follows the following rules: 1) Cuts the sequence after an arginine (R) 2) Cuts the sequence after a lysine (K) 3) Does not cut if lysine (K) or arginine (R) is followed by proline (P).

Okay, hooray, rules! Let's turn this into pseudo-code to describe the same algorithm in a half-way state between the original prose and code. (While a Regex.Split approach would still work, this might be a good time to explore some more fundamentals.)
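For comparison, here is a rough sketch of what the `Regex.Split` route could look like. It splits on a zero-width position rather than on a character, so no letters are consumed. The module name `TrypsinRegex`, the function name `DigestWithRegex`, and the short sample input are made up for illustration; they are not from the question.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module TrypsinRegex
    ' Hypothetical helper: cuts after every K or R that is not followed by P.
    Function DigestWithRegex(input As String) As String()
        ' (?<=[KR]) matches the position right after a K or an R.
        ' (?!P)     rejects that position when the next letter is P.
        Dim pieces As String() = Regex.Split(input, "(?<=[KR])(?!P)")

        ' A cut at the very end of the input leaves an empty trailing piece; drop it.
        Return pieces.Where(Function(piece) piece.Length > 0).ToArray()
    End Function

    Sub Main()
        ' Made-up sample, not the apomyoglobin sequence from the question.
        For Each fragment As String In DigestWithRegex("AKRPGKPLRM")
            Console.WriteLine(fragment)
        Next
        ' Prints AK, RPGKPLR and M on separate lines.
    End Sub
End Module
```

The pseudo-code below takes the loop-based route instead.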

 let the list of words be an empty array
 let current word be the empty string

 for each letter in the input:
     if the letter is R or K and the next letter is NOT P then:
        add the letter to the current word
        save the current word to the list of words
        reset the current word to the empty string
     otherwise:
        add the letter to the current word

 if after the loop the current word is not the empty string then:
     add the current word to the list of words
        

Then let's see how some of these translate. This is incomplete and quite likely contains minor errors [1] beyond those called out in the comments.

' Assumes input is a String holding the full protein sequence.
Dim words As New List(Of String)
Dim word = ""

' A basic loop works suitably for string input and it can also be
' modified for a Stream that supports a Peek of the next character.
For i As Integer = 0 To input.Length - 1
    Dim letter = input(i)
    ' TODO/FIX: input(i + 1) can access an element off the string. Reader exercise.
    If (letter = "R"c OrElse letter = "K"c) AndAlso input(i + 1) <> "P"c Then
        ' StringBuilder is more efficient for larger strings
        word = word & letter
        words.Add(word) ' or perhaps just WriteLine the word somewhere?
        word = ""
    Else
        word = word & letter
    End If
Next

' TODO/FIX: consider when last word is not added to words list yet
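One possible way to fill in the two TODO items is sketched below. It is not necessarily how the exercise expects it to be done. The module name `TrypsinDigest`, the function name `Digest`, and the short sample input are assumptions made for illustration, and `StringBuilder` stands in for plain string concatenation.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module TrypsinDigest
    ' Hypothetical helper: cuts after every K or R that is not followed by P.
    Function Digest(input As String) As List(Of String)
        Dim words As New List(Of String)
        Dim word As New StringBuilder()

        For i As Integer = 0 To input.Length - 1
            Dim letter As Char = input(i)
            word.Append(letter)

            ' Only inspect input(i + 1) when that position exists.
            Dim nextIsProline As Boolean = (i + 1 < input.Length) AndAlso (input(i + 1) = "P"c)

            If (letter = "K"c OrElse letter = "R"c) AndAlso Not nextIsProline Then
                words.Add(word.ToString())
                word.Clear()
            End If
        Next

        ' Keep whatever is left after the last cut.
        If word.Length > 0 Then words.Add(word.ToString())
        Return words
    End Function

    Sub Main()
        ' Made-up sample, not the apomyoglobin sequence from the question.
        For Each fragment As String In Digest("AKRPGKPLRM")
            Console.WriteLine(fragment)
        Next
        ' Prints AK, RPGKPLR and M on separate lines.
    End Sub
End Module
```

Appending the letter before the check removes the duplicated "add the letter" step from both branches, and the bounds test short-circuits so the last letter never triggers an out-of-range read.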

[1] As I use C# (and not VB.NET), the above code comes warranty-free. Here are some quick reference links I used to 'stitch' this together:

user2864740