I have a large string, s, and I need to break it up and extract the subsequences that start with ATG and end with TGA.
The substrings can be any length, but they must exist within the string.
s ="CTCGAGACTAGAGCTAGATAAAAAAAATTTTTATTTATTTTTATTTATTTTGAATTAAATAGATTACAAATTAATTAATCCCATCAAATCTTTAAAAAAAAATGGTTTAAAAAAACTTGGGTTGGTTAATTATTATTTGAAAATTTTAAAACCCAAATTAAAAAAAAAAAATGGGATTCAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCAGATTGCATAAAAAGATTTTTTTTTTTTTTTTTTCTTATTTCTTAAAACAAATAAATTAAATTAAATAAAAAATAAAAATCAGATCTTCAACTAGTGGTGGTTCAGGAGGTAGTGTGAGTAAAGGTGAAGAAGATAATATGGCATCGTTACCAGCTACACATGAGTTACATATATTCGGTAGCATTAATGGTGTTGATTTTGATATGGTGGGACAAGGTACCGGTAATCCTAATGATGGTTACGAAGAACTAAATTTAAAATCGACTAAAGGTGACTTACAATTTTCTCCATGGATTTTAGTGCCACATATAGGGTATGGTTTTCATCAATACTTACCATATCCAGATGGTATGTCACCATTTCAAGCTGCAATGGTTGATGGATCAGGTTATCAAGTTCATAGAACAATGCAATTTGAAGATGGTGCTTCATTAACTGTTAATTATAGATACACATATGAAGGCTCACATATTAAAGGTGAAGCTCAAGTTAAAGGTACTGGTTTCCCAGCCGATGGCCCAGTTATGACAAATAGTTTAACAGCAGCAGATTGGTGTAGATCCAAAAAAACTTATCCAAATGATAAAACAATTATTTCAACTTTTAAATGGTCATATACAACCGGTAATGGTAAACGTTATCGTTCAACAGCCCGTACAACATATACTTTTGCTAAACCAATGGCAGCTAATTATTTAAAAAATCAACCAATGTATGTTTTTCGTAAAACAGAGTTAAAACATTCAAAAACAGAACTTAATTTTAAAGAATGGCAAAAAGCATTTACAGACGTTATGTAAGCTAGTAGTTAAATAAATAAATTATTTAATAAATAATAAAAAAACAAATTGTTGTAATAATCTAATATTTTCTTTTTTTTTTAATTTTTTTTTTTTAAATCTTAATAATTATTAAGTTATTTTAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTATCAAAAAAATCAAATATATTTAAAAAATTTATTATTTACAGATACATTTTGAATGGTGAAGATAAATATATGCATTAGATGTAAAACAGCCAAAGAGTATGAAAATCAAAAAGATAAAGCTTATCGATTTCGAAAAAGTAAATAGCAATTATTACAAAATTCAATCCGAATCTACCCAAATAAATTCCAATGAAATTGCCGATTTAAAAAAGTTTATTAAAGAAGAAGTCAATAAAACTTCTTCCAAAATTGATTTCTTTTTAGTTTCTTCAACAGATGCCCTTTCAAATCCAGAAAATTATTCTCTCTTAGAAGTAAAGTGTATTAATTGTCATTCTTTGTGTCAAGGAAAAAATTTATATATTTCATGTACAAGAGATGGATGTCAAAACAATATTTGCTATAATTGTTTAGGAATAAACATAAACATATATAATGTTGTTATTAATTCTAAACTTTGCCCTCCATGTTTCAATGATTCGGTAATCAACAAGAAGTGTGCCATGTGTAGTAAGAACGGAACTAAATGTAATTTGAACCAAGAATGTAAACTTCATCTTTGTGCACAGTGTTCTAAAAAGTGTCTATACATTCTGAGAGTCAAAACTAATTAAATAAAATATAAACTTAATTTCTAAATAAACTCATTTAAAAATATTTAAATAATATGAATTTATAACTGTAATTATTGTATTAAAAAATTATATAATTATTTAATGTTAAAAATGTATTAAAATAATTATAAAAAAATATAACAAAAATTTTCGTAAAAATAATTTGTAAAAAAGCTATTAAAAATATTATG
AAAAAAAAATTAAAAAAATTATTAAATTGTTTTTGTAATTAAGCTATTAAAATAATTATAAAAAAAAAATTTTTAAAATTTTAAAAATATTTTTTGTAAAAAAGTATTAAAATAATTATGAAAAAAAAATTTTCTAAAAAATTAAAAAAAAAATTAAAATATATTTTATGTTAAAAACGTATTAAAATAACTATTAAAAAAATTATATTTAAAAAAGTATTAACTTTTTTTTAGGTGTGGTTGTGGGGTGGGGTTTAATATATTATAATAAAAAATTATTTTTTGTTCATTTATTATTTTCATTGTATATAATGTACTCAACAACGTTATTATTTTTTCTTTTTTTTTTTATTGTATCAAAATCTTCTGTTCTTCAAAATGATCAGATTGAAGTAAAATATTTTCAACTTCTTATTGTTATGTATCAAAAAGAAAACTGTGTTGAAAAGTCAATGACAGGCGCCGTAATTTATGATGAATGTAATATTCATGGAAGAGTTGAAACAAATAGTACTCATGCGCTTTTTTATGATGACATTGAAACAAATAATTCAAGATGTAACAATTTTCGTAATTTAACAAACTTAATTAAACTTAATGAATGTATTAATGACGAGTTTGGAGAGTCTATTCTTTATAAAGAATATAATGAAACTGATGATGGTTATTTGTTTAGAGTGGAAGACAGCTTTGTTGAAATTACTTCTCTTTCAATGGATTGTACAAAAAATAGTAAAACAATTATTGAAAAATTCAACATTTGTTCAAAATTTGAAAATGTATATCATATTACAAACATTACACAAGAGAAATCCAATAGATTTACATGTACAGATCCATTGTGCCACTATTGTAAGAATGAAAACATTCAAAACAATCTTGATTTTAAAACAACAAAGTGTACTCCAAAGTATGGTGCATCTGATTCTGAATTTTTATCAACAATTTACAATCCAAAGCTCGATGGCTCAAATAACGGTATGGAAAAGTCAGTAACTCAAGAAAAAAACATTTCAAATAATTTAAAAATTAATATATATTTAATTTTCTTTTTAATTATTTTTTTAATTAAATAAAGTTTTATTATTTTTTAAGAGTAATTATTGCTCTTTTTTCATTTGAAACACCAGAAGCTAAACGTAATTGTTGTTGACTGAAATTTTTTATTTTTTTTGGGGTAATAGGATTTCCTTTTTTATGAAGATTAATATCTTTGACTCGTGAAACATTCTTTTTAACTTTTGTTTTTTCTGTTGGTTTATCATTTGTTTTTTCACTAATTTCAATACCATCTTGACGTTCATTCATAACTTCATCTTTTTTTTTTCCTGTTTCTGTATCTTCTTCTATTTTTTTTTCTTTATCTTTTTCTTTATCTTCTTCTTGTTCTTCCTCTTCTTTTTCTTCTTCTGATACTGCAGGTGTTTCTTCTTCTTCTTCTTCCGATATTGTCGGTTTTTCTACTTCTTCTTCTTGTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCTTCCGGTAATTTATTAATTATATTTCTTTTTTTATATGAATTACGTTTGGTTTGTGCAGTAATTTCCTTACATAGAGTGCAGCTTTCAAGAAAAATTTCAATTTCTTCGTTTGTTGCATAATAACCACTGTCTTTGATATGATTAAACATTTTTGATTTTCTTAAATGCTTTCCTTCTTTAATATGAAAATTATCGAATTCTAATTCATTAAGAACAATAAGCTCCCCTAATTTAAAAAATTAGTTAAAATAAATTAAAATGAACATGTATAAAGATGGATTTTACCATTTTTTGAAATTCTAAATAACTTTTCTTCATCTCCAATCTTTTTGACTGAAAAACGATTTTTAATTGAAGTTATTGTTCTGTGAGTGTTTTGAATCGCCCATTTCTCTAAATCAGTTTGAGATAGTGTTTTATAATCTGAATTGTTATACACAACTTTTGCTCTATTAACCAAATATTTAAAGAT
TTCATCATCAACTGAATATTTTGACTTTACGATTCTTGTCCAAAAAACAATTTCTACTACTATCATTTTTTATTTATAAAATAATTTAAATACAAAAATGAATTTTTTTTTTTTTAAAAAAAAAAAAATTTGAAAAAAAAAAAAAAAAAATTTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATCAAATAAAAAGTAAAAAATAAAAACCGAAAAACATTCATTGTAATTTCAAATGTCGAGGCCGGCAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGCTTCCTCGCTCACTGACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCACTCAAAGGCGGTAATACGGTTATCCACAGAATCAGGGGATAACGCAGGAAAGAACATGTGAGCAAAAGGCCAGCAAAAGGCCAGGAACCGTAAAAAGGCCGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCTCAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCGCTTTCTCATAGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGCAGCCACTGGTAACAGGATTAGCAGAGCGAGGTATGTAGGCGGTGCTACAGAGTTCTTGAAGTGGTGGCCTAACTACGGCTACACTAGAAGGACAGTATTTGGTATCTGCGCTCTGCTGAAGCCAGTTACCTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTGGTAGCGGTGGTTTTTTTGTTTGCAAGCAGCAGATTACGCGCAGAAAAAAAGGATCTCAAGAAGATCCTTTGATCTTTTCTACGGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATCTTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGTCTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGTTGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCCCAGTGCTGCAATGATACCGCGAGACCCACGCTCACCGGCTCCAGATTTATCAGCAATAAACCAGCCAGCCGGAAGGGCCGAGCGCAGAAGTGGTCCTGCAACTTTATCCGCCTCCATCCAGTCTATTAATTGTTGCCGGGAAGCTAGAGTAAGTAGTTCGCCAGTTAATAGTTTGCGCAACGTTGTTGCCATTGCTACAGGCATCGTGGTGTCACGCTCGTCGTTTGGTATGGCTTCATTCAGCTCCGGTTCCCAACGATCAAGGCGAGTTACATGATCCCCCATGTTGTGCAAAAAAGCGGTTAGCTCCTTCGGTCCTCCGATCGTTGTCAGAAGTAAGTTGGCCGCAGTGTTATCACTCATGGTTATGGCAGCACTGCATAATTCTCTTACTGTCATGCCATCCGTAAGATGCTTTTCTGTGACTGGTGAGTACTCAACCAAGTCATTCTGAGAATAGTGTATGCGGCGACCGAGTTGCTCTTGCCCGGCGTCAATACGGGATAATACCGCGCCACATAGCAGAACTTTAAAAGTGCTCATCATTGGAAAACGTTCTTCGGGGCGAAAACTCTCAAGGATCTTACCGCTGTTGAGATCCAGTTCGATGTAACCCACTCGTGCACCCAACTGATCTTCAGCATCTTTTACTTTCACCAGCG
TTTCTGGGTGAGCAAAAACAGGAAGGCAAAATGCCGCAAAAAAGGGAATAAGGGCGACACGGAAATGTTGAATACTCATACTCTTCCTTTTTCAATATTATTGAAGCATTTATCAGGGTTATTGTCTCATGAGCGGATACATATTTGAATGTATTTAGAAAAATAAACAAATAGGGGTTCCGCGCACATTTCCCCGAAAAGTGCCACCTGACGCGCCCTGTAGCGGGATCCATTTTATTTAATATACTAAATAATAAAAAAGTTAAAAAATGATCATTGGATAAATTTTTTATAATTATAAATAAAGATAATAATTTTTTTTTTAACAAAACTAAAAATAAAAATAATAAAATAATTGTTAAAATAGGTTTTTTTTTTTTTTTTTTTTTTTTAATAAATGGTATTTATTAATTTATTTGTTGTGTGTGTTTTTTTTTTTATAATATTTTTTTTTTTAGCATTGAATTAAGAAGAAATCAAATTGATGCGGCCGCTCAGAAGAACTCGTCAAGAAGGCGATAGAAGGCGATGCGCTGCGAATCGGGAGCGGCGATACCGTAAAGCACGAGGAAGCGGTCAGCCCATTCGCCGCCAAGCTCTTCAGCAATATCACGGGTAGCCAACGCTATGTCCTGATAGCGGTCCGCCACACCCAGCCGTCCACAGTCGATGAATCCAGAAAAGCGGCCATTTTCCACCATGATATTCGGCAAGCAGGCATCGCCATGGGTCACGACGAGATCCTCGCCGTCGGGCATGCGCGCCTTGAGCCTGGCGAACAGTTCGGCTGGCGCGAGCCCCTGATGCTCTTCGTCCAGATCATCCTGATCGACAAGACCGGCTTCCATCCGAGTACGTGCTCGCTCGATGCGATGTTTCGCTTGGTGGTCGAATGGGCAGGTAGCCGGATCAAGCGTATGCAGCCGCCGCATTGCATCAGCCATGATGGATACTTTCTCGGCAGGAGCAAGGTGAGATGACAGGAGATCCTGCCCCGGCACTTCGCCCAATAGCAGCCAGTCCCTTCCCGCTTCAGTGACAACGTCGAGCACAGCTGCGCAAGGAACGCCCGTCGTGGCCAGCCACGATAGCCGCGCTGCCTCGTCCTGCAGTTCATTCAGGGCACCGGACAGGTCGGTCTTGACAAAAAGAACCGGGCGCCCCTGCGCTGACAGCCGGAACACGGCGGCATCAGAGCAGCCGATTGTCTGTTGTGCCCAGTCATAGCCGAATAGCCTCTCCACCCAAGCGGCCGGAGAACCTGCGTGCAATCCATCTTGTTCAATCATGCGAAACGATCCAGCTTGAACATCTTCACCATCCATTTTTTGCTAGCTGTGAAATTAGTTTAAAATACAAATAAAGAGTTATAATAATATACAGTTGAATAAAAAAAAAAAAAATGAATTGGAAAATTTATTTTTATATGAAGAAAAAAAAATTTTGAAAAAAAAAAAAAAATTAAAAAAAAAAAAAAAAAAAAAAAAATTTAAAATTATTCCACTGTGGGGGGCCCCAAATTTTATTTAAAAAAAAAAAAAAAATGGGTCCCTTTTGGGGGGTTGGAAAAAAAAAAAAAAAAAAAAAAAAAAATTGAAAATATAATGTTAGTCATATGATTAATCATT"
Using this string as an example, one of the sequences I need from it is
ATGGTTTAAAAAAACTTGGGTTGGTTAATTATTATTTGA
So the substrings I want (again, they can be of any length) have a defined beginning and end, but the body in between is not known: ATG.....................TGA
I've tried the following code, but the substrings it generates aren't actually part of the sequence:
def findProteins(dnaString)
  atg = ""
  tga = ""
  between_chars = ""
  sub_string_list = []
  # ATG to TGA
  for i in 0...dnaString.length
    if dnaString[i] == "A" and dnaString[i+1] == "T" and dnaString[i+2] == "G"
      atg = "ATG"
      i = i + 2
    elsif dnaString[i] == "T" and dnaString[i+1] == "G" and dnaString[i+2] == "A"
      tga = "TGA"
      sub_string_list.append(atg + between_chars + tga)
      atg = ""
      tga = ""
      between_chars = ""
    else
      between_chars += dnaString[i]
      sub_string_list.each do |substring|
        if substring[0..2] != "ATG"
          sub_string_list.delete(substring)
        end # end to each do if block
      end # end to each do block
    end # end to if elsif else block
  end # end to for loop
  tta = ""
  tag = ""
  tta_between_chars = ""
  print sub_string_list
  return sub_string_list
end
findProteins(s)
Is there a more concise way to get what I want, maybe using regex?
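To make the regex idea concrete, here is a minimal sketch of the kind of thing I have in mind. The helper name atg_to_tga and the short sample string are made up for illustration. It assumes a non-greedy match is what I want (the shortest stretch from each ATG to the next TGA), it ignores reading frame, and it only returns non-overlapping matches.

```ruby
# Hypothetical helper (name is an assumption): returns every shortest
# ATG...TGA stretch, scanning left to right without overlaps.
def atg_to_tga(dna)
  # .*? is non-greedy, so each match stops at the first TGA after the ATG.
  # Note that "." does not match a newline unless the /m flag is added.
  dna.scan(/ATG.*?TGA/)
end

sample = "CCATGGTTTGACCATGAAATGATT"   # made-up test sequence
p atg_to_tga(sample)
# => ["ATGGTTTGA", "ATGAAATGA"]
```

Because scan returns slices of the original string, every result is guaranteed to occur in the sequence, which is what my own attempt fails to do.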
Thanks!
UPDATE:
I looked at the answers given, but I realize that the substrings shouldn't start at just any index of the string. Rather, they need to start on a boundary of every 3 characters, because cells read DNA in substrings of 3 characters (i.e., codons), so the string is read as
CTC GAG ACT AGA GCT AGA TAA AAA AAA TTT TTA TTT ATT TTT ATT TAT TTT GAA TTA AAT AGA TTA CAA ATT AAT TAA TCC CAT CAA ATC TTT AAA
Thus, with the codons and their indices in the string listed, we would have
CTC:0 GAG:3 ACT:6 AGA:9 GCT:12 AGA:15 TAA:18 AAA:21 etc.
which would be the ideal format. With that being said, I think my best option is to use dnaString.scan(/.../)
to break the string up into substrings of three, split the resulting array into smaller arrays, and then join each smaller array into one whole substring. So, to keep the index origin points right for each substring, the concept of the code would look vaguely like
codon_array = dnaString.scan(/.../)
start_index = codon_array.index("ATG")
# the end point is whichever stop codon (TGA, TAA, TAG) comes first at or after the start codon
stop_offset = codon_array[start_index..-1].index { |codon| ["TGA", "TAA", "TAG"].include?(codon) }
smaller_array = codon_array[start_index..start_index + stop_offset]
substring = smaller_array.join()
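For reference, the codon:index listing described above could be produced with something like the following. This is only a sketch: the helper name codons_with_offsets is made up, and it assumes the reading frame starts at index 0 and that any trailing one or two leftover characters can be dropped.

```ruby
# Hypothetical helper (name is an assumption): pairs each codon in
# reading frame 0 with its starting index in the original string.
def codons_with_offsets(dna)
  # scan(/.../) yields consecutive 3-character chunks; the i-th chunk
  # starts at character index i * 3.
  dna.scan(/.../).each_with_index.map { |codon, i| "#{codon}:#{i * 3}" }
end

puts codons_with_offsets("CTCGAGACTAGAGCTAGATAA").join(" ")
# prints: CTC:0 GAG:3 ACT:6 AGA:9 GCT:12 AGA:15 TAA:18
```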
I think I would need a loop to get a complete list of substrings.
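A rough sketch of what such a loop might look like is below. The names STOP_CODONS and orfs_in_frame are made up for illustration. It assumes only reading frame 0 matters, that a substring runs from an ATG codon to the first stop codon after it, and that an ATG with no later stop codon is simply discarded.

```ruby
STOP_CODONS = %w[TGA TAA TAG].freeze   # assumed set of end codons

# Hypothetical helper (name is an assumption): walks the codons of
# reading frame 0 and collects [start_index, substring] pairs, each
# running from an ATG codon to the first stop codon that follows it.
def orfs_in_frame(dna)
  results = []
  start = nil                          # codon position of the open ATG, if any
  codons = dna.scan(/.../)
  codons.each_with_index do |codon, i|
    if start.nil?
      start = i if codon == "ATG"      # open a new substring
    elsif STOP_CODONS.include?(codon)
      # close the substring; start * 3 is its index in the original string
      results << [start * 3, codons[start..i].join]
      start = nil
    end
  end
  results
end

p orfs_in_frame("CTCATGGTTTAAATGCCCTGA")
# => [[3, "ATGGTTTAA"], [12, "ATGCCCTGA"]]
```

Keeping the start position as a codon index and multiplying by 3 at the end is one way to preserve the origin point of each substring in the original string.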
Thanks again. The answers are helpful, and I appreciate the opportunity to learn.