how to determine/ find the longest poly-purine tract in any genome (consecutive As and Gs with no interspersed C or T, or vice versa) and this needs to be on the E. coli genome . is it to figure out the polypurine tract and then figure out the longest chain ? or is it to splice the introns and exons away from the DNA ? since E. coli's genome is 4.6 million BP long, i need some help in breaking this down ?
-
1This question appears to be off-topic because it is about biology – Codeman Aug 08 '14 at 20:44
-
where should this be posted otherwise ? – user3923728 Aug 08 '14 at 20:48
-
http://biology.stackexchange.com/ – Codeman Aug 08 '14 at 20:49
-
@Pheonixblade9, We have asked users of scikit-bio to post to Stack Overflow using the skbio tag and feel this is the most appropriate venue given the aims of the package, as well as its reliance on Pandas, Python, numpy, scipy, Cython, and soon scikit-learn. – daniel Aug 10 '14 at 00:43
-
@user3923728, That is a good question. I'm not aware of a way to convert a `NucleotideSequence` to purine/pyrimidine yet but I just posted an [issue](https://github.com/biocore/scikit-bio/issues/611) on the scikit-bio tracker about it. One of the devs will follow up with a specific answer soon. I do recommend bouncing the biological motivation for doing this to [biology.stackexchange.com](http://biology.stackexchange.com) as recommended by @Pheonixblade9 – daniel Aug 10 '14 at 00:53
2 Answers
I agree that the methodological aspects of this question are better suited for https://biology.stackexchange.com/ (i.e., should introns/exons be removed, etc), but briefly that depends entirely on the biological question that you're trying to answer. If you care about whether those stretches span intron/exon boundaries, then you should not split them first. However I'm not sure that's relevant to E. coli sequences as (as far as I know) introns and exons are specific to eukaryotes.
To address the technical aspect of this question, here's some code that illustrates how you could do this using scikit-bio. (I also posted this as a scikit-bio cookbook recipe here.)
from __future__ import print_function
import itertools
from skbio import parse_fasta, NucleotideSequence
# Define our character sets of interest. We'll define the set of purines and pyrimidines here.
purines = set('AG')
pyrimidines = set('CTU')
# Obtain a single sequence from a fasta file.
id_, seq = list(parse_fasta(open('data/single_sequence1.fasta')))[0]
n = NucleotideSequence(seq, id=id_)
# Define a ``longest_stretch`` function that takes a ``BiologicalSequence`` object and the characters of interest, and returns the length of the longest contiguous stretch of the characters of interest, as well as the start position of that stretch of characters. (And of course you could compute the end position of that stretch by summing those two values, if you were interested in getting the span.)
def longest_stretch(sequence, characters_of_interest):
# initialize some values
current_stretch_length = 0
max_stretch_length = 0
current_stretch_start_position = 0
max_stretch_start_position = -1
# this recipe was developed while reviewing this SO answer:
# http://stackoverflow.com/a/1066838/3424666
for is_stretch_of_interest, group in itertools.groupby(sequence,
key=lambda x: x in characters_of_interest):
current_stretch_length = len(list(group))
current_stretch_start_position += current_stretch_length
if is_stretch_of_interest:
if current_stretch_length > max_stretch_length:
max_stretch_length = current_stretch_length
max_stretch_start_position = current_stretch_start_position
return max_stretch_length, max_stretch_start_position
# We can apply this to find the longest stretch of purines...
longest_stretch(n, purines)
# We can apply this to find the longest stretch of pyrimidines...
longest_stretch(n, pyrimidines)
# Or the longest stretch of some other character or characters.
longest_stretch(n, set('N'))
# In this case, we try to find a stretch of a character that doesn't exist in the sequence.
longest_stretch(n, set('X'))

- 1
- 1

- 444
- 4
- 11
There is now a method in (the development version of) scikit-bio for the BiologicalSequence
class called (and subclasses) find_features
. For example
my_seq = DNASequence(some_long_string)
for run in my_seq.find_features('purine_run', min_length=10):
print run
or
my_seq = DNASequence(some_long_string)
all_runs = list(my_seq.find_features('purine_run', min_length=10))

- 31
- 2