This is a good question, and has to do with differences in how alignments are scored in scikit-bio and PyCogent. By default, in scikit-bio, terminal gaps are not penalized as that can lead to some very strange alignments. This issue was briefly discussed here and illustrated here (see the last cell of the notebook).
If you want to achieve a solution more similar to the one in PyCogent, you can pass penalize_terminal_gaps=True
to global_pairwise_align_nucleotide
as follows:
alignment = nw_align_scikit(seq_1, seq_2, penalize_terminal_gaps=True)
al_1, al_2 = [alignment.get_seq(_id).__str__() for _id in alignment.ids()]
print " nw alignment using scikit:"
print " %s" % al_1
print " %s" % al_2
output:
nw alignment using scikit:
ATCG--ATCGATCG
ATCGATATCGATCG
You'll notice that the alignment is still not identical to the one you get from PyCogent, but this is a minor implementation difference: the two resulting alignments have the same score (the difference is in whether the --
aligns to the first AT
or the second AT
in the ATAT
repeat), and the two implementations make a different choice in how they break that tie.
If you go back to the alignment that you posted (the default from scikit-bio), what you'll notice is that the alignment that is returned is very good - in fact, it's the optimally scoring alignment if not penalizing the terminal gaps (by definition, because the optimal scoring alignment is what it returns). However, you're right that it's strange, because the alignment that scikit-bio returns in this specific case is probably not the most biologically relevant alignment. If you know that your sequences all start at the same position and end at the same position, you could pass penalize_terminal_gaps=True
. However, yours is a toy example and in most cases with real sequences, I think the most biologically relevant alignment would be returned with the default parameters.