Calculate percent identity between two already-aligned sequences

Question

I have two strings in a file like this:

>1
atggca---------gtgtggcaatcggcacat
>2
atggca---------gtgtggcaatcggcacat

Using the AlignIO function in Biopython:

from Bio import AlignIO
print AlignIO.read("neighbor.fas", "fasta")

returns this:

SingleLetterAlphabet() alignment with 2 rows and 33 columns
atggca---------gtgtggcaatcggcacat 1
atggca---------gtgtggcaatcggcacat 2

I want to calculate the percentage identity between the two rows in this alignment.

row = align[:,n]

allows for the extraction of individual columns that can be compared.

Columns that contain only "-" should not be counted.

http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script — Daenyth, Apr 28 '13 at 19:17
Did you tried to access elements of align directly (`align[row, column]`) instead of creating a 1x2 matrix for each column? — Chris Medrela, Apr 28 '13 at 19:45
BTW It looks like `row` is actually a column. And you should use `r` instead of `row` inside inner for loop. — Chris Medrela, Apr 28 '13 at 19:46
If the strings you're comparing might be shifted, then you may want to compare their [Levenshtein distances](http://en.wikipedia.org/wiki/Levenshtein_distance) - you'll probably be able to find Python code for this. — Zim-Zam O'Pootertoot, Apr 28 '13 at 22:26
IMO it's not a bioinformatics problem but only a string problem (you try to find the distance between two strings). You might take a look at this http://stackoverflow.com/questions/17388213/python-string-similarity-with-probability — Emilien, Feb 02 '16 at 13:27

Tim · Answer 1 · 2014-01-20T17:29:23.760

This is a fast but not biologically accurate answer.

use Levenshtein Python extension and C library.

http://code.google.com/p/pylevenshtein/

The Levenshtein Python C extension module contains functions for fast computation of - Levenshtein (edit) distance, and edit operations - string similarity - approximate median strings, and generally string averaging - string sequence and set similarity It supports both normal and Unicode strings.

Since these sequences are strings, why not!

sudo pip install python-Levenshtein

then fire up ipython:

In [1]: import Levenshtein

In [3]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
                          'atggca---------gtgtggcaatcggcacat'.replace('-','')) * 100
Out[3]: 100.0

In [4]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
                          'atggca---------gtgtggcaatcggcacaa'.replace('-','')) * 100
Out[4]: 95.83333333333334

This doesn't work. E.g. `Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''), 'atggca---------gtgtggcaatcggcacta'.replace('-','')) * 100` gives 95.834 instead of 91.667 because there are two substitutions. Seems like Levenstein distance allows transposition of adjacent characters? Unsure. — blehblehblecksheep, May 28 '22 at 21:35

score 1 · Answer 2 · answered Aug 25 '16 at 13:21

If you wanted to extend it to more than two sequences the following works well, it gives the same result as the BioPerl overall_percentage_identity function (http://search.cpan.org/dist/BioPerl/Bio/SimpleAlign.pm).

from Bio import AlignIO

align = AlignIO.read("neighbor.fas", "fasta")
print perc_identity(align)

def perc_identity(aln):
    i = 0
    for a in range(0,len(aln[0])):
        s = aln[:,a]
        if s == len(s) * s[0]:
            i += 1
    return 100*i/float(len(aln[0]))

This is very cool code. I can tell you're from a Perl background (minimalist). Its the len(s) * s[0] that's slick, should be fast. — M__, Jan 25 '22 at 03:42

score -1 · Answer 3 · answered Feb 02 '16 at 13:13

-1

I know this question is old but, since you are already in biopython, couldn't you just move along with the BLAST record class (chapter 7 of the tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html)?

I believe the option you need (under "7.4 The BLAST record class") is "hsp.identities".

answered Feb 02 '16 at 13:13

Cat

1
4

please provide some explanation with your answer. – Milad Faridnia Feb 02 '16 at 13:33

Calculate percent identity between two already-aligned sequences

3 Answers3