2

I have two strings in a file like this:

>1
atggca---------gtgtggcaatcggcacat
>2
atggca---------gtgtggcaatcggcacat

Using the AlignIO function in Biopython:

from Bio import AlignIO
print AlignIO.read("neighbor.fas", "fasta")

returns this:

SingleLetterAlphabet() alignment with 2 rows and 33 columns
atggca---------gtgtggcaatcggcacat 1
atggca---------gtgtggcaatcggcacat 2

I want to calculate the percentage identity between the two rows in this alignment.

row = align[:,n]

allows for the extraction of individual columns that can be compared.

Columns that contain only "-" should not be counted.

Mark Amery
  • 143,130
  • 81
  • 406
  • 459
Stylize
  • 1,058
  • 5
  • 16
  • 32

3 Answers3

4

This is a fast but not biologically accurate answer.

use Levenshtein Python extension and C library.

http://code.google.com/p/pylevenshtein/

The Levenshtein Python C extension module contains functions for fast computation of - Levenshtein (edit) distance, and edit operations - string similarity - approximate median strings, and generally string averaging - string sequence and set similarity It supports both normal and Unicode strings.

Since these sequences are strings, why not!

sudo pip install python-Levenshtein

then fire up ipython:

In [1]: import Levenshtein

In [3]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
                          'atggca---------gtgtggcaatcggcacat'.replace('-','')) * 100
Out[3]: 100.0

In [4]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
                          'atggca---------gtgtggcaatcggcacaa'.replace('-','')) * 100
Out[4]: 95.83333333333334
Tim
  • 2,134
  • 3
  • 26
  • 40
  • This doesn't work. E.g. `Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''), 'atggca---------gtgtggcaatcggcacta'.replace('-','')) * 100` gives 95.834 instead of 91.667 because there are two substitutions. Seems like Levenstein distance allows transposition of adjacent characters? Unsure. – blehblehblecksheep May 28 '22 at 21:35
1

If you wanted to extend it to more than two sequences the following works well, it gives the same result as the BioPerl overall_percentage_identity function (http://search.cpan.org/dist/BioPerl/Bio/SimpleAlign.pm).

from Bio import AlignIO

align = AlignIO.read("neighbor.fas", "fasta")
print perc_identity(align)

def perc_identity(aln):
    i = 0
    for a in range(0,len(aln[0])):
        s = aln[:,a]
        if s == len(s) * s[0]:
            i += 1
    return 100*i/float(len(aln[0]))
Teri Forey
  • 49
  • 3
  • This is very cool code. I can tell you're from a Perl background (minimalist). Its the len(s) * s[0] that's slick, should be fast. – M__ Jan 25 '22 at 03:42
-1

I know this question is old but, since you are already in biopython, couldn't you just move along with the BLAST record class (chapter 7 of the tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html)?

I believe the option you need (under "7.4 The BLAST record class") is "hsp.identities".

Cat
  • 1
  • 4