Biopython gives ValueError: Sequences must all be the same length even though sequences are of the same length

Question

I'm trying to create a phylogenetic tree by making a .phy file from my data.

I have a dataframe

ndf=

ESV trunc

1 esv1 TACGTAGGTG...

2 esv2 TACGGAGGGT...

3 esv3 TACGGGGGG...

7 esv7 TACGTAGGGT...

I checked the length of the elements of the column "trunc":

length_checker = np.vectorize(len)

arr_len = length_checker(ndf['trunc'])

The resulting arr_len gives the same length (=253) for all the elements.

I saved this dataframe as .phy file, which looks like this:

23 253

esv1 TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG

esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG

esv3 TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG

This is similar to the file used in this tutorial.

However, when I run the command aln = AlignIO.read('msa.phy', 'phylip')

I get "ValueError: Sequences must all be the same length"

I don't know why I'm getting this or how to fix it. Any help is greatly appreciated!

Thanks

M__ · Answer 1 · 2020-02-19T19:10:06.757

Generally phylip is the fiddliest format in phylogenetics between different programs. There is strict phylip format and relaxed phylip format etc ... t is not easy to know which is the separator being used, a space character and/or a carriage return.

I think that you appear to have left a space between the name of the taxon (i.e. the sequence label) and sequence name, viz.

2. esv2

Phylip format is watching for the space between the label and the sequence data. In this example the sequence would be 3bp long. The use of a "." is generally not a great idea as well. The integer doesn't appear to denote a line number.

The other issue is you could/should try keeping the sequence on the same line as the label and remove the carriage return, viz.

esv2 TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG

Sometimes a carriage return does work (this could be relaxed phylip format), the traditional format uses a space character " ". I always maintained a uniform number of spaces to preserve the alignment ... not sure if that is needed.

Note if you taxon name exceeeds 10 characters you will need relaxed phylip format and this format in any case is generally a good idea.

The final solution is all else fails is to convert to fasta, import as fasta and then convert to phylip. If all this fails ... post back there's more trouble-shooting

Fasta format removes the "23 254" header and then each sequence looks like this,

>esv2
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG

There is always a carriage return between ">esv2" and the sequence. In addition, ">" is always present to prefix the label (taxon name) without any spae. You can simply convert via reg-ex or "re" in Python. Using a perl one-liner it will be s/^([az]+[0-9]+)/>$1/g type code. I'm pretty sure they'll be an online website that will do this.

You then simply replace the "phylip" with "fasta" in your import command. Once imported you ask BioPython to convert to whatever format you want and it should not have any problem.

Thank you. I tried removing the "2." in "2. esv2", but I get the same error. I tried only removing the "." in "2. esv2", but I still get the same error. I did not do a carriage return, it just prints it that way because of the page size limits. The taxon name does not exceed 10 characters. The longest is 6 characters long. How do I convert the .phy file to fasta? — Kaumz, Feb 19 '20 at 17:23

BioGeek · Accepted Answer · 2020-02-19T21:05:09.347

First, please read the answer to How to make good reproducible pandas examples. In the future please provide a minimal reproducibl example.

Secondly, Michael G is absolutely correct that phylip is a format that is very peculiar about its syntax.

The code below will alow you to generate a phylogenetic tree from your Pandas dataframe.

First some imports and let's recreate your dataframe.

import pandas as pd
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio import AlignIO

data = {'ESV' : ['esv1', 'esv2', 'esv3'],
        'trunc': ['TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG',
                 'TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG',
                 'TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG']

}

ndf = pd.DataFrame.from_dict(data)
print(ndf)

Output:

    ESV                                              trunc
0  esv1  TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCG...
1  esv2  TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTG...
2  esv3  TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCG...

Next, write the phylip file in the correct format.

with open("test.phy", 'w') as f:
    f.write("{:10} {}\n".format(ndf.shape[0], ndf.trunc.str.len()[0]))
    for row in ndf.iterrows():
        f.write("{:10} {}\n".format(*row[1].to_list()))

Ouput of test.phy:

         3 253
esv1       TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAAACTTGAGTGCAGAAGAGGAAAGCGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
esv2       TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGTGGGTTGGTAAGTCAGTGGTGAAATCTCCGAGCTTAACTTGGAAACTGCCATTGATACTATTAATCTTGAATATTGTGGAGGTTAGCGGAATATGTCATGTAGCGGTGAAATGCTTAGAGATGACATAGAACACCAATTGCGAAGGCAGCTGGCTACACATATATTGACACTGAGGCACGAAAGCGTGGGGATCAAACAGG
esv3       TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCCAGACCAAGTCGAGTGTGAAATTGCAGGGCTTAACTTTGCAGGGTCGCTCGATACTGGTCGGCTAGAGTGTGGAAGAGGGTACTGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGCGGCGAAGGCGGGTACCTGGGCCAACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACAG

Now we can start with the creation of our phylogenetic tree.

# Read the sequences and align
aln = AlignIO.read('test.phy', 'phylip')
print(aln)

Output:

SingleLetterAlphabet() alignment with 3 rows and 253 columns
TACGTAGGTGGCGAGCGTTATCCGGAATTATTGGGCGTAAAGCG...AGG esv1
TACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGG...AGG esv2
TACGGGGGGGGCAAGCGTTGTTCGGAATTACTGGGCGTAAAGGG...CAG esv3

Calculate the distance matrix:

calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)
print(dm)

Output:

esv1    0
esv2    0.3003952569169961  0
esv3    0.6086956521739131  0.6245059288537549  0

Construct the phylogenetic tree using UPGMA algorithm and draw the tree in ascii

constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)
Phylo.draw_ascii(tree)

Output:

  ________________________________________________________________________ esv3
_|
 |                                     ___________________________________ esv2
 |____________________________________|
                                      |___________________________________ esv1

Or make a nice plot of the tree:

Phylo.draw(tree)

Output:

Thank you! This worked. It must have been a formatting issue. For other readers, the only change I had to make in the above code was: `f.write("{:10} {}\n".format(*row[1].to_list()))` changed to `f.write("{:10} {}\n".format(*row[1].tolist()))` — Kaumz, Feb 19 '20 at 18:05

Biopython gives ValueError: Sequences must all be the same length even though sequences are of the same length

2 Answers2