1

I am using the Python package ete3. I have trees such as:

((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);

I would like to see the most closely related leaf to a specific node in the tree (here the tree is Species1_order1). In the example, the most closely related leaves are Species2_order2/ Species3_order2, and Species4_order3/Species5_order5.

Code:

tree = ete3.Tree('((Species1_order1, \
                    (Species2_order2, Species3_order2)), \
                   Species4_order3, Species5_order5);')

New example :

    tree=ete3.Tree('((((((A,B),C),D),(E,F)),G),(H,I));')

The result I get is :

         A    B    C    D    E    F    G    H    I
    A  0.0  2.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
    B  2.0  0.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
    C  3.0  3.0  0.0  3.0  5.0  5.0  5.0  7.0  7.0
    D  4.0  4.0  3.0  0.0  4.0  4.0  4.0  6.0  6.0
    E  6.0  6.0  5.0  4.0  0.0  2.0  4.0  6.0  6.0
    F  6.0  6.0  5.0  4.0  2.0  0.0  4.0  6.0  6.0
    G  6.0  6.0  5.0  4.0  4.0  4.0  0.0  4.0  4.0
    H  8.0  8.0  7.0  6.0  6.0  6.0  4.0  0.0  2.0
    I  8.0  8.0  7.0  6.0  6.0  6.0  4.0  2.0  0.0

But for instance E and F have an equaly distance to A,B,C and D in the tree and in the result they appear to be clother to D.

A good matrix result should rather be :


        A   B   C   D   E   F   G   H   I
    A   0   1   2   3   4   4   5   6   6
    B   1   0   2   3   4   4   5   6   6
    C   2   2   0   3   4   4   5   6   6
    D   3   3   3   0   4   4   5   6   6
    E   4   4   4   4   0   1   5   6   6
    F   4   4   4   4   1   0   5   6   6
    G   5   5   5   5   5   5   0   6   6
    H   6   6   6   6   6   6   6   0   1
    I   6   6   6   6   6   6   6   1   0

is not it ?

Z4-tier
  • 7,287
  • 3
  • 26
  • 42
Grendel
  • 783
  • 4
  • 12
  • Post the code that shows how you are building this tree. Also the last node in your example is missing a `Species` number. – Z4-tier Feb 27 '20 at 13:33
  • I think the computed matrix is correct. Try using `tree.show()` to render the tree graphically. `D` is 4 nodes away from both `E` and `F`, 3 away from `C`, and 4 away from both `A` and `B`. – Z4-tier Mar 02 '20 at 13:16
  • Yes but from an evolutionary point of view `D` is just as far from `A,B C` and it is closer to these 3 than it is to `E` and `F` for example. I'm not trying to count then number of node that separate each leaf but only to get an evolutionary relationship matrix. – Grendel Mar 02 '20 at 13:37
  • There are several ways to define a distance metric on a phylogenetic tree. The one I gave is based on topological distance which is generally the most the most common. Looking at the example matrix, I am not sure how the distances are being computed. `ete3` does support several other types of tree (ClusterTree, EvolTree, NexmlTree, and PhyloTree) but I think for the example they will all give the same results. – Z4-tier Mar 03 '20 at 14:01

2 Answers2

2

As discussed in the comments, ete3 gives us a function called Tree.get_closest_leaf, but it's output is not what is expected (and I am not sure what this value even represents here):

>>> t=ete3.Tree('((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);')
>>> t.get_closest_leaf('Species2_order2')
(Tree node 'Species4_order3' (0x115b2f29), 0.0)

Instead, you can get the node distance like this:

import ete3
import pandas as pd

def make_matrix(tree):
    def get_root_path(node):
        root_path = [node]
        if node.up:
            root_path.extend(get_root_path(node.up))
        return root_path
    leaves = tree.get_leaves()
    leaf_ct = len(leaves)
    paths = {node.name: set(get_root_path(node)) for node in leaves}
    col_lbls = [leaf.name for leaf in leaves]
    dist_matrix = pd.np.array([pd.np.zeros(leaf_ct)] * leaf_ct)
    df = pd.DataFrame(dist_matrix, index=col_lbls, columns=col_lbls)
    for node1_name, col in df.iteritems():
        for node2_name in col.keys():
            path = paths[node2_name].symmetric_difference(paths[node1_name])
            dist = sum(node.dist for node in path)
            df.at[node1_name, node2_name] = dist
            df.at[node2_name, node1_name] = dist
    return df

Note: This is a suboptimal solution for several reasons, but this question is not asking for the most most efficient solution. see this link for much more information about phylogenetic distance matrix methods.

This solution also uses pandas which is overkill, since it is really just for the convenience of row/column labels. It would not be difficult to remove the pandas dependencies and do it with native lists instead.

Here is the output:

>>> tree=ete3.Tree('((Species1_order1, (Species2_order2, Species3_order2)), Species4_order3, Species5_order5);')
>>> make_matrix(tree)
                 Species1_order1  Species2_order2  Species3_order2  Species4_order3  Species5_order5
Species1_order1              0.0              3.0              3.0              3.0              3.0
Species2_order2              3.0              0.0              2.0              4.0              4.0
Species3_order2              3.0              2.0              0.0              4.0              4.0
Species4_order3              3.0              4.0              4.0              0.0              2.0
Species5_order5              3.0              4.0              4.0              2.0              0.0

For the updates posted, I am not seeing anything wrong. It appears to give correct results. Here is the tree as rendered by ete3 (I highlighted the 4 hops that are counted in the distance from Interest_sequence to Rhopalosiphum_maidis_Hemiptera):

and here is the matrix column for Interest_sequence that corresponds to it:

>>> m['Interest_sequence']
Rhopalosiphum_maidis__Hemiptera            4.0
Drosophila_novamexicana__Hemiptera         5.0
Drosophila_arizonae__Hemiptera             6.0
Drosophila_navojoa__Hemiptera              6.0
Interest_sequence                          0.0
Heliothis_virescens_droso_3a__nan          5.0
Mythimna_separata_droso__nan               6.0
Heliothis_virescens_droso_3i__nan          6.0
Scaptodrosophila_lebanonensis__Diptera     5.0
Mythimna_unipuncta_droso_A__nan            6.0
Xestia_c-nigrum_droso__nan                 8.0
Helicoverpa_armigera_droso__nan            8.0
Mocis_latipes_droso__nan                   7.0
Drosophila_busckii__Diptera                4.0
Drosophila_bipectinata__Diptera            5.0
Drosophila_mojavensis__Diptera             7.0
Drosophila_yakuba__Diptera                 7.0
Drosophila_hydei__Diptera                  7.0
Drosophila_serrata__Diptera                8.0
Drosophila_takahashii__Diptera             9.0
Drosophila_eugracilis__Diptera            11.0
Drosophila_ficusphila__Diptera            11.0
Drosophila_erecta__Diptera                12.0
Drosophila_melanogaster__Diptera          13.0
Sequence_A_nan__nan                       14.0
Drosophila_sechellia__Diptera             15.0
Drosophila_simulans__Diptera              15.0
Drosophila_suzukii__Diptera               12.0
Drosophila_biarmipes__Diptera             12.0
Name: Interest_sequence, dtype: float64
Z4-tier
  • 7,287
  • 3
  • 26
  • 42
  • But the closest leaf of Species2_order2 is Species3_order2 no? @Z4-tier – Grendel Feb 28 '20 at 12:51
  • @Grendel You are correct. That is strange. I'm not even sure how to interpret that output... At any rate, this isn't too hard to write from scratch. I updated this answer with a method to compute the distance matrix. – Z4-tier Feb 29 '20 at 22:01
  • I updated my question to add a new exemple, there is something wrong in the code ... – Grendel Mar 02 '20 at 13:03
  • But for instance : `Interest_sequence` is as far from `Heliothis_virescens_droso_3a__nan` as it is from `Mythimna_separata_droso__nan` and yet the matrix says it is closer to `Heliothis_virescens_droso_3a__nan` compared to `Mythimna_separata_droso__nan`. – Grendel Mar 02 '20 at 13:31
0

The ete3 function works as expected, you just need to iterate over the nodes of the trees as follows:

from ete3 import Tree t = Tree() # Creates an empty tree 
A = t.add_child(name="A") # Adds a new child to the current tree root and returns it 
B = t.add_child(name="B") # Adds a second child to the current tree root and returns it 
C = A.add_child(name="C") # Adds a new child to one of the branches D =
C.add_sister(name="D") # Adds a second child to same branch as before, but using a sister as the starting point 
R = A.add_child(name="R") # Adds a third child to the branch. Multifurcations are supported
# Next, I add 6 random leaves to the R branch names_library is an optional argument. If no names are provided, they will be generated randomly. 
R.populate(6, names_library=["r1","r2","r3","r4","r5","r6"])
for node in t.traverse():
    print(node.get_closest_leaf())

NOTE: if the internal node is a terminal node, it will have 2 closest leaves, get them with node.get_children()

Hack5
  • 3,244
  • 17
  • 37
  • 1
    Welcome to Stack Overflow! You should avoid putting `>` in your code, since it makes it hard to copy-paste. I've edited it to fix the formatting. – Hack5 Mar 22 '21 at 17:43
  • @Hack5 My bad, I had a mess formatting the code. I don't know what happened hehehe – duckworking Mar 23 '21 at 11:04